COMBINING TEXT, SENTIMENT AND METADATA FOR IMPROVING SPAM DETECTION

Spam comments on YouTube videos can negatively impact both the platform and its users: they undermine the user experience, prevent meaningful conversation, lower engagement, spread misinformation, and create cybersecurity threats, among other harms (Xiao & Liang, 2024).

Nowadays, machine learning models can be trained on YouTube video comments to efficiently classify them as legitimate comments (ham) or spam. Early studies relied on text mining for detection. For instance, Sureka proposed identifying spam activity by analyzing repeated or irrelevant comments across videos; while effective, the method's limited dataset raised scalability concerns (Sureka, 2010).
Subsequent research applied machine learning algorithms. For example, TubeSpam used decision trees, logistic regression, random forests, and support vector machines to classify spam, achieving high accuracy across models (Alberto et al., 2015). Similarly, n-gram models were employed to improve classification accuracy by analyzing word patterns, though these studies focused primarily on textual content and did not consider metadata (Aiyar & Shetty, 2018).

Research question
How can combining textual content, sentiment, and metadata create a more accurate classifier for identifying spam comments?
Methodology
To investigate whether a better-performing spam identifier could be built from our dataset, we used Python's machine learning libraries and loaded a YouTube comments dataset that included labels for relevance, polarity, likes, replies, and more. We cleaned the data by replacing invalid values with neutral substitutes. To prepare the data for the learning algorithms, we scaled the numerical features (like and reply counts) with standard scaling, converted the categorical polarity feature into binary vectors through one-hot encoding, and used TF-IDF to transform the comment text into numerical vectors, which also captured term importance.
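A minimal sketch of this preprocessing in scikit-learn follows. The column names (comment, polarity, likes, replies) and the file name are assumptions for illustration, since the exact schema of our dataset is not reproduced here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical file name; the actual dataset is not distributed with this paper.
df = pd.read_csv("youtube_comments.csv")

# Replace invalid values with neutral substitutes, as described above.
df["likes"] = pd.to_numeric(df["likes"], errors="coerce").fillna(0)
df["replies"] = pd.to_numeric(df["replies"], errors="coerce").fillna(0)
df["polarity"] = df["polarity"].fillna("neutral")

# Combined feature space: TF-IDF text vectors, standard-scaled metadata,
# and one-hot encoded sentiment.
combined = ColumnTransformer([
    ("text", TfidfVectorizer(), "comment"),            # text -> sparse TF-IDF
    ("meta", StandardScaler(), ["likes", "replies"]),  # numeric metadata
    ("sentiment", OneHotEncoder(handle_unknown="ignore"), ["polarity"]),
])
```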
We divided our dataset into training and test subsets with an 80-20 ratio, stratifying by relevance to ensure balanced class distribution.
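Under the same assumptions, the stratified split is a one-liner with scikit-learn's train_test_split; the relevance column (assumed here to be encoded 1 = spam, 0 = ham) serves as both the target and the stratification key.

```python
from sklearn.model_selection import train_test_split

# 80-20 split, stratified on the relevance (ham/spam) label so both
# subsets keep the same class proportions.
X = df[["comment", "likes", "replies", "polarity"]]
y = df["relevance"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```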
We built three models (Random Forest, Logistic Regression, and Naive Bayes), trained each on the 'text-only features' set and the 'combined features' set separately, and evaluated their performance, as sketched below. Using a tree-based model (RF), a linear model (LR), and a probabilistic model (NB) covers a wide range of model complexity. Comparing the performance metrics for each set lets us measure the 'combined-feature' models against the 'text-only' ones.
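Continuing the sketch above, the comparison might be run as follows. The combined transformer comes from the preprocessing sketch; because MultinomialNB rejects negative inputs, its combined pipeline is assumed to rescale the metadata to [0, 1] rather than standardize it.

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score

# Text-only baseline: TF-IDF vectors alone.
text_only = ColumnTransformer([("text", TfidfVectorizer(), "comment")])

# Non-negative variant of the combined features for MultinomialNB.
combined_nonneg = ColumnTransformer([
    ("text", TfidfVectorizer(), "comment"),
    ("meta", MinMaxScaler(), ["likes", "replies"]),
    ("sentiment", OneHotEncoder(handle_unknown="ignore"), ["polarity"]),
])

experiments = [
    ("RandomForest", RandomForestClassifier(random_state=42), combined),
    ("LogisticRegression", LogisticRegression(max_iter=1000), combined),
    ("NaiveBayes", MultinomialNB(), combined_nonneg),
]

for name, clf, combined_features in experiments:
    for label, features in [("text-only", text_only),
                            ("combined", combined_features)]:
        pipe = Pipeline([("features", features), ("clf", clf)])
        pipe.fit(X_train, y_train)
        # F1 with spam as the positive class (relevance == 1).
        f1 = f1_score(y_test, pipe.predict(X_test))
        print(f"{name} ({label}): F1 = {f1:.3f}")
```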

Results
The results show that the combined model performed better on average for all three algorithms. In terms of peak performance, the combined RandomForest and NaiveBayes models were roughly tied; the best F1 score, 0.854, was achieved by the combined Naive Bayes model.
The analysis in Orange showed similar results.
For all three algorithms, performance improved when the text data was combined with sentiment data and the numbers of likes and replies.
• Figure 2 (Polarity distribution features) shows that the ham/spam distribution is not the same for the different polarity (sentiment) categories.
• Figure 3 (Relevance count) shows that most comments in the dataset are spam.
"The result show that the combined model performed better on average on all three algorithms."

Conclusions
We combined textual content, sentiment, and metadata to build a more accurate classifier for detecting spam in YouTube video comments. The results showed an improvement across all three algorithms when the text data was combined with metadata.
• RandomForest was the best performer in terms of balanced precision-recall and F1 scores.
• LogisticRegression performed robustly but benefited less from feature combination than the other models.
• Focusing on a single-video dataset, a consequence of the scarcity of reliable datasets, may limit the generalizability of our study.
• Future work may include more features, such as the comment history across different videos.