
Detecting Social Spam Campaigns on Twitter

Zi Chu & Haining Wang, The College of William & Mary; Indra Widjaja, Bell Laboratories, Alcatel-Lucent, USA. Presented by Yingjiu Li, Singapore Management University, Singapore.



Presentation Transcript


  1. Zi Chu & Haining Wang, The College of William & Mary; Indra Widjaja, Bell Laboratories, Alcatel-Lucent, USA. Presented by Yingjiu Li, Singapore Management University, Singapore

    Detecting Social Spam Campaigns on Twitter

  2. Outline: 1. Background; 2. Measurement; 3. Classification; 4. Evaluation; 5. Conclusion
  3. Outline: 1. Background; 2. Measurement; 3. Classification; 4. Evaluation; 5. Conclusion
  4. Background. Twitter: micro-blogging & social networking. [Diagram: users Alice and Bob, a tweet, and the follower / friend relationships]
  5. Background: popularity brings spam
    - Spam definition: malicious / phishing / scam content or URL
    - Social spamming is more successful because it exploits social relationships
    - Spam tweet = <text, URL>
  6. Background: spam campaign
    - A spammer runs multiple accounts to spread spam tweets for a specific purpose (e.g. propagating a spam site, selling goods)
    - [Figure: real case of an adult pill campaign with multiple accounts]
  7. Background: detecting spam is the 1st step in fighting spam
    - Tweet level: spam words, spam URLs
    - Account level: spam tweets, aggressive automation
    - Campaign level: cluster related tweets/accounts into campaigns, observe collective features (similar content, posting behavior …)
    - Efficiency: capture multiple spam accounts at one time
    - Robustness: some spamming methods can't be detected at the individual level
  8. Related Work: existing work relies solely on the URL feature
    - Group tweets into a campaign based on the shared URL. If the URL is blacklisted, the campaign is classified as spam.
    - Disadvantages:
    - Blacklists lag behind (90% of clicks happen before the URL is blacklisted)
    - Blacklists can only cover part of the spam URLs
    - False positive: the whole domain bit.ly is blacklisted, including a benign webpage such as http://bit.ly/fg7Uy
    - False negative: the URL/website is benign, but the campaign's collective behavior is spamming
  9. Background: a real spam campaign example of aggressive duplication
    - Twitter Spamming Rule: “posts duplicate content over multiple accounts”
    - Account EldoYPISILONE: Nutz this music video, SO COOL ;) http://on.fb.me/ht2wXJ?=mti0
    - Account MatthewVankomen: Amazing this music footage, you'll like ^^ http://on.fb.me/ht2wXJ?=nzky
    - Account KristaBauske2r: Amazing this music vid, Maybe u'll like it :^ http://on.fb.me/ht2wXJ?=mtcz
  10. Contribution
    - Improve on existing work that detects spam from the URL alone
    - Introduce new features
    - Design an automatic detection system using machine learning
  11. Outline: 1. Background; 2. Measurement; 3. Classification; 4. Evaluation; 5. Conclusion
  12. Data Collection
    - Twitter Streaming API, Spritzer: uniform sampling, 1% of real-time global tweets
    - Dataset: 50 million tweets, Feb. – Apr. 2011
    - Only tweets with URLs are checked, 8 million (tweets without URLs are assumed to be non-spam)
  13. Clustering Algorithm
    - URL redirection, tweet = <content, URL>: original URL => final landing URL
    - Example: http://ow.ly/5UbUS ==> ... ==> http://www.people.com/people/.../020515101,00.html
    - Cluster tweets with the same final URL into a campaign
    - Campaign = <shared URL, tweet set, account set>
    - Campaign_1 <shared_URL_1, {tweet_1, tweet_2, tweet_3}, {account_1, account_2}>
    - Campaign_2 <shared_URL_2, {tweet_4, tweet_5}, {account_1, account_3}>
    - Campaign_3 <shared_URL_3, {tweet_1, tweet_6}, {account_1, account_4}>
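The clustering step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `resolve` callback and the `REDIRECTS` table are hypothetical stand-ins for actually following HTTP redirects, and the example URLs are made up.

```python
from collections import defaultdict

def cluster_by_final_url(tweets, resolve=lambda url: url):
    """Group tweets that share the same final landing URL into campaigns.

    tweets: iterable of (tweet_id, account, url) tuples.
    resolve: maps an original URL to its final landing URL (assumption:
             a real system would follow redirects over the network).
    """
    campaigns = defaultdict(lambda: {"tweets": set(), "accounts": set()})
    for tweet_id, account, url in tweets:
        final_url = resolve(url)
        campaigns[final_url]["tweets"].add(tweet_id)
        campaigns[final_url]["accounts"].add(account)
    return dict(campaigns)

# Hypothetical redirect table standing in for real redirect resolution.
REDIRECTS = {
    "http://short.example/a": "http://site.example/page",
    "http://short.example/b": "http://site.example/page",
}

tweets = [
    ("tweet_1", "account_1", "http://short.example/a"),
    ("tweet_2", "account_2", "http://short.example/b"),
    ("tweet_3", "account_1", "http://other.example/"),
]
campaigns = cluster_by_final_url(tweets, lambda u: REDIRECTS.get(u, u))
for shared_url, members in campaigns.items():
    print(shared_url, sorted(members["tweets"]), sorted(members["accounts"]))
```

Keying the dictionary on the final URL means two different shortened links that land on the same page fall into one campaign, which is the point of resolving redirects first.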
  14. Ground Truth Creation
    - Blacklists: Google's SafeBrowsing, PhishTank, URIBL, SURBL, Spamhaus (if the URL is blacklisted, the campaign is labeled as spam)
    - Manual check: content of tweets, accounts (e.g. tweets posted by them outside the campaign) … Do they violate the official Twitter Rules on spam and abuse?
    - Ground truth set: 580 legitimate campaigns, 744 spam campaigns
  15. Twitter Rules
  16. Data Analysis
    - Master URL: http://biy.ly/5As4k3
    - Affiliate URL <==> spam account
    - Account_1: http://biy.ly/5As4k3?=xd56
    - Account_2: http://biy.ly/5As4k3?=f2kk
    - Master URL Diversity Ratio = unique_master_URL_no / tweet_no
    - High ratio ==> account independence
    - Low ratio ==> account dependence
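As a rough sketch of the ratio on this slide, the snippet below assumes that an affiliate URL differs from its master URL only in the query string, as in the two examples shown. The function names are illustrative and do not come from the presentation.

```python
from urllib.parse import urlsplit, urlunsplit

def master_url(url):
    """Strip the query string and fragment (assumed to hold the affiliate ID)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def master_url_diversity_ratio(urls):
    """Number of unique master URLs divided by the number of tweets."""
    if not urls:
        raise ValueError("a campaign needs at least one tweet URL")
    return len({master_url(u) for u in urls}) / len(urls)

campaign_urls = [
    "http://biy.ly/5As4k3?=xd56",  # Account_1
    "http://biy.ly/5As4k3?=f2kk",  # Account_2
]
print(master_url(campaign_urls[0]))
print(master_url_diversity_ratio(campaign_urls))
```

Both affiliate links collapse to one master URL, so the ratio for this two-tweet campaign is low, which the slide reads as a sign of account dependence.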
  17. Data Analysis
  18. Data Analysis. Burstiness: overall workload distribution of a campaign
  19. Outline: 1. Background; 2. Measurement; 3. Classification; 4. Evaluation; 5. Conclusion
  20. Classification: automatic classification framework, binary-class classification
  21. Feature Extractor: tweet-level features
    - Tweet = <textual content, URL>
    - Text contains spam words?
    - URL is redirected?
    - URL is blacklisted?
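The three tweet-level questions can be expressed as boolean features. The sketch below is only an illustration: `SPAM_WORDS` and `BLACKLIST` are hypothetical placeholder sets, and real blacklists are typically queried per domain or through a lookup service rather than by exact URL match.

```python
import re

def tweet_features(text, original_url, final_url, spam_words, blacklist):
    """Return the three tweet-level features as booleans."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {
        "has_spam_word": bool(words & spam_words),
        "is_redirected": original_url != final_url,
        # Simplification: exact match against a local set of URLs.
        "is_blacklisted": final_url in blacklist,
    }

# Hypothetical inputs for illustration only.
SPAM_WORDS = {"pills", "free"}
BLACKLIST = {"http://spam.example/buy"}

features = tweet_features(
    "Cheap pills, click now!",
    "http://short.example/x",
    "http://spam.example/buy",
    SPAM_WORDS,
    BLACKLIST,
)
print(features)
```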
  22. Feature Extractor: account-level features
    - Account = <tweets, friends/followers, account properties>
    - Lifetime tweet count
    - Account registration date
    - Account protected? Verified?
    - Friend_count, follower_count, ratio
    - Account reputation = follower_count / (follower_count + friend_count)
    - Account taste = avg(account reputation of each of the account's friends)
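The reputation and taste formulas translate directly into code. In this minimal sketch, returning 0.0 for an account with no followers and no friends (or with no friends at all, for taste) is an assumption; the slide does not say how that case is handled.

```python
def reputation(follower_count, friend_count):
    """follower_count / (follower_count + friend_count)."""
    total = follower_count + friend_count
    # Assumption: an account with no connections gets reputation 0.0.
    return follower_count / total if total else 0.0

def taste(friends):
    """Average reputation over an account's friends.

    friends: list of (follower_count, friend_count) pairs, one per friend.
    """
    if not friends:
        return 0.0
    return sum(reputation(fo, fr) for fo, fr in friends) / len(friends)

print(reputation(900, 100))                 # many followers, few friends
print(taste([(900, 100), (100, 900)]))      # one reputable friend, one not
```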
  23. Feature Extractor: campaign-level features
    - Campaign = ({tweets}, {accounts}, shared_URL)
    - Account Diversity Ratio = account_no / tweet_no
    - Entropy of inter-arrival timing
    - Lower: regular behavior ==> coordination
    - Higher: irregular behavior ==> independent participation
    - Corrected Conditional Entropy (CCE)
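To make the entropy intuition concrete, the sketch below computes the account diversity ratio and a plain Shannon entropy over binned inter-arrival gaps. Note that this is a simplified stand-in and not the Corrected Conditional Entropy named on the slide; the 60-second bin width is an arbitrary assumption.

```python
import math
from collections import Counter

def account_diversity_ratio(tweet_accounts):
    """Number of distinct accounts divided by the number of tweets."""
    return len(set(tweet_accounts)) / len(tweet_accounts)

def interarrival_entropy(timestamps, bin_seconds=60):
    """Shannon entropy (bits) of binned gaps between consecutive tweets."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if not gaps:
        return 0.0
    bins = Counter(int(g // bin_seconds) for g in gaps)
    n = len(gaps)
    return -sum((c / n) * math.log2(c / n) for c in bins.values())

regular = [0, 600, 1200, 1800, 2400]   # one tweet every 10 minutes
irregular = [0, 30, 700, 800, 5000]    # no fixed rhythm

print(interarrival_entropy(regular))
print(interarrival_entropy(irregular))
print(account_diversity_ratio(["a", "a", "b", "c"]))
```

The perfectly periodic schedule puts every gap in the same bin and scores zero entropy, while the irregular one spreads gaps over several bins and scores higher, matching the lower/higher reading on the slide.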
  24. Feature Extractor: content self-similarity
    - {Tweets} => sense clusters
    - Cluster_1: "this music video so cool", "amazing this music footage you'll like", "this music video hope u like"
    - Cluster_2: "How to Consolidate Credit Card Debt", "Consolidate Credit Cards Now to Become Debt Free Later", "Three Effective Ways to Consolidate Credit Card Debt Without Using Intermediaries"
  25. Feature Extractor: SenseClusters, cluster messages based on contextual similarity
    - Vector space model: text ==> vector
    - Msg_1: He visited Russia in 1996.
    - Msg_2: In 1996 he went to Russia.
    - …
    - Vocabulary = {in, he, Russia, to, visited, went, 1996, …}
    - Occurrence matrix
    - Weight: TF-IDF (Term Frequency – Inverse Document Frequency)
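The vector space step can be illustrated with the two messages from the slide. This is a minimal sketch using one common TF-IDF variant (raw term count times log(N / document frequency)); it is not SenseClusters itself, and the tokenizer is an assumption.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into word/number tokens (simplified tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def tfidf_vectors(messages):
    """Return (vocabulary, one TF-IDF vector per message)."""
    docs = [Counter(tokenize(m)) for m in messages]
    vocab = sorted(set().union(*docs))
    n = len(docs)
    # Document frequency: in how many messages each term occurs.
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vectors = [[d[t] * math.log(n / df[t]) for t in vocab] for d in docs]
    return vocab, vectors

messages = ["He visited Russia in 1996.", "In 1996 he went to Russia."]
vocab, vectors = tfidf_vectors(messages)
print(vocab)
for vec in vectors:
    print([round(w, 3) for w in vec])
```

With only two messages, every term that appears in both gets weight zero, so the vectors differ only on "visited", "went" and "to". This is why first-order word overlap alone can miss that the two sentences mean the same thing.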
  26. Feature Extractor
    - Latent Semantic Analysis, rank lowering
    - 2nd-order similarity (1st-order similarity)
    - “Score” => a number that expresses the accomplishment of a team in a game
    - “Goal” => a successful attempt at scoring
    - Cosine similarity measure: cos 0° = 1, same; cos 90° = 0, orthogonal
    - cos_sim > threshold ==> the same sense cluster
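The cosine measure and the threshold test might look like the sketch below. The threshold value 0.8 is a made-up placeholder, since the slide does not give one, and the Latent Semantic Analysis step (rank lowering) is left out here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

def same_sense_cluster(u, v, threshold=0.8):
    """Hypothetical threshold rule: similar enough => same sense cluster."""
    return cosine_similarity(u, v) > threshold

print(cosine_similarity([1, 2], [2, 4]))   # same direction
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal
print(same_sense_cluster([1, 2, 0], [1, 2, 0.1]))
```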
  27. Feature Extractor: {Tweets} => K sense clusters (on the fly)
  28. Decision Maker: Random Forest
    - Ensemble classifier that consists of many decision trees
    - Construction of each tree: calculate the best split based on m (<< M) features in the training set
    - Prediction: a new sample is pushed down each tree and is assigned the label of the leaf node it ends up in
    - Final decision: majority voting of all trees
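The voting and feature-subsampling ideas can be shown without a full tree learner. In the sketch below each "tree" is a hypothetical hand-written one-split rule rather than a trained decision tree, and the feature names and cut-offs are invented for illustration; the presentation's own system was evaluated in Weka.

```python
import random
from collections import Counter

def sample_feature_subset(num_features, m, rng):
    """Pick m of the M feature indices to consider for a split (m << M)."""
    return sorted(rng.sample(range(num_features), m))

def forest_predict(trees, sample):
    """Push the sample down every tree and return the majority label."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical one-split "trees"; thresholds are illustrative, not learned.
trees = [
    lambda s: "spam" if s["account_diversity_ratio"] < 0.3 else "legitimate",
    lambda s: "spam" if s["interarrival_entropy"] < 1.0 else "legitimate",
    lambda s: "spam" if s["url_blacklisted"] else "legitimate",
]

sample = {
    "account_diversity_ratio": 0.1,
    "interarrival_entropy": 0.4,
    "url_blacklisted": False,
}
print(forest_predict(trees, sample))
print(sample_feature_subset(num_features=10, m=3, rng=random.Random(0)))
```

Using an odd number of trees avoids ties in a binary vote.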
  29. Outline: 1. Background; 2. Measurement; 3. Classification; 4. Evaluation; 5. Conclusion
  30. Evaluation
    - Tool: Weka
    - Try each classifier on the ground truth set with 10-fold cross-validation
    - Result: high accuracy, low FPR (legitimate => spam), medium FNR (spam => legitimate)
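For reference, the three reported metrics follow from the confusion matrix when spam is treated as the positive class. The labels in this sketch are toy data chosen for illustration, not results from the presentation.

```python
def evaluate(true_labels, predicted_labels, positive="spam"):
    """Accuracy, false positive rate and false negative rate."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1  # spam => legitimate
        else:
            if pred == positive:
                fp += 1  # legitimate => spam
            else:
                tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
    }

# Toy labels for illustration only.
truth = ["spam", "spam", "legitimate", "legitimate"]
predicted = ["spam", "legitimate", "legitimate", "legitimate"]
metrics = evaluate(truth, predicted)
print(metrics)
```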
  31. Evaluation: evaluate the importance of every feature with a Decision Tree, using only one feature for classification each time
  32. Outline: 1. Background; 2. Measurement; 3. Classification; 4. Evaluation; 5. Conclusion
  33. Conclusion
    - Large measurement on Twitter
    - Formulation of new features
    - Automatic classification system
    - Overall accuracy 94.5%
  34. Questions
  35. Thank You!