Detecting Social Spam Campaigns on Twitter

Zi Chu & Haining Wang, The College of William & Mary
Indra Widjaja, Bell Laboratories, Alcatel-Lucent, USA
Presented by Yingjiu Li, Singapore Management University, Singapore

Outline
1. Background
2. Measurement
3. Classification
4. Evaluation
5. Conclusion

Background
Twitter: micro-blogging & social networking
[Figure: Twitter relationship diagram with the labels Tweet, Bob, Alice, Follower, Friend]

Background
Popularity brings spam
- Spam definition: malicious, phishing, or scam content or URL
- Social spamming is more successful because it exploits social relationships
- Spam tweet = <text, URL>

Background
Spam campaign
- A spammer runs multiple accounts to spread spam tweets for a specific purpose (e.g., promoting a spam site, selling goods)
- Real case: an adult pill campaign run across multiple accounts

Background
Detecting spam is the first step in fighting spam
- Tweet level: spam words, spam URLs
- Account level: spam tweets, aggressive automation
- Campaign level: cluster related tweets/accounts into campaigns, observe collective features (similar content, posting behavior, …)
Efficiency
- Capture multiple spam accounts at one time
Robustness
- Some spamming methods can't be detected at the individual level

Related Work
Existing work relies solely on the URL feature
- Group tweets into a campaign based on the shared URL. If the URL is blacklisted, the campaign is classified as spam.
Disadvantages
- Blacklists lag behind: 90% of clicks happen before the URL is blacklisted
- Blacklists cover only part of the spam URLs
- False positives: when a whole domain such as bit.ly is blacklisted, a benign page such as http://bit.ly/fg7Uy is flagged too
- False negatives: the URL/website is benign, but the campaign's collective behavior is spamming

Background
A real spam campaign example of aggressive duplication
Twitter Spamming Rule: “posts duplicate content over multiple accounts”
- Account EldoYPISILONE: Nutz this music video, SO COOL ;) http://on.fb.me/ht2wXJ?=mti0
- Account MatthewVankomen: Amazing this music footage, you'll like ^^ http://on.fb.me/ht2wXJ?=nzky
- Account KristaBauske2r: Amazing this music vid, Maybe u'll like it :^ http://on.fb.me/ht2wXJ?=mtcz

Contribution
- Improve on existing work that detects campaigns from the URL alone
- Introduce new features
- Design an automatic detection system using machine learning

Data Collection
Twitter Streaming API
- Spritzer
- Uniform sampling, 1% of real-time global tweets
Dataset: 50 million tweets
- Feb. – Apr. 2011
- Only tweets with URLs are checked: 8 million (tweets without URLs are assumed to be non-spam)

Clustering Algorithm
URL redirection, tweet = <content, URL>
- original URL => final landing URL
- http://ow.ly/5UbUS ==> ... ==> http://www.people.com/people/.../020515101,00.html
Cluster tweets with the same final URL into a campaign
Campaign = <shared URL, tweet set, account set>
- Campaign_1 <shared_URL_1, {tweet_1, tweet_2, tweet_3}, {account_1, account_2}>
- Campaign_2 <shared_URL_2, {tweet_4, tweet_5}, {account_1, account_3}>
- Campaign_3 <shared_URL_3, {tweet_1, tweet_6}, {account_1, account_4}>
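The clustering step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `resolve` callback, the `REDIRECTS` table, and the example URLs are hypothetical stand-ins for real HTTP redirect following.

```python
from collections import defaultdict

def cluster_by_final_url(tweets, resolve=lambda url: url):
    """Group tweets that share the same final landing URL into campaigns.

    tweets:  iterable of (tweet_id, account_id, url) tuples
    resolve: hypothetical callback mapping an original URL to its final URL
    """
    campaigns = defaultdict(lambda: {"tweets": set(), "accounts": set()})
    for tweet_id, account_id, url in tweets:
        final_url = resolve(url)
        campaigns[final_url]["tweets"].add(tweet_id)
        campaigns[final_url]["accounts"].add(account_id)
    return dict(campaigns)

# Toy redirect table standing in for real redirect resolution (made-up data).
REDIRECTS = {
    "http://short.example/a": "http://site.example/page",
    "http://short.example/b": "http://site.example/page",
}

tweets = [
    ("tweet_1", "account_1", "http://short.example/a"),
    ("tweet_2", "account_2", "http://short.example/b"),
    ("tweet_3", "account_1", "http://other.example/"),
]
campaigns = cluster_by_final_url(tweets, lambda u: REDIRECTS.get(u, u))
```

Keying the dictionary on the final URL means two different shortened links that land on the same page end up in one campaign, which is the point of resolving redirects first.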

Ground Truth Creation
- Blacklists: Google's SafeBrowsing, PhishTank, URIBL, SURBL, Spamhaus (if the URL is blacklisted, the campaign is labeled as spam)
- Manual check: content of tweets, accounts (e.g., tweets posted by them outside the campaign), … Do they violate the official Twitter Rules on spam and abuse?
Ground truth set
- 580 legitimate campaigns
- 744 spam campaigns

Twitter Rules

Data Analysis
Master URL: http://biy.ly/5As4k3
Affiliate URL <==> spam account
- Account_1, http://biy.ly/5As4k3?=xd56
- Account_2, http://biy.ly/5As4k3?=f2kk
Master URL Diversity Ratio = unique_master_URL_no / tweet_no
- High ratio ==> account independence
- Low ratio ==> account dependence
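As a rough sketch of the ratio above, the following Python assumes that the affiliate ID always sits in the query string, as in the two example URLs on this slide, so stripping the query recovers the master URL. The function names are illustrative, and real campaigns may encode affiliate IDs differently.

```python
from urllib.parse import urlsplit, urlunsplit

def master_url(url):
    """Drop the query string and fragment, leaving the assumed master URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def master_url_diversity_ratio(urls):
    """Number of unique master URLs divided by number of tweets.

    urls: one URL per tweet in the campaign.
    """
    if not urls:
        return 0.0
    return len({master_url(u) for u in urls}) / len(urls)

campaign_urls = [
    "http://biy.ly/5As4k3?=xd56",  # Account_1's affiliate URL
    "http://biy.ly/5As4k3?=f2kk",  # Account_2's affiliate URL
]
ratio = master_url_diversity_ratio(campaign_urls)
```

Here both affiliate URLs collapse to one master URL, so the ratio is 1/2; a campaign where every tweet points at the same master URL drives the ratio toward zero as tweets accumulate.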

Data Analysis

Data Analysis
Burstiness
- overall workload distribution of a campaign

Classification
- Automatic classification framework
- Binary-class classification

Feature Extractor
Tweet-level Features
- Tweet = <textual content, URL>
- Text contains spam words?
- URL is redirected?
- URL is blacklisted?

Feature Extractor
Account-level Features
Account = <tweets, friends/followers, account properties>
- Lifetime tweet count
- Account registration date
- Account protected? Verified?
- Friend_count, follower_count, ratio
- Account reputation = follower_count / (follower_count + friend_count)
- Account taste = avg(account reputation of each of the account's friends)
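The two derived account features can be written directly from the formulas above. This is a minimal sketch; returning 0.0 for an account with no followers and no friends (or no friends to average over) is an assumption of this example, not something the slides specify.

```python
def reputation(follower_count, friend_count):
    """Account reputation = follower_count / (follower_count + friend_count)."""
    total = follower_count + friend_count
    # Assumption: an account with no connections gets reputation 0.0.
    return follower_count / total if total else 0.0

def taste(friends):
    """Account taste = average reputation of the account's friends.

    friends: list of (follower_count, friend_count) pairs, one per friend.
    """
    if not friends:
        return 0.0
    return sum(reputation(fo, fr) for fo, fr in friends) / len(friends)

# One well-followed friend and one friend that mostly follows others.
example_taste = taste([(900, 100), (10, 90)])
```

An account that follows mostly low-reputation accounts gets a low taste score, which is the signal this feature is meant to capture.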

Feature Extractor
Campaign-level Features
Campaign = ({tweets}, {accounts}, shared_URL)
- Account Diversity Ratio = account_no / tweet_no
- Entropy of inter-arrival timing
- Lower: regular behavior ==> coordination
- Higher: irregular behavior ==> independent participation
- Corrected Conditional Entropy (CCE)
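To make the two campaign-level features concrete, here is a hedged Python sketch. Note the substitution: the slides name Corrected Conditional Entropy, whereas this example computes plain Shannon entropy over binned inter-arrival gaps, which is simpler but shows the same "regular timing gives low entropy" intuition. The `bin_width` of 60 seconds is an arbitrary choice for illustration.

```python
import math
from collections import Counter

def account_diversity_ratio(account_ids):
    """account_no / tweet_no, given the posting account of each tweet."""
    if not account_ids:
        return 0.0
    return len(set(account_ids)) / len(account_ids)

def interarrival_entropy(timestamps, bin_width=60.0):
    """Shannon entropy (in bits) of binned gaps between consecutive tweets.

    This is NOT the Corrected Conditional Entropy named on the slide;
    it is a simpler stand-in used only to illustrate the idea.
    """
    times = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if not gaps:
        return 0.0
    counts = Counter(int(gap // bin_width) for gap in gaps)
    n = len(gaps)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

regular = interarrival_entropy([0, 600, 1200, 1800, 2400])   # fixed 10-minute gaps
irregular = interarrival_entropy([0, 30, 400, 5000, 5100])   # scattered gaps
```

Tweets posted at a fixed interval all fall in one bin and yield zero entropy, while scattered posting times spread across bins and score higher.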

Feature Extractor
Content self-similarity
{Tweets} => sense clusters
Cluster_1)
- this music video so cool
- amazing this music footage you'll like
- this music video hope u like
Cluster_2)
- How to Consolidate Credit Card Debt
- Consolidate Credit Cards Now to Become Debt Free Later
- Three Effective Ways to Consolidate Credit Card Debt Without Using Intermediaries

Feature Extractor
SenseClusters: cluster messages based on contextual similarity
Vector space model: text ==> vector
- Msg_1, He visited Russia in 1996.
- Msg_2, In 1996 he went to Russia.
- …
Vocabulary = {in, he, Russia, to, visited, went, 1996, …}
Occurrence matrix
Weight: TF-IDF (Term Frequency – Inverse Document Frequency)
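The vector space step can be illustrated with a small pure-Python TF-IDF sketch using the two messages from the slide. The tokenizer and the smoothed IDF formula (`log((1 + n) / (1 + df)) + 1`) are assumptions of this example; SenseClusters itself may tokenize and weight terms differently.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]

def tfidf_vectors(messages):
    """Return (vocabulary, one TF-IDF vector per message)."""
    docs = [Counter(tokenize(m)) for m in messages]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    doc_freq = {w: sum(1 for d in docs if w in d) for w in vocab}
    # Assumed smoothed IDF so terms shared by all messages keep a nonzero weight.
    idf = {w: math.log((1 + n) / (1 + doc_freq[w])) + 1 for w in vocab}
    vectors = [[d[w] * idf[w] for w in vocab] for d in docs]
    return vocab, vectors

vocab, vectors = tfidf_vectors([
    "He visited Russia in 1996.",
    "In 1996 he went to Russia.",
])
```

Each row of `vectors` is one row of the weighted occurrence matrix: terms unique to a message ("visited", "went") get a higher weight than terms both messages share ("he", "russia").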

Feature Extractor
Latent Semantic Analysis, rank lowering
2nd-order similarity (1st-order similarity)
- “Score” => a number that expresses the accomplishment of a team in a game
- “Goal” => a successful attempt at scoring
Cosine similarity measure
- cos 0° = 1, same
- cos 90° = 0, orthogonal
- cos_sim > threshold ==> the same sense cluster
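The cosine measure and the threshold rule can be sketched as follows. This is an illustration only: the greedy "compare against the first member of each cluster" grouping and the default threshold of 0.5 are assumptions of this example, and the Latent Semantic Analysis step is omitted entirely.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def threshold_clusters(vectors, threshold=0.5):
    """Greedy grouping: a vector joins the first cluster whose seed
    (first member) it resembles above the threshold, else starts a new one."""
    clusters = []  # each cluster is a list of indices into `vectors`
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vectors[cluster[0]], vec) > threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

groups = threshold_clusters([[1, 0], [0.9, 0.1], [0, 1]])
```

The first two vectors point in nearly the same direction and share a cluster, while the third is orthogonal to the seed and starts its own.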

Feature Extractor
{Tweets} => K sense clusters (on the fly)

Decision Maker
Random Forest
- Ensemble classifier that consists of many decision trees
- Construction of each tree: calculate the best split based on m (<< M) features in the training set
- Prediction: a new sample is pushed down each tree and is assigned the label of the leaf node it ends up in
- Final decision: majority voting of all trees
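The final voting step can be illustrated in isolation. The sketch below covers only majority voting, not tree construction, and the three hand-written "trees" with their feature names and thresholds are made up for the example; they are not learned from data and do not come from the paper.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Each tree votes for a label; the most common label wins."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical one-split "trees" with invented thresholds, for illustration only.
trees = [
    lambda s: "spam" if s["account_diversity_ratio"] < 0.2 else "legitimate",
    lambda s: "spam" if s["url_blacklisted"] else "legitimate",
    lambda s: "spam" if s["timing_entropy"] < 1.0 else "legitimate",
]

sample = {
    "account_diversity_ratio": 0.1,
    "url_blacklisted": False,
    "timing_entropy": 0.5,
}
label = forest_predict(trees, sample)
```

Two of the three trees vote "spam" here, so the forest's decision is "spam" even though the blacklist-based tree disagrees. Using an odd number of trees avoids ties in the binary case.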

Evaluation
Weka
- Try each classifier with the ground truth set, 10-fold cross-validation
- High accuracy, low FPR (legitimate => spam), medium FNR (spam => legitimate)
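For reference, the three reported metrics follow from the confusion matrix as sketched below, treating "spam" as the positive class. The label strings and the toy predictions are invented for illustration and are not results from the talk.

```python
def evaluate(y_true, y_pred, positive="spam"):
    """Compute accuracy, false positive rate, and false negative rate."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # legitimate campaign classified as spam
        else:
            if truth == positive:
                fn += 1  # spam campaign classified as legitimate
            else:
                tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
    }

# Toy labels: 4 spam and 4 legitimate campaigns, one mistake in each class.
y_true = ["spam"] * 4 + ["legitimate"] * 4
y_pred = ["spam", "spam", "spam", "legitimate",
          "legitimate", "legitimate", "legitimate", "spam"]
metrics = evaluate(y_true, y_pred)
```

In 10-fold cross-validation these counts would be accumulated over the ten held-out folds before the rates are computed.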

Evaluation
- Evaluate the importance of every feature with Decision Tree
- Use only one feature for classification each time

Conclusion
- Large-scale measurement on Twitter
- Formulation of new features
- Automatic classification system
- Overall accuracy 94.5%

Questions

Thank You!
