
Spam Detection

Spam Detection. Jingrui He 10/08/2007. Spam Types. Email Spam Unsolicited commercial email Blog Spam Unwanted comments in blogs Splogs Fake blogs to boost PageRank. From Learning Point of View. Spam Detection Classification problem (ham vs. spam) Feature Extraction


Presentation Transcript


  1. Spam Detection Jingrui He 10/08/2007

  2. Spam Types • Email Spam • Unsolicited commercial email • Blog Spam • Unwanted comments in blogs • Splogs • Fake blogs to boost PageRank

  3. From Learning Point of View • Spam Detection • Classification problem (ham vs. spam) • Feature Extraction • A Learning Approach to Spam Detection based on Social Networks. H.Y. Lam and D.Y. Yeung • Fast Classifier • Relaxed Online SVMs for Spam Filtering. D. Sculley, G.M. Wachman

  4. A Learning Approach to Spam Detection based on Social Networks H.Y. Lam and D.Y. Yeung CEAS 2007

  5. Problem Statement • n Email Accounts • A Sender Set and a Receiver Set • A Labeled Subset of the Sender Set • Goal • Assign a spam score to each remaining (unlabeled) sender account

  6. System Flow Chart

  7. Social Network from Logs • Directed Graph • Directed Edge • An email sent from one account to another • Edge Weight • Defined from the number of emails sent from the source account to the destination account
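
A minimal sketch of how such a directed graph could be built from mail logs. The log format (one `(sender, receiver)` pair per email) and the function name `build_email_graph` are assumptions for illustration, and the edge weight here is simply the raw email count, since the slide's weight formula is not present in the transcript.

```python
from collections import defaultdict


def build_email_graph(log):
    """Build {sender: {receiver: email_count}} from (sender, receiver) pairs."""
    graph = defaultdict(lambda: defaultdict(int))
    for sender, receiver in log:
        # Assumption: the edge weight is the raw email count. The weight
        # formula on the slide is missing from the transcript and may differ.
        graph[sender][receiver] += 1
    return {sender: dict(targets) for sender, targets in graph.items()}


# Hypothetical log entries, one (sender, receiver) pair per email.
log = [("a", "b"), ("a", "b"), ("b", "a"), ("a", "c")]
graph = build_email_graph(log)
print(graph)
```

Keeping the counts per directed pair preserves the asymmetry between sending and receiving, which the later features rely on.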

  8. System Flow Chart

  9. Features from Email Social Networks • In-count / Out-count • The sum of in-coming / out-going edge weights • In-degree / Out-degree • The number of email accounts that a node receives emails from / sends emails to
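
The four count and degree features above can be sketched as follows. The `edges` dictionary, mapping a hypothetical `(sender, receiver)` pair to its edge weight, is an assumed representation and not something the slides specify.

```python
def count_and_degree(edges, node):
    """Return in/out-count and in/out-degree of `node`.

    `edges` maps (sender, receiver) -> edge weight (assumed representation).
    """
    out_weights = [w for (s, r), w in edges.items() if s == node]
    in_weights = [w for (s, r), w in edges.items() if r == node]
    return {
        "in_count": sum(in_weights),     # sum of incoming edge weights
        "out_count": sum(out_weights),   # sum of outgoing edge weights
        "in_degree": len(in_weights),    # accounts the node receives from
        "out_degree": len(out_weights),  # accounts the node sends to
    }


# Hypothetical weighted edges for illustration.
edges = {("a", "b"): 2, ("a", "c"): 1, ("b", "a"): 1}
features = count_and_degree(edges, "a")
print(features)
```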

  10. Features from Email Social Networks • Communication Reciprocity (CR) • The percentage of interactive neighbors that a node has • Defined over two sets: the accounts that sent emails to the node and the accounts that received emails from the node
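
One plausible reading of CR, sketched under an explicit assumption: the slide's formula is missing from the transcript, so this version divides the number of accounts that both sent to and received from the node by the number of accounts the node sent to. The choice of denominator and the function name are assumptions.

```python
def communication_reciprocity(edges, node):
    """Fraction of a node's recipients that also sent email back to it.

    `edges` is a set of (sender, receiver) pairs. Using the recipient set
    as the denominator is an assumption, not taken from the slides.
    """
    sent_to_node = {s for (s, r) in edges if r == node}
    received_from_node = {r for (s, r) in edges if s == node}
    if not received_from_node:
        return 0.0  # no outgoing mail, so no recipients can reciprocate
    return len(sent_to_node & received_from_node) / len(received_from_node)


# Hypothetical edges: "a" mails "b" and "c", but only "b" replies.
edges = {("a", "b"), ("a", "c"), ("b", "a")}
print(communication_reciprocity(edges, "a"))
```

The intuition is that a spam account sends to many recipients who never reply, which pushes this ratio toward zero.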

  11. Features from Email Social Networks • Communication Interaction Average (CIA) • The level of interaction between a sender and each of the corresponding recipients
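
The CIA formula is not present in the transcript. As a hedged illustration only, the sketch below assumes it is the mean, over a sender's recipients, of the ratio between emails received back and emails sent; both that definition and the function name are assumptions.

```python
def communication_interaction_average(weights, node):
    """Mean reply ratio between `node` and each of its recipients.

    `weights` maps (sender, receiver) -> number of emails. The ratio
    (emails received back) / (emails sent) is an assumed definition.
    """
    recipients = [r for (s, r) in weights if s == node]
    if not recipients:
        return 0.0
    ratios = [weights.get((r, node), 0) / weights[(node, r)] for r in recipients]
    return sum(ratios) / len(ratios)


# Hypothetical email counts for illustration.
weights = {("a", "b"): 2, ("a", "c"): 1, ("b", "a"): 1}
print(communication_interaction_average(weights, "a"))
```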

  12. Features from Email Social Networks • Clustering Coefficient (CC) • Friends-of-friends relationship between email accounts • Computed from the number of connections between a node's neighbors and the number of neighbors the node has
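
A sketch of the standard clustering coefficient, which matches the two quantities the slide names (connections between neighbors, number of neighbors). Treating an email in either direction as a connection is an assumption here; the slide's exact formula is not in the transcript.

```python
from itertools import combinations


def clustering_coefficient(edges, node):
    """Standard clustering coefficient: 2 * links / (k * (k - 1)).

    `edges` is a set of (sender, receiver) pairs. Direction is ignored
    when deciding who is a neighbor (an assumption).
    """
    neighbors = {r for (s, r) in edges if s == node}
    neighbors |= {s for (s, r) in edges if r == node}
    neighbors.discard(node)
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors cannot form a triangle
    links = sum(
        1
        for u, v in combinations(sorted(neighbors), 2)
        if (u, v) in edges or (v, u) in edges
    )
    return 2 * links / (k * (k - 1))


# Hypothetical edges: "a" has neighbors b, c, d; only b and c are linked.
edges = {("a", "b"), ("a", "c"), ("b", "c"), ("a", "d")}
print(clustering_coefficient(edges, "a"))
```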

  13. System Flow Chart

  14. Preprocessing • Sender Feature Vector • Weighted Features Problematic?

  15. System Flow Chart

  16. Assigning Spam Score • Similarity-Weighted k-NN Method • Gaussian similarity • Similarity-weighted mean of the k-NN scores, taken over the set of k nearest neighbors • Score scaling
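
A minimal sketch of similarity-weighted k-NN scoring with a Gaussian similarity. The bandwidth `sigma`, the label convention (1 for spam, 0 for legitimate), and the function name are assumptions, and the final score-scaling step is omitted because its form is not given in the transcript.

```python
import math


def knn_spam_score(query, labeled, k=3, sigma=1.0):
    """Similarity-weighted mean of the labels of the k most similar senders.

    `labeled` is a list of (feature_vector, label) with label 1 = spam and
    0 = legitimate (assumed convention). `sigma` is an assumed bandwidth.
    """
    def similarity(u, v):
        squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
        return math.exp(-squared_distance / (2 * sigma ** 2))

    # The k nearest neighbors are the k labeled senders with highest similarity.
    nearest = sorted(labeled, key=lambda item: similarity(query, item[0]), reverse=True)[:k]
    sims = [similarity(query, x) for x, _ in nearest]
    total = sum(sims)
    if total == 0:
        return 0.5  # no informative neighbor; fall back to an undecided score
    return sum(s * y for s, (_, y) in zip(sims, nearest)) / total


# Hypothetical labeled feature vectors for illustration.
labeled = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((5.0, 5.0), 1), ((5.0, 6.0), 1)]
print(knn_spam_score((5.0, 5.5), labeled, k=2))
```

Because the weights are normalized, the score stays between 0 and 1 and leans toward the labels of the closest neighbors.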

  17. Experiments • Enron Dataset: 9150 Senders • To Get • Legitimate Enron senders: email transactions within the Enron email domain • 5000 generated spam accounts • 120 senders from each class • Results Averaged over 100 Runs

  18. Number of Nearest Neighbors

  19. Feature Weights (CC)

  20. Feature Weights (CIA)

  21. Feature Weights (CR)

  22. Feature Weights • In/Out-Count & In/Out-Degree • The smaller the better • Final Weights • In/Out-count & In/Out-degree: 1 • CR: 1 • CIA: 10 • CC: 15
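
The final weights above could be applied as per-feature multipliers before computing distances. The transcript does not say how the weights enter the feature vector, so plain multiplication, the feature names, and the sample values below are assumptions.

```python
# Final weights as listed on the slide; the key names are illustrative.
FINAL_WEIGHTS = {
    "in_count": 1,
    "out_count": 1,
    "in_degree": 1,
    "out_degree": 1,
    "cr": 1,
    "cia": 10,
    "cc": 15,
}


def weight_features(features):
    """Scale each feature by its weight (assumed to be a simple multiplier)."""
    return [FINAL_WEIGHTS[name] * features[name] for name in FINAL_WEIGHTS]


# Hypothetical preprocessed feature values for one sender.
sender = {"in_count": 0.2, "out_count": 0.4, "in_degree": 0.1,
          "out_degree": 0.3, "cr": 0.5, "cia": 0.05, "cc": 0.02}
print(weight_features(sender))
```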

  23. Conclusion • Legitimacy Score • No content needed • Can Be Combined with Content-Based Filters • More Sophisticated Classifiers • SVM, boosting, etc. • Classifiers Using Combined Features

  24. Relaxed Online SVMs for Spam Filtering D. Sculley and G.M. Wachman SIGIR 2007

  25. Anti-Spam Controversy • Support Vector Machines (SVMs) • Academic Researchers • Statistically robust • State-of-the-art performance • Practitioners • Quadratic in the number of training examples • Impractical! • Solution: Relaxed Online SVMs

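  26. Background: SVMs • Data Set = $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ • Class Label $y_i$: 1 for spam; -1 for ham • Classifier: $f(x) = w \cdot x + b$ • To Find $w$ and $b$ • Minimize: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$ • Constraints: $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ • $C$: tradeoff parameter • $\xi_i$: slack variable • $\frac{1}{2}\|w\|^2$ term: maximizing the margin • $\sum_i \xi_i$ term: minimizing the loss function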
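
To make the tradeoff concrete, the sketch below evaluates the standard soft-margin SVM objective for a fixed linear classifier, with each slack variable taken as the hinge loss of its example. The data, weights, and value of `C` are hypothetical.

```python
def svm_objective(w, b, data, C):
    """Soft-margin objective: 0.5 * ||w||^2 + C * sum of slacks.

    Each slack is the hinge loss max(0, 1 - y * (w . x + b)).
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slacks = [
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in data
    ]
    return margin_term + C * sum(slacks)


# Hypothetical examples: label 1 for spam, -1 for ham.
data = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-1.0, 0.0], -1)]
print(svm_objective([1.0, 0.0], 0.0, data, C=100.0))  # 50.5
```

Only the second example falls inside the margin (slack 0.5), so with a large `C` the loss term dominates the margin term.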

  27. Online SVMs

  28. Tuning the Tradeoff Parameter C • SpamAssassin data set: 6034 examples • Large C preferred

  29. Email Spam and SVMs • TREC05P-1: 92189 Messages • TREC06P: 37822 messages

  30. Blog Comment Spam and SVMs • Leave One Out Cross Validation • 50 Blog Posts; 1024 Comments

  31. Splogs and SVMs • Leave One Out Cross Validation • 1380 Examples
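
Leave-one-out cross-validation, as used for the blog comment and splog experiments, can be sketched generically. A 1-nearest-neighbor classifier stands in for the SVM here purely to keep the example short; the function names and data are hypothetical.

```python
def leave_one_out_accuracy(examples, train, predict):
    """Hold out each example once, train on the rest, and score the held-out one."""
    correct = 0
    for i, (x, y) in enumerate(examples):
        model = train(examples[:i] + examples[i + 1:])
        correct += predict(model, x) == y
    return correct / len(examples)


# Stand-in classifier (1-nearest neighbor), not the SVM from the slides.
def train_1nn(training_examples):
    return list(training_examples)


def predict_1nn(model, x):
    nearest = min(model, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)))
    return nearest[1]


# Hypothetical one-dimensional examples: -1 for ham, 1 for spam.
examples = [((0.0,), -1), ((0.1,), -1), ((1.0,), 1), ((1.1,), 1)]
print(leave_one_out_accuracy(examples, train_1nn, predict_1nn))  # 1.0
```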

  32. Computational Cost • Online SVMs: Quadratic Training Time

  33. Relaxed Online SVMs (ROSVM) • Objective Function of SVMs: $\frac{1}{2}\|w\|^2 + C \sum_{i} \xi_i$ • Large C Preferred • Minimizing training error is more important than maximizing the margin • ROSVM • Full margin maximization not necessary • Relax this requirement

  34. Three Ways to Relax SVMs (1) • Only Optimize Over the Recent p Examples • Dual form of SVMs • Constraints: each dual variable outside the most recent p examples is held at the last value found for it

  35. Three Ways to Relax SVMs (2) • Only Update on Actual Errors • Original online SVMs • Update when $y_i f(x_i) < 1$ • ROSVM • Update when $y_i f(x_i) < m$, with $0 \le m \le 1$ • m=0: mistake-driven online SVMs • No significant degradation in performance • Significantly reduced cost

  36. Three Ways to Relax SVMs (3) • Reduce the Number of Iterations in Iterative SVMs • SMO: repeated passes over the training set to minimize the objective function • Parameter T: the maximum number of iterations • T=1: little impact on performance
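
The three relaxations can be illustrated together with a toy online linear learner. This is not the authors' implementation: plain hinge-loss subgradient steps replace the SMO solver, and the class name, learning rate `lr`, and default values are assumptions. Only the roles of `p` (window of recent examples), `m` (update threshold), and `T` (iteration cap) follow the slides.

```python
from collections import deque


class RelaxedOnlineLinearClassifier:
    """Toy hinge-loss learner with ROSVM-style knobs p, m and T (not SMO)."""

    def __init__(self, dim, p=100, m=0.8, T=1, lr=0.1):
        self.w = [0.0] * dim
        self.b = 0.0
        self.m, self.T, self.lr = m, T, lr
        self.window = deque(maxlen=p)  # relaxation 1: keep only the recent p examples

    def decision(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def observe(self, x, y):
        """Process one labeled example; return True if the model was updated."""
        self.window.append((x, y))
        if y * self.decision(x) >= self.m:
            return False  # relaxation 2: skip examples classified with margin >= m
        for _ in range(self.T):  # relaxation 3: at most T passes over the window
            for xi, yi in self.window:
                if yi * self.decision(xi) < 1.0:
                    # Subgradient step on the hinge loss (stand-in for SMO).
                    self.w = [wj + self.lr * yi * xj for wj, xj in zip(self.w, xi)]
                    self.b += self.lr * yi
        return True


# Hypothetical stream: label 1 for spam, -1 for ham.
stream = [([1.0, 1.0], 1), ([-1.0, -1.0], -1), ([2.0, 1.0], 1), ([-2.0, -1.0], -1)]
clf = RelaxedOnlineLinearClassifier(dim=2, p=4, m=0.8, T=1)
updates = [clf.observe(x, y) for x, y in stream]
print(updates[0], clf.decision([2.0, 1.0]) > 0, clf.decision([-2.0, -1.0]) < 0)
```

Each knob bounds the work done per example: the window caps how many examples are revisited, the threshold skips updates for confidently classified messages, and `T` caps the passes per update.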

  37. Testing Reduced Size

  38. Testing Reduced Iterations

  39. Testing Reduced Updates

  40. Online SVMs and ROSVM • Comparison across data sets: Email Spam, Blog Comment Spam, Splog
