Email Spam Detection Using Machine Learning


Presentation Transcript


  1. Email Spam Detection Using Machine Learning Lydia Song, Lauren Steimle, Xiaoxiao Xu

  2. Outline • Introduction to Project • Pre-processing • Dimensionality Reduction • Brief discussion of different algorithms • K-nearest • Decision tree • Logistic regression • Naïve-Bayes • Preliminary results • Conclusion

  3. Spam Statistics • Spam averaged 69.9% of email traffic in February 2014 [Figure: percentage of spam in email traffic] Source: https://www.securelist.com/en/analysis/204792328/Spam_report_February_2014

  4. Spam vs. Ham • Spam = unwanted communication • Ham = normal communication

  5. Pre-processing [Figure: an example spam email, shown both as the corresponding raw file in the data set and as rendered in a web browser]

  6. Pre-processing • Remove meaningless words • Create a “bag of words” used in the data set • Combine similar words • Create a feature matrix (see the sketch below) [Figure: feature matrix with one column per word (“service”, “last”, “history”, …) and one row per email (Email 1 … Email m)]
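To make the feature-matrix step concrete, here is a minimal Python sketch that builds a bag-of-words count matrix from emails that have already been pre-processed; the toy `emails` data is a hypothetical illustration, not the authors' actual data set.

```python
# Minimal sketch: build a bag-of-words feature matrix from emails
# that have already been tokenized, filtered, and stemmed.
# The toy data below is hypothetical.
from collections import Counter

emails = [
    ["service", "last", "history"],
    ["service", "order", "refill"],
    ["history", "last", "last", "thank"],
]

# The "bag of words": every distinct word used in the data set.
vocabulary = sorted({word for email in emails for word in email})

# Feature matrix: one row per email, one column per vocabulary word,
# each entry counting how often that word appears in that email.
counts = [Counter(email) for email in emails]
feature_matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in feature_matrix:
    print(row)
```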

  7. Pre-processing Example • Original email: “Your history shows that your last order is ready for refilling. Thank you, Sam Mcfarland, Customer Services” • tokens = ['your', 'history', 'shows', 'that', 'your', 'last', 'order', 'is', 'ready', 'for', 'refilling', 'thank', 'you', 'sam', 'mcfarland', 'customer', 'services'] • filtered_words = ['history', 'last', 'order', 'ready', 'refilling', 'thank', 'sam', 'mcfarland', 'customer', 'services'] • bag_of_words = ['history', 'last', 'order', 'ready', 'refill', 'thank', 'sam', 'mcfarland', 'custom', 'service'] [Figure: feature matrix with stemmed-word columns (“servi”, “histori”, “last”, …) and rows Email 1 … Email m] A sketch of this pipeline follows.
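The deck does not say which tools it used, so the NLTK stop-word list and Porter stemmer in the sketch below are assumptions; run `nltk.download('stopwords')` once before using them.

```python
# Sketch of the pre-processing pipeline from the slide, assuming NLTK.
import re
from nltk.corpus import stopwords    # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

text = ("Your history shows that your last order is ready for refilling. "
        "Thank you, Sam Mcfarland, Customer Services")

# 1. Tokenize: lowercase the text and keep alphabetic words only.
tokens = re.findall(r"[a-z]+", text.lower())

# 2. Remove meaningless (stop) words such as 'your', 'that', 'is'.
stop_words = set(stopwords.words("english"))
filtered_words = [t for t in tokens if t not in stop_words]

# 3. Combine similar words by stemming, e.g. 'refilling' -> 'refill'
#    and 'customer' -> 'custom', as in the slide's bag of words.
stemmer = PorterStemmer()
bag_of_words = [stemmer.stem(w) for w in filtered_words]

print(bag_of_words)
```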

  8. Dimensionality Growth • Each additional email adds roughly 100-150 new word features

  9. Dimensionality Reduction • Require that a word appear in at least x% of all emails to be considered a feature (sketched below)
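A minimal sketch of this document-frequency cutoff, assuming each email is already a list of processed words; the function name and toy data are hypothetical.

```python
# Sketch: keep only words that appear in at least `cutoff` (a fraction)
# of all emails; everything else is dropped from the feature set.
def frequent_words(emails, cutoff=0.15):
    n_emails = len(emails)
    doc_freq = {}
    for email in emails:
        for word in set(email):            # count each word once per email
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(w for w, c in doc_freq.items() if c / n_emails >= cutoff)

# Hypothetical toy data: 'service' appears in 2 of 3 emails.
emails = [["service", "order"], ["service", "thank"], ["history"]]
print(frequent_words(emails, cutoff=0.5))  # -> ['service']
```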

  10. Dimensionality Reduction-Hashing Trick • Before hashing: 70x9403 dimensions • After hashing: 70x1024 dimensions [Figure: hash table mapping strings to integer indices] Source: Jorge Stolfi, http://en.wikipedia.org/wiki/File:Hash_table_5_0_1_1_1_1_1_LL.svg#filelinks
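A sketch of the hashing trick with 1024 buckets, as on the slide. MD5 is used here only because Python's built-in `hash()` is randomized between runs; the hash function the project actually used is not stated.

```python
# Sketch of the hashing trick: map every word to one of 1024 buckets,
# fixing the feature dimension no matter how large the vocabulary grows.
import hashlib

N_BUCKETS = 1024  # dimension after hashing, as on the slide

def hashed_features(words, n_buckets=N_BUCKETS):
    counts = [0] * n_buckets
    for word in words:
        index = int(hashlib.md5(word.encode()).hexdigest(), 16) % n_buckets
        counts[index] += 1    # colliding words simply share a bucket
    return counts

features = hashed_features(["history", "last", "order", "refill"])
print(len(features), sum(features))  # 1024 buckets, 4 words hashed in
```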

  11. Outline • Introduction to Project • Pre-processing • Dimensionality Reduction • Brief discussion of different algorithms • K-nearest • Decision tree • Logistic regression • Naïve-Bayes • Preliminary results • Conclusion

  12. K-Nearest Neighbors • Goal: Classify an unknown sample into one of C classes • Idea: To determine the label of an unknown sample x, look at the labels of x's k nearest neighbors Image from MIT OpenCourseWare
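A from-scratch sketch of the idea with NumPy; the toy feature vectors and labels are hypothetical.

```python
# Sketch of k-nearest neighbors: label x by majority vote among the
# k training samples closest to it (Euclidean distance).
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    distances = np.linalg.norm(X_train - x, axis=1)  # distance to each sample
    nearest = np.argsort(distances)[:k]              # indices of k nearest
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                # majority label

# Hypothetical toy data: rows are feature vectors; 1 = spam, 0 = ham.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9])))  # -> 1 (spam)
```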

  13. Decision Tree • Convert training data into a tree structure • Root node: the first decision node • Decision node: an if-then decision based on features of the training sample • Leaf node: contains a class label Image from MIT OpenCourseWare
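A short sketch of fitting a decision tree. scikit-learn is an assumed implementation choice here (the deck itself mentions Matlab and Weka as its candidates), and the toy data is hypothetical.

```python
# Sketch: fit and apply a decision tree classifier on toy word counts.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical word-count features; labels: 1 = spam, 0 = ham.
X = [[3, 0, 1], [2, 0, 0], [0, 2, 3], [0, 1, 2]]
y = [1, 1, 0, 0]

# Each internal (decision) node tests one feature; each leaf holds a label.
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, y)
print(tree.predict([[1, 0, 0]]))  # follow root -> decision nodes -> leaf
```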

  14. Logistic Regression • “Regression” over training examples • Transform the continuous output to a prediction of 1 or 0 using the standard logistic function g(z) = 1 / (1 + e^(-z)) • Predict spam if g(θᵀx) ≥ 0.5
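A sketch of this decision rule; the weights `theta` are assumed to have been learned already, and the numbers are hypothetical.

```python
# Sketch of the logistic-regression decision rule: squash the linear
# score through the logistic function and threshold at 0.5.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_spam(theta, x):
    probability = logistic(theta @ x)  # modeled P(spam | x)
    return probability >= 0.5          # predict spam at or above 0.5

# Hypothetical learned weights and a feature vector (leading 1 = bias term).
theta = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 1.0, 2.0])
print(predict_spam(theta, x))  # logistic(2.0) ~ 0.88 -> True (spam)
```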

  15. Naïve Bayes • Use Bayes' Theorem: P(H | e) = P(e | H) · P(H) / P(e) • Hypothesis (H): spam or not spam • Event (e): word occurs • For example, the probability an email is spam given that the word “free” occurs in it: P(spam | “free”) • “Naïve”: assume the feature values are independent of each other
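A minimal naive Bayes sketch over toy word lists. The Laplace (+1) smoothing, the uniform priors, and the toy data are assumptions added so the example runs; they are not stated on the slide.

```python
# Sketch of naive Bayes: score each class by log P(class) plus the sum
# of log P(word | class), treating words as independent given the class.
import math
from collections import Counter

# Hypothetical training data, already pre-processed.
spam_emails = [["free", "order", "free"], ["free", "win"]]
ham_emails = [["history", "order"], ["thank", "order", "history"]]

spam_counts = Counter(w for email in spam_emails for w in email)
ham_counts = Counter(w for email in ham_emails for w in email)
vocab = set(spam_counts) | set(ham_counts)

def log_score(words, counts, prior):
    total = sum(counts.values())
    score = math.log(prior)
    for w in words:
        # Laplace smoothing keeps unseen words from zeroing the product.
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

email = ["free", "order"]
spam_score = log_score(email, spam_counts, prior=0.5)
ham_score = log_score(email, ham_counts, prior=0.5)
print("spam" if spam_score > ham_score else "ham")  # -> spam
```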

  16. Outline • Introduction to Project • Pre-processing • Dimensionality Reduction • Brief discussion of different algorithms • K-nearest • Decision tree • Logistic regression • Naïve-Bayes • Preliminary results • Conclusion

  17. Preliminary Results • 250 emails in the training set, 50 in the testing set • Use 15% as the “percentage of emails” cutoff • Performance measures: • Accuracy: % of predictions that were correct • Recall: % of spam emails that were predicted correctly • Precision: % of emails classified as spam that were actually spam • F-Score: harmonic mean of precision and recall
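These four measures come directly from the confusion-matrix counts; a minimal sketch with hypothetical labels, treating spam as the positive class and computing the F-score as the harmonic mean (F1) of precision and recall.

```python
# Sketch: compute the slide's four performance measures from labels.
def evaluate(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)       # % predictions correct
    recall = tp / (tp + fn)                  # % of spam caught
    precision = tp / (tp + fp)               # % flagged that is spam
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f_score

# Hypothetical labels: 1 = spam, 0 = ham.
print(evaluate([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))
# -> approximately (0.8, 0.667, 1.0, 0.8)
```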

  18. “Percentage of Emails” Performance [Figure: performance as a function of the “percentage of emails” cutoff, for linear regression and logistic regression]

  19. Preliminary Results

  20. Next Steps • Implement SVM: Matlab vs. Weka • Hashing trick: try different numbers of buckets • Regularization

  21. Thank you! Any questions?
