
Learning to Detect Phishing Emails

Learning to Detect Phishing Emails. Report : 鄭志欣 Advisor: Hsing-Kuo Pao. I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the International World Wide Web Conference (WWW), pages 649 – 656, 2007. Outline. Introduction Method Empirical evaluation



Presentation Transcript


  1. Learning to Detect Phishing Emails • Report: 鄭志欣 • Advisor: Hsing-Kuo Pao • I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the International World Wide Web Conference (WWW), pages 649–656, 2007.

  2. Outline • Introduction • Method • Empirical evaluation • Conclusion

  3. Introduction • Phishing (Spoofed websites) • Stealing account information • Logon credentials • Identity information • Phishing Problem – Hard

  4. Method • PILFER – a machine-learning-based approach to classification. • Two classes: phishing emails and ham (good) emails • Feature Set • Features as used in email classification

  5. Features as used in email classification • IP-based URLs: • http://192.168.0.1/paypal.cgi?fix_account • Some phishing attacks are hosted on compromised PCs, which may have no DNS entry and so are linked to by IP address. • This is a binary feature.
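The IP-based URL check can be sketched as follows. This is a minimal illustration, not the scripts used in the paper; the function name `has_ip_based_url` and the choice to take a list of already-extracted URLs as input are assumptions.

```python
import ipaddress
from urllib.parse import urlsplit

def has_ip_based_url(urls):
    """Return 1 if any URL's host is a literal IP address, else 0.

    Hypothetical helper: `urls` is assumed to be the list of link
    targets already extracted from one email.
    """
    for url in urls:
        host = urlsplit(url).hostname  # None when the URL has no host part
        if host is None:
            continue
        try:
            ipaddress.ip_address(host)  # raises ValueError for ordinary names
            return 1
        except ValueError:
            continue
    return 0
```

For the slide's example, `has_ip_based_url(["http://192.168.0.1/paypal.cgi?fix_account"])` yields 1, while a URL with an ordinary host name yields 0.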

  6. Age of linked-to domain names • Phishers register legitimate-sounding domain names • playpal.com • paypal-update.com • These domains often have a limited life • A WHOIS query gives the registration date of each linked-to domain • If that date is within 60 days of the date the email was sent, the domain is flagged as “fresh”. • This is a binary feature.
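Once the registration date is known, the 60-day test is a simple date comparison. The sketch below assumes the WHOIS lookup has already been done and its result is passed in as a `date`; the function name is hypothetical, and treating a registration date later than the send date as "not fresh" is an assumption the slide does not settle.

```python
from datetime import date, timedelta

FRESH_WINDOW = timedelta(days=60)  # threshold stated on the slide

def is_fresh_domain(registered_on, email_sent_on):
    """Return 1 if the domain was registered at most 60 days before the email was sent."""
    age = email_sent_on - registered_on
    # Assumption: a negative age (registered after sending) is not counted as fresh.
    return int(timedelta(0) <= age <= FRESH_WINDOW)
```

For example, a domain registered on 1 May 2006 and linked from an email sent on 1 June 2006 is 31 days old and is flagged as fresh.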

  7. Nonmatching URLs • This is a case of a link that says paypal.com but actually links to badsite.com. • Such a link looks like <a href="badsite.com"> paypal.com</a>. • This is a binary feature.
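A rough sketch of the nonmatching-URL check, using only the Python standard library. The class and function names are hypothetical, and two simplifications are assumptions rather than the paper's rules: link text is compared only when it looks like a host name (no spaces, at least one dot), and hosts are compared by exact string match.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> element."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def host_of(value):
    """Lower-cased host of a URL-like string; a missing scheme is tolerated."""
    if "//" not in value:
        value = "//" + value
    return (urlsplit(value).hostname or "").lower()

def has_nonmatching_url(html):
    """Return 1 if some link's visible text names a different host than its href."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    for href, text in parser.links:
        # Only text that itself looks like a host name can disagree with the target.
        if " " in text or "." not in text:
            continue
        if host_of(text) != host_of(href):
            return 1
    return 0
```

Applied to the slide's example, `<a href="badsite.com"> paypal.com</a>`, the function returns 1.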

  8. “Here” links to non-modal domain • “Click here to restore your account access” • The “modal domain” is the domain most frequently linked to in the email. • The feature flags a link with the text “link”, “click”, or “here” that points to a domain other than the modal domain. • This is a binary feature.
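The modal-domain test can be sketched like this. The input format (a list of `(domain, link_text)` pairs for one email) and the function name are assumptions, as is matching the trigger words as whole words; ties for the most frequent domain are broken by first appearance.

```python
import re
from collections import Counter

TRIGGER_WORDS = {"link", "click", "here"}

def here_link_to_nonmodal_domain(links):
    """links: list of (domain, link_text) pairs extracted from one email."""
    if not links:
        return 0
    # Modal domain: the domain linked to most often in this email.
    modal_domain = Counter(domain for domain, _ in links).most_common(1)[0][0]
    for domain, text in links:
        words = set(re.findall(r"[a-z]+", text.lower()))
        if domain != modal_domain and words & TRIGGER_WORDS:
            return 1
    return 0
```

An email that links twice to paypal.com and once, under the text "Click here to restore your account access", to badsite.com would be flagged.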

  9. HTML emails • Emails are sent as plain text, HTML, or a combination of the two (the multipart/alternative format). • Launching an attack without HTML is difficult, because a plain-text email shows the user the actual URL of every link. • This is a binary feature.
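One way to compute this feature is to parse the raw message and look for a `text/html` part. The sketch below uses Python's standard `email` package; the function name is hypothetical, and treating "any MIME part of type text/html" as the definition of an HTML email is an assumption.

```python
from email import message_from_string

def is_html_email(raw_email):
    """Return 1 if the message, or any of its MIME parts, is text/html."""
    msg = message_from_string(raw_email)
    # walk() visits the message itself and every nested part.
    return int(any(part.get_content_type() == "text/html" for part in msg.walk()))
```

A multipart/alternative message carrying both a plain-text and an HTML part is therefore counted as an HTML email.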

  10. Number of links • The number of links present in an email. • A link is an <a> tag in the HTML. • This is a continuous feature.
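Counting links can be sketched with the standard-library HTML parser. The names are hypothetical, and the sketch assumes that only `<a>` tags carrying an `href` attribute count as links, so bare anchors such as `<a name="top">` are ignored.

```python
from html.parser import HTMLParser

class LinkCounter(HTMLParser):
    """Count <a> start tags that carry an href attribute."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before this call.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1

def number_of_links(html):
    counter = LinkCounter()
    counter.feed(html)
    counter.close()
    return counter.count
```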

  11. Number of domains • Take the domain names previously extracted from all of the links and count the number of distinct domains. • Only the “main” part of a domain is considered • https://www.cs.university.edu/ counts as university.edu • http://www.company.co.jp/ counts as company.co.jp • This is a continuous feature.
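Extracting the "main" part of a domain requires knowing which suffixes, such as co.jp, span two labels. The sketch below hard-codes a tiny, purely illustrative suffix set (an assumption; a real system would consult the full Public Suffix List), and both function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical sample of two-label suffixes, for illustration only.
TWO_LABEL_SUFFIXES = {"co.jp", "co.uk", "com.au"}

def main_domain(url):
    """Reduce a URL to its registrable ("main") domain under the sample suffix set."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    keep = 3 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])

def number_of_domains(urls):
    """Count distinct main domains among the links of one email."""
    return len({d for d in map(main_domain, urls) if d})
```

With this sample set, `main_domain("https://www.cs.university.edu/")` gives `university.edu` and `main_domain("http://www.company.co.jp/")` gives `company.co.jp`.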

  12. Number of dots • Attackers pad URLs with subdomains, like • http://www.my-bank.update.data.com • or with a redirection script, such as • http://www.google.com/url?q=http://www.badsite.com • This feature is the maximum number of dots (“.”) contained in any of the links present in the email. • This is a continuous feature.
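The dot count is straightforward to compute once the links are extracted. In this sketch the function name is hypothetical, and returning 0 for an email without links is an assumption.

```python
def max_dots(urls):
    """Maximum number of '.' characters found in any single link of one email."""
    # default=0 covers emails that contain no links at all (an assumption).
    return max((url.count(".") for url in urls), default=0)
```

Both example URLs on the slide contain four dots: the subdomain example has four in its host name, and the redirection example has two in each of its two embedded host names.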

  13. Contains javascript • Attackers can use JavaScript to hide information from the user, and potentially launch sophisticated attacks. • An email is flagged with the “contains javascript” feature if the string “javascript” appears in the email, regardless of whether it is actually in a <script> or <a> tag • This is a binary feature.

  14. Spam-filter output • The class assigned to the email by SpamAssassin: “ham” or “spam” • Uses the trained version of SpamAssassin with the default rule weights and threshold. • This is a binary feature.

  15. Empirical Evaluation • Machine-Learning Implementation • Testing Spam Assassin • Datasets • Additional Challenges • False Positives vs. False Negatives

  16. Machine-Learning Implementation-PILFER • First, we run a set of scripts to extract all the features listed. • Second, we train and test a classifier using 10-fold cross-validation. • Random forest (classifier) • A random forest builds a number of decision trees; each tree is made by randomly choosing an attribute to split on at each level and then pruning the tree.
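The 10-fold protocol can be illustrated by the index split alone. The sketch below is not the paper's tooling: the function name is hypothetical, the folds are contiguous, and shuffling the emails beforehand as well as training the random forest on each training split are left out.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    Every sample lands in exactly one test fold; the first
    n_samples % k folds receive one extra sample.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

In each of the ten rounds a classifier would be trained on the `train` indices and evaluated on the `test` indices, so every email is used for testing exactly once.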

  17. We use a random forest as a classifier.

  18. Testing SpamAssassin • SpamAssassin is a widely deployed, freely available spam filter that is highly accurate in classifying spam emails. • We classify the exact same dataset using SpamAssassin version 3.1.0, with the default thresholds and rules. • “Untrained” SpamAssassin • “Trained” SpamAssassin, using 10-fold cross-validation

  19. Datasets • Two publicly available datasets. • ham corpora from the SpamAssassin project • 6950 non-phishing non-spam emails • Phishingcorpus • approximately 860 email messages

  20. Additional Challenges • The age of the dataset. • Phishing websites are short-lived. • Some of our features therefore cannot be extracted from older emails, which makes testing difficult. • Example: the age of the linked-to domain

  21. Result

  22. Conclusion • It is possible to detect phishing emails with high accuracy by using a specialized filter whose features are more directly applicable to phishing emails than those employed by general-purpose spam filters.

  23. Reference • I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the International World Wide Web Conference (WWW), pages 649–656, 2007. • www.ics.uci.edu/.../Learning%20to%20Detect%20Phishing%20Emails.pptx • http://armorize-cht.blogspot.com/2010/01/phishing-mail.html
