
Filtering Spam With

Filtering Spam With. Justin Mason, SpamAssassin Project & Deersoft http://SpamAssassin.org/. What Is Spam?. Best description: "Unsolicited Bulk E-mail" In human terms: bulk e-mail you didn't want, and didn't ask for


Presentation Transcript


  1. Filtering Spam With Justin Mason, SpamAssassin Project & Deersoft http://SpamAssassin.org/

  2. What Is Spam? • Best description: "Unsolicited Bulk E-mail" • In human terms: bulk e-mail you didn't want, and didn't ask for • Mailing lists, newsletters, "latest offers": not spam, if you asked for them in the first place • Name courtesy of Monty Python: “spam, spam, spam and spam”

  3. Why Bother Filtering Spam? • Seems to be about 30% to 60% of mail traffic, and increasing • Users are forced to waste time wading through their inbox • costs their employers money • Impossible to unsubscribe • “unsubscribe” addresses work only 37% of the time, according to the FTC • Legal retaliation not possible, yet • Just plain irritating!

  4. Spam Volume Is Increasing (data from Brightmail.com)

  5. Filtering: Homebrew Blacklists • First round of "spam filters": internal blacklists, maintained by in-house admin staff • Match addresses, and delete those from known spammers • Later, match "bad words" (Viagra, porn) • Quite hard to configure; centralised; lots of work to keep up to date

  6. Filtering: DNS Blacklists • Identify spam source computers by IP address • Allow mail system to look up a public database on the internet as mail arrives • Block the message if the sending host's IP address is blacklisted • Now at least 20 DNS blacklists, with varying reliability • Many false positives, e.g. eircom.net's main mail server!
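
As an illustration of the lookup described on this slide: DNS blacklists are conventionally queried by reversing the octets of the sending IPv4 address and appending the blacklist's zone name. If the name resolves, the address is listed. The sketch below assumes a placeholder zone, `dnsbl.example.org`, which is not a real blacklist, and the function names are hypothetical.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets + blacklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the blacklist zone has an A record for this IP.

    Performs a live DNS query; a resolution failure is treated as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# The zone below is a placeholder, so only the query name is shown here.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

A mail system would call something like `is_listed()` with the connecting host's address as each message arrives. Because a single listing can be wrong, the result is better used as one signal among several than as an outright block.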

  7. SpamAssassin Concepts • Zero-configuration where possible • Lots of rules to determine if a mail is spam or not • "Fuzzy logic": rules are assigned scores, based on our confidence in their accuracy • These are combined to produce an overall score for each message • If over a user-defined threshold, the mail is judged as spam • No one rule, alone, can mark a mail as spam
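
A minimal sketch of the scoring idea on this slide: each rule that fires adds its score, and the total is compared with a threshold. The rule names and score values are invented for illustration and are not SpamAssassin's real rule set. Treating a total exactly equal to the threshold as spam is also an assumption.

```python
# Hypothetical rule names and scores, for illustration only.
RULE_SCORES = {
    "FORGED_HEADER": 2.5,
    "SPAM_TOOL_SIGNATURE": 3.0,
    "BODY_KEYWORDS": 1.5,
    "DNS_BLACKLISTED": 2.0,
}
THRESHOLD = 5.0  # user-defined; the value here is an assumption


def total_score(hits):
    """Sum the scores of every rule that fired on a message."""
    return sum(RULE_SCORES[name] for name in hits)


def is_spam(hits, threshold=THRESHOLD):
    """Judge a message as spam when its total reaches the threshold."""
    return total_score(hits) >= threshold


# No single rule is allowed to reach the threshold on its own.
assert max(RULE_SCORES.values()) < THRESHOLD

print(is_spam(["SPAM_TOOL_SIGNATURE"]))                   # one rule alone
print(is_spam(["FORGED_HEADER", "SPAM_TOOL_SIGNATURE"]))  # two rules combined
```

Keeping every individual score below the threshold is what enforces the last bullet: one misfiring rule cannot condemn a message by itself.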

  8. SpamAssassin Concepts, pt.2 • Combines many systems for a "broad-spectrum" approach: • Detect forged headers • Spam-tool signatures in headers • Text keyword scanner in the message body • DNS blacklists • Razor, DCC (Distributed Checksum Clearinghouse), Pyzor • Spammers cannot aim to defeat 1 system; the others will catch them out

  9. Integration Into Mail Systems • SpamAssassin was written with flexibility of integration in mind • Many integrations have been written: • Integration into Mail Transfer Agents (sendmail, qmail, Exim, Postfix, Microsoft Exchange) • Integration into virus-scanner MTA plug-ins (MIMEDefang, amavisd-new) • IMAP/POP proxies and clients • Commercial plug-ins for Windows clients (Eudora, MS Outlook) • And many more I don't know about!

  10. Accuracy and False Positives • The big issue with filtering to date: • not just “how much spam does it catch?” • but “how many legitimate mails get caught, too?” • Many systems do not pay attention to this problem • Some blacklists even use "false positives" as a weapon against service providers selling to spammers • FPs are much worse than spam getting through • much more inconvenient to user

  11. Evolving a Better Filter • SpamAssassin assigns scores using a genetic algorithm • Given a big collection of human-classified mail, determine what tests each mail triggers • Use this to "evolve" an efficient score set • Exactly the kind of problem a genetic algorithm is good at • Allows "shotgun" rules to be scored low, where they cannot do damage
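
To make the idea of "evolving" a score set concrete, here is a toy genetic algorithm. It is a sketch under stated assumptions and not SpamAssassin's actual optimiser. The corpus, the four rules, the population size, the mutation step and the 10x penalty for false positives are all invented for illustration.

```python
import random

# Hypothetical hand-classified corpus: (indices of rules that fired, is_spam).
CORPUS = [
    ({0, 1}, True), ({0, 2}, True), ({1, 2}, True), ({0, 1, 2}, True),
    ({2, 3}, True), ({3}, False), (set(), False), ({2}, False),
    ({3}, False), ({1}, False),
]
N_RULES = 4
THRESHOLD = 5.0
FP_WEIGHT = 10  # assumption: a false positive costs 10x a missed spam


def cost(scores):
    """Weighted error count of a score set on the corpus (lower is better)."""
    total = 0
    for hits, is_spam in CORPUS:
        flagged = sum(scores[i] for i in hits) >= THRESHOLD
        if flagged and not is_spam:
            total += FP_WEIGHT
        elif is_spam and not flagged:
            total += 1
    return total


def evolve(generations=200, pop_size=30, seed=0):
    """Evolve a list of per-rule scores that minimises cost()."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0, 5) for _ in range(N_RULES)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]  # elitism: keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            i = rng.randrange(N_RULES)
            child[i] = max(0.0, child[i] + rng.gauss(0, 0.5))  # mutate one score
            children.append(child)
        pop = survivors + children
    return min(pop, key=cost)


best = evolve()
print([round(s, 2) for s in best], cost(best))
```

Because false positives are penalised heavily in `cost()`, a rule that also fires on legitimate mail (a "shotgun" rule) tends to end up with a low score, which is the effect the last bullet describes.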

  12. False Positive Rate • SpamAssassin is 98.5% accurate on our test corpora, with default settings • 0.6% false positives • 91% of all spam caught correctly • with network tests on, spam hit-rate probably increases to about 93-95% • Highest rate available among present tools • Tunable by the user: raising the threshold reduces FPs, lowering it catches more spam
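
The trade-off behind these figures can be sketched with a few lines of arithmetic. The scored messages below are invented and are not the project's test corpora. Measuring the false-positive rate as a fraction of legitimate mail, rather than of all mail, is an assumption.

```python
# Hypothetical (score, is_spam) pairs standing in for a scored test corpus.
SCORED = [
    (12.1, True), (8.4, True), (6.3, True), (5.2, True), (3.9, True),
    (5.6, False), (2.2, False), (1.0, False), (0.4, False), (0.0, False),
]


def rates(threshold):
    """Return (false_positive_rate, spam_hit_rate, accuracy) at a threshold."""
    ham = [s for s, is_spam in SCORED if not is_spam]
    spam = [s for s, is_spam in SCORED if is_spam]
    false_pos = sum(s >= threshold for s in ham)   # legitimate mail flagged
    caught = sum(s >= threshold for s in spam)     # spam correctly flagged
    correct = caught + (len(ham) - false_pos)
    return false_pos / len(ham), caught / len(spam), correct / len(SCORED)


for threshold in (4.0, 5.0, 6.0, 7.0):
    fp, hit, acc = rates(threshold)
    print(f"threshold {threshold}: FP {fp:.0%}, hits {hit:.0%}, accuracy {acc:.0%}")
```

On this toy data, raising the threshold from 5.0 to 6.0 removes the one false positive at the price of one missed spam. That is the tuning knob the slide refers to.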

  13. Effect of the Threshold Setting

  14. What To Do When You've Caught It • Since classifiers are imperfect, blind deletion is bad • Better to mark the mails, and allow user to check over them infrequently • Also good to mark for legal reasons • In the UK, it may be illegal to hold mail (even spam) for more than 3 days
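
A small sketch of marking rather than deleting, using Python's standard `email` package. The header names are modelled on the `X-Spam-Flag` and `X-Spam-Status` headers SpamAssassin adds, but the exact value format shown is an assumption. The sample message and the `mark_as_spam` helper are hypothetical.

```python
from email import message_from_string

RAW = """\
From: sender@example.com
To: user@example.org
Subject: Latest offers

Buy now!
"""


def mark_as_spam(raw, score, threshold):
    """Return the message with spam headers added; the body is left untouched."""
    msg = message_from_string(raw)
    msg["X-Spam-Flag"] = "YES"
    msg["X-Spam-Status"] = f"Yes, score={score:.1f} required={threshold:.1f}"
    return msg.as_string()


marked = mark_as_spam(RAW, 7.3, 5.0)
print(marked)
```

The message is still delivered, so a mail client or delivery agent can file it into a separate folder based on the header, and the user can review that folder for mistakes.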

  15. Features For Large-Scale Use: "spamd" • Client-server interface to SpamAssassin • Pre-loads, so much faster for high volumes • Can load user preferences from an SQL database • Can load-balance -- uses TCP/IP • Deployed at several large organisations and ISPs: The Well, Salon.com, Panix, Transmeta, SourceForge, Stanford
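
A rough sketch of what a spamd client does, based on my understanding of spamd's plain-text protocol (a `CHECK` request carrying a `Content-length` header, answered with a `Spam:` header). The protocol version string, the default port 783 and the exact response layout are assumptions here and should be checked against the protocol document shipped with SpamAssassin. The helper names are hypothetical.

```python
import socket


def build_check_request(message: bytes) -> bytes:
    """Frame a message as a CHECK request (assumed SPAMC/1.2 framing)."""
    header = f"CHECK SPAMC/1.2\r\nContent-length: {len(message)}\r\n\r\n"
    return header.encode("ascii") + message


def parse_check_response(response: bytes):
    """Parse a 'Spam: True ; 7.3 / 5.0' header into (is_spam, score, threshold)."""
    for line in response.decode("ascii", "replace").split("\r\n"):
        if line.lower().startswith("spam:"):
            verdict, _, numbers = line[5:].partition(";")
            score, _, threshold = numbers.partition("/")
            return verdict.strip().lower() == "true", float(score), float(threshold)
    raise ValueError("no Spam: header in response")


def check(message: bytes, host="localhost", port=783):
    """Send one message to a running spamd over TCP and return its verdict."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_check_request(message))
        sock.shutdown(socket.SHUT_WR)  # signal end of request
        chunks = []
        while chunk := sock.recv(4096):
            chunks.append(chunk)
    return parse_check_response(b"".join(chunks))
```

Because the client only needs a TCP connection, requests can be spread across several spamd hosts, which is what makes the load-balancing mentioned on the slide possible.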

  16. Large-Scale Filtering For Your Network • Different from filtering for yourself • Many users get little spam • Should use conservative settings • Better to have users “opted out by default”: notify them that spam filtering is available, and ask whether they want it

  17. How Can Network Administrators Fight Spam? • Scan for Open Relays & Proxies on your network • Block proxy ports at the firewall • Audit web servers for “FormMail” or other insecure web-to-mail scripts • Spam traps reporting to network blacklists: Razor, DCC, Pyzor • Run SpamAssassin, or SpamAssassin Pro!

  18. How Do The Spammers Feel? • Already hurting, according to CBS: • “[I’ve gone through] unbelievable hardships [to keep spamming] ... My operating costs have gone up 1,000% this year, just so I can figure out how to get around all these filters” • Spam relies on low overheads and extremely cheap delivery • Disrupt the equation and they will give up!

  19. Future Directions • Learning filters (Bayesian probability etc.) • Learn automatically, to detect what "good" mail to your network looks like • "Hash-cash" • Sending mail is currently more-or-less free • With hash-cash, the sender must spend CPU time for each recipient • SpamAssassin can provide "bonus points" for hash-cash users
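
The hash-cash idea can be sketched as a proof-of-work puzzle: the sender searches for a counter that gives a hash with a required number of leading zero bits, and the recipient verifies it with a single hash. This is a simplified illustration and not the real hashcash stamp format. The `recipient:counter` layout and the 12-bit difficulty are assumptions.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(recipient: str, bits: int = 12) -> str:
    """Sender side: brute-force a counter until the hash has enough zero bits."""
    for counter in count():
        stamp = f"{recipient}:{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp: str, recipient: str, bits: int = 12) -> bool:
    """Recipient side: one hash confirms the stamp was minted for this address."""
    digest = hashlib.sha1(stamp.encode()).digest()
    return stamp.startswith(recipient + ":") and leading_zero_bits(digest) >= bits


stamp = mint("user@example.org")
print(stamp, verify(stamp, "user@example.org"))
```

Minting takes about 2^bits hash attempts on average while verifying takes one. That asymmetry is negligible for someone sending a handful of mails and expensive for someone sending millions. A filter could then lower the score of mail carrying a valid stamp, in the spirit of the "bonus points" bullet.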

  20. Fin • http://spamassassin.org/ • SpamAssassin for UNIX • (free software) • http://www.deersoft.com/ • SpamAssassin Pro: MS Outlook, Exchange • (commercial version) • (my employers!)
