1 / 32

Fighting Spam: Techniques on the Table

Fighting Spam: Techniques on the Table. Cynthia Dwork Microsoft Research SVC. Why?. Huge problem Industry: costs in worker attention, infrastructure Individuals: increased ISP fees Hotmail: huge storage costs, 65-85% FTC: fraud, confidence crimes Ruining e-mail, devaluing the Internet.

adelle
Download Presentation

Fighting Spam: Techniques on the Table

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Fighting Spam: Techniques on the Table Cynthia Dwork Microsoft Research SVC

  2. Why? • Huge problem • Industry: costs in worker attention, infrastructure • Individuals: increased ISP fees • Hotmail: huge storage costs, 65-85% • FTC: fraud, confidence crimes • Ruining e-mail, devaluing the Internet

  3. Desiderata • Reduce amount of spam seen by me • Don’t harm ordinary e-mail • Let me publish my e-mail address • Let me act autonomously

  4. Techniques on the Table • Filtering • Everyone: text-based • Brightmail: decoys; rules updates • Microsoft: (seeded) trainable filters [Sahami, Dumais, Heckerman, Horvitz'98] • SpamHaus, SpamCop, Osirusoft, … • IP addresses, ISPs, proxies, … • Puzzles • CPU cycles [Dwork-Naor'92] • Memory cycles [Burrows et al.] • Turing tests [Naor'96] • Pay recipient [Gates'96]

  5. Outline • Description and Discussion of • MSR’s filtering and three kinds of Puzzles • Two architectures • Odds and Ends

  6. Cost-Sensitive Filtering [SDDH’98] • Classification Scheme that • Assigns probabilities (cf. Spertus'97) • Differentiates among costs of errors • Feature space: words in message corpus, phrases, punctuation, domain type (“.edu”), etc. • Classes: Junk, Not-Junk; sex junk, etc. • Classifier: maps attribute vector to a distribution on classes

  7. Classifier Construction 500 – 2000 features of max mutual information Bayesian Classifier Naïve Bayesian Classifier

  8. Social and Technical Issues Social • Getting the user to train the filter Technical • Seed filter is language-specific • Less effective during training • Training requires many examples (?) • Spammers will adapt

  9. Computational Puzzles [DN’92] If I don't know you: Prove you spent 10s of computation time, just for me and just for this message • User Experience Everything works automatically; typical user experience is unchanged • Economics for Hotmail’s billion daily spams: 125,000 CPUs Up front capital cost: circa $150,000,000 • The spammers can’t afford it.

  10. NY Times 6/27/02 "Most of the spammers are not wealthy people," said Stephen Kline, a lawyer for the New York State attorney general's office.

  11. Cryptographic Puzzles • Hard to compute (CPU-intensive) • lots of work for the sender • Easy to check • little work for receiver • Parameterized to scale with Moore's Law • easy to exponentially increase computational cost, while barely increasing checking cost • Can be based on (carefully) weakened signatures, hash collisions

  12. Memory-Bound Puzzles [ABMW] • Slow CPUs are a lot slower than the fastest • Factor of 10 – 30 within desktops • Memory latencies vary little • factor of 3 • So: design a puzzle leading to a large number of cache misses • Equalizes actual computation time

  13. Candidate Based on “Random” f (simplified construction) • f: n bits to n bits • Puzzle Generation: • xk = f(k) (x0), extra piece p, for random x0 • Correct response: x0(p used to disambiguate) • Hope: puzzle best solvedbuilding table for f -1, working backwards from xk • Choose n so f -1 fits in small memory, but not in cache

  14. Computation: Social Issues • Trust • Who chooses f ? • Who writes the code? • Who sets the price?

  15. Computation: Technical Issues • Distribution Lists (!) • Awkward Introductory Period • Old versions of mail programs; bounces • Very Slow Machines • Can implement “post office,” but: Who gets to be the Post Office? • The Subverters

  16. Turing Tests [N’96] • CAPTCHAs (Completely Automated Public Turing test for telling Computers and Humans Apart) • Defeat automated account generation • 5-10% drop in subscription rate • teams of conjectured-humans (8-hour shifts) • Yes: Distorted images of simple word • No: ``Find 3 words in image with multiple overlapping images of words'' • Others: subject classification, audio • M. Blum: people have done preprocessing

  17. Social and Technical Issues • Social (especially in enterprise setting) • ADA, S.508 (blind, dyslexic) • Not ``professional'' • Productivity cost: context switch • Irritating. • Technical • No theory • If/when broken, these will revert to computational challenges, but with no ``hardness parameter'' • Idrive, AltaVista, broken [J]

  18. Point-to-Point Architecture (Ideal Message Flow) • permits send-and-forget • Can add post office to handle money payments m, f (m,S,R,t) Sender client S Recipient client R

  19. Here to There (and There’) • Three e-mail messages • R’s mail client caches m, h(m), S, R, t • Bounce • html attachment with Java Script for f • contains parameters for f (h(m), S, R, t) • clicking on link causes computation, sending e-mail • (optional) link for download of client software m Ignorant Sender S Spam-Protected Recipient R bounce release m

  20. Point-to-Point: Issues • Social • Unfriendly to very slow machines • Senders trust code (transition period)? • (Who gets to be a post office?) • Technical • No pre-computation possible • Function update requires new download • Sender’s browser must be configured for sending e-mail (transition period)

  21. Ticket Server [ABBW] (Ideal Message Flow) Ticket kit = (#, puzzle) Ticket = (#, response) • Any payment method • Tickets may be accumulated in advance (pre-computation) • Refunds (not shown) • Centralization eases updates TicketServer 3 Ticket OK? HTTP HTTP Recipient Server Get Ticket Kit 1 SMTP 2 MSG + Ticket Sender

  22. Social and Technical Issues • Social • Who gets to be a ticket server? Trust? • Federation? • Trust code (transition period) • Technical • Complex; 5 flows (7 with refund) • Target for subverters

  23. Cycle Stealing • Stealing cycles from humans: Pornography companies require random users to solve a CAPTCHA before seeing the next image [vABL] • Worse for computational challenges • There are lots of cycles, but anyone can buy them.

  24. Politics and Taste • Will users like this? • Transition period is awkward, senders experience some pain, recipients benefit • Establish standards • puzzle functions • mailing lists • How should the community proceed?

  25. Appendix Starts Here

  26. Medium Weight • Exploit browser power • Web server can be personal • Easily modified to facilitate changes to function f WebServer 4 SMTP 3 f(h(m)) Recipient Client Plug-In HTTP f m 1 Sender SMTP NDR 2

  27. MW Transitions 1. S sends m to R R’s plug-in checks for S on safelist, blocklist; acts appropriately If S on neither list: 2. Plug-in hashes m, caches m,h(m), sends NDR to S 3. S clicks appropriate link in NDR. URL communicates h(m) to W. New web page appears in S’s browser, containing applet; applet sends results to W. 4. W sends mail to R. Plug-in sees the mail from W and does the right thing. WebServer 4 SMTP 3 f(h(m)) Recipient Client Plug-In HTTP f m 1 Sender SMTP NDR 2

  28. Web-Based Mail (Hotmail) Ticket Server • Client downloads Active-X component or applet • Computation done by client • Protocol run by Hotmail • No code downloaded for money Alternatively • Hotmail purchases tickets for (paying?) customers, or • Hotmail promises not to send too much outgoing spam, receivers trust Hotmail 3 Ticket OK? HTTP HTTP Recipient Server (Exchange) Get Ticket Kit 1 SMTP 2 MSG + Ticket Sender: Hotmail Server Client Browser

  29. TS Transition Period • NDR contains multiple links. User clicks appropriate link. • Sender, having no ticket-handling code, invokes a script in web browser. • Probably need to roll out slowly. Examples: • Start with trivial computation (just click on link?) • NDRs on small fraction of messages • Opt-in in Hotmail • Active safelisting Ticket Server 5 “Deliver M” SMTP 4 3 HTTP Recipient Server (Exchange) Get Ticket Kit Send Ticket M 1 SMTP Sender (Outlook Express) NDR 2

  30. TS Transition Details • To handle senders that have no ticket-handling code: invoke a script in web browser • S sends message M • R keeps M, but doesn't deliver it • R returns an NDR (bounce) to S • S gets ticket kit from TS by clicking appropriate URL in NDR • S may download applet; not needed for money • S solves puzzle; sends ticket to TS • TS tells R to deliver M Ticket Server 5 “Deliver M” SMTP 4 3 HTTP Recipient Server (Exchange) Get Ticket Kit Send Ticket M 1 SMTP Sender (Outlook Express) NDR 2

More Related