Maximizing Communication for Spam Fighting

CS294, Lecture #19, Fall 2011. Communication-Avoiding Algorithms. www.cs.berkeley.edu/~odedsc/CS294 Maximizing Communication for Spam Fighting. Oded Schwartz. Based on: Cynthia Dwork, Andrew Goldberg, Moni Naor. On Memory-Bound Functions for Fighting Spam. Cynthia Dwork, Moni Naor, Hoeteck Wee. Pebbling and Proofs of Work. Many slides borrowed from: http://www.wisdom.weizmann.ac.il/~naor/PAPERS/spam.ppt

Motivation Spam in: Email, IM, Forums, SMS, … What is spam? (And who is a spammer?) From the Messaging Anti-Abuse Working Group (MAAWG) 2010 report: “The definition of spam can vary greatly from country to country and as used in local legislation” (e.g., opt-in vs. opt-out). “The percentage of email identified as abusive has oscillated (mid 2009 to end of 2010) between 88% and 91%.” California Business and Professions Code, 2007: US spam cost estimate (direct and indirect): $13 billion (productivity, manpower, software, …)

Techniques for (email) spam-fighting Email filtering: • White lists, black lists, greylisting • Text-based • Trainable filters • Human-assisted filters • … Filtering can be done by the receiver or by the server.

Techniques for (email) spam-fighting Making the sender pay $: The system must make sending spam obtrusively unproductive for the spammer, but should not prevent legitimate users from sending their messages. Micropayments: pay (a small amount) for each email you send.

Techniques for (email) spam-fighting Pay with human attention: reverse Turing test. Tasks that are: • Easy for humans • Hard for machines. Examples: • Gender recognition • Facial expression • Find body parts • Decide nudity • Naïve drawing understanding • Handwriting understanding • Filling in words • Disambiguation. CAPTCHA: “Completely Automated Public Turing test to tell Computers and Humans Apart”

Techniques for (email) spam-fighting Pay with computer time: Proof of Work (PoW). Computation: • Dwork, Naor 92 • Back 97 • Abadi, Burrows, Manasse, Wobber 03. Communication: • Dwork, Goldberg, Naor 03 • Dwork, Naor, Wee 05

Proof of Work If I don’t know you, prove you spent significant computational resources, just for me, and just for this message. And make it easy for me to verify.

Proof of Work Variants: Interactive: challenge-response protocols. “Would you like to talk with me?” “Hmmm. Only if you answer my riddle. Why is a raven like a writing-desk?” “What's the answer?” “I haven't the slightest idea.” Steps: A. Request; B. Challenge; A. Respond; B. Approve/reject. Is an interactive protocol good enough?

Proof of Work Variants: One-round protocols: solution-verification. The challenge is self-imposed before a solution is sought by the sender, and the receiver checks both the problem choice and the found solution. “Would you like to talk with me? Here is a token of my sincerity.” • automated for the user • non-interactive, single-pass • no need for third party or payment infrastructure. Steps: A. Compute; A. Solve; A. Send; B. Verify
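The one-round solve/verify flow can be sketched in Python as follows. This is a CPU-bound illustration in the spirit of hash-based schemes such as Back 97, not the memory-bound function discussed later in the lecture. The function names, the use of SHA-256, and the way (m,S,R,d) is encoded are assumptions made for this example.

```python
import hashlib

def _digest(m, S, R, d, k):
    # Bind the work to this message, sender, receiver, and date
    # (the separator-based encoding is an assumption of this sketch).
    data = "\x1f".join([m, S, R, d, str(k)]).encode("utf-8")
    return hashlib.sha256(data).digest()

def _last_bits_zero(digest, e):
    # True if the last e bits of the digest are all zero.
    return int.from_bytes(digest, "big") & ((1 << e) - 1) == 0

def solve(m, S, R, d, e):
    # Sender: try k = 1, 2, ... until success; about 2**e trials are expected.
    k = 1
    while not _last_bits_zero(_digest(m, S, R, d, k), e):
        k += 1
    return k

def verify(m, S, R, d, e, k):
    # Receiver: a single hash evaluation.
    return _last_bits_zero(_digest(m, S, R, d, k), e)
```

The sender attaches k to the message; the receiver recomputes one hash, so no interaction and no third party are needed.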

Choosing the function f Message m, Sender S, Receiver R, and date and time d • Hard to compute f(m,S,R,d): lots of work for the sender; cannot be amortized • Easy to check “z = f(m,S,R,d)”: little work for the receiver • Parameterized to scale with Moore's Law: easy to exponentially increase the computational cost, while barely increasing the checking cost. Example: computing a square root mod a prime vs. verifying it: x^2 = y mod P
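The square-root example can be made concrete with a small Python sketch. It assumes the special case P ≡ 3 (mod 4), where a root is y^((P+1)/4) mod P; the function names are chosen for this example, and the general case needs a different algorithm (for instance Tonelli-Shanks).

```python
def sqrt_mod_prime(y, p):
    # Assumption of this sketch: p is prime and p % 4 == 3.
    if p % 4 != 3:
        raise ValueError("this sketch only handles p % 4 == 3")
    # About log2(p) modular multiplications (square-and-multiply).
    x = pow(y, (p + 1) // 4, p)
    if (x * x) % p != y % p:
        raise ValueError("y is not a quadratic residue mod p")
    return x

def verify_sqrt(x, y, p):
    # A single modular multiplication.
    return (x * x) % p == y % p
```

Computing the root costs on the order of log P multiplications, while checking it costs one, which is the compute/verify gap the slide refers to.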

Worst case vs. average case Message m, Sender S, Receiver R, and date and time d • Hard to compute: f(m,S,R,d) • We want f to be hard on average: the cost cannot be amortized over the inputs • It is OK if some inputs are easy (a spammer can then send only a few emails cheaply) • It is not enough for f to be hard in the worst case but not on average. Otherwise, the spammer could not send everything she wants, but could still send many emails at a low average cost.

Basing hardness: resource Goal: design a proof of work function which requires a large number of flops / memory accesses / network accesses. CPU (flop count): • Back 97; Abadi, Burrows, Manasse, Wobber 03 • High variance, lower cost. Memory access: • Dwork, Naor 92; Abadi, Burrows, Manasse, Wobber 03; Dwork, Goldberg, Naor 03; Dwork, Naor, Wee 05 • Lower variance, higher cost. Network access: • Abliz, Znati 2009.

Basing hardness: resource Goal: design a proof of work function which requires a large number of flops / memory accesses / network accesses. We should assume that a spammer has better resources. The exponential gaps from annual hardware improvements do not matter, as long as the function has a scaling parameter e for the effort.

Hardness based on: Goal: design a proof of work function which requires a large number of flops / memory accesses / network accesses. Information-theoretic bound: • Example: “Where is Waldo?”
atpaajlssgijgnfrzkbvcwvjzbsubwsrderxfdybhrdrmsabmvrsyszcbgkvnhuppdponqawgrouhpycsstuklwfskbmbnvbsfhydoazsvhywsuhzqagwaldoanftqlbdloxhypfmovnbcmannlfytrvjsbwIgjhrdkeigjbmtibingojgnhxiwotphjcjsuqjdnfjtjns
where the truly random input is too large to fit in local/fast memory. Proved time/space separation. Complexity assumption: • Example: P ≠ NP. Cryptographic assumption: • Example: discrete log. A problem for which there is a fast verification scheme, but no sufficiently efficient algorithm is known: • Example: matrix multiplication vs. matrix verification
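As an illustration of the last example, the following Python sketch verifies a claimed matrix product with a randomized check. The slide does not name a method, so the choice here is an assumption: it is Freivalds' algorithm, which tests A·(B·r) = C·r for random 0/1 vectors r using O(n^2) operations per round, compared with the cost of recomputing the product. The function name and the restriction to square integer matrices given as lists of lists are also assumptions of this sketch.

```python
import random

def freivalds(A, B, C, rounds=20, rng=None):
    # Probabilistic check that A @ B == C for n x n integer matrices.
    # A wrong C is accepted with probability at most 2**-rounds.
    rng = rng or random.Random()
    n = len(A)
    for _ in range(rounds):
        r = [rng.randint(0, 1) for _ in range(n)]
        # Three matrix-vector products, O(n^2) each.
        Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
        ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
        Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
        if ABr != Cr:
            return False  # definitely not the product
    return True  # the product, except with small probability
```

A correct product always passes; an incorrect one is rejected in each round with probability at least 1/2.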

Memory-bound model. USER: • CACHE: small but fast • MAIN MEMORY: large but slow. SPAMMER: • CACHE: cache size at most ½ user’s main memory • MAIN MEMORY: may be very, very large • may exploit locality

Memory-bound model. USER: • CACHE: small but fast • MAIN MEMORY: large but slow. SPAMMER: • CACHE: cache size at most ½ user’s main memory • MAIN MEMORY: may be very, very large. Cost model: • charge accesses to main memory (must avoid exploitation of locality) • computation is free, except for hash function calls (watch out for low-space crypto attacks)

Example of PoW: General scheme Consider a huge graph G, so large that a vertex name barely fits into the fast memory. The graph is implicit, i.e., its edges are defined by functions on the vertex names. The value of f is a path p in G with a certain (rare) property that depends on (m,S,R,d). Searching for p should be hard: its running time should depend on the number of paths in G. Verifying that a given path p has the property should be easy: its running time should depend only on the length of the path |p|.

Successful Path [Figure: a collection P of paths, each of length L, with one successful path highlighted. The collection depends on (m,S,R,d).]

Abstracted Algorithm Sender and Receiver share a large random table T. To send message m from Sender S to Receiver R at date/time d, repeat a trial for k = 1,2, … until success. The current state is specified by an auxiliary table A. The thread is defined by (m,S,R,d,k).
• Initialization: A = H0(m,S,R,d,k)
• Main loop: walk for L steps (L = path length): c = H1(A); A = H2(A,T[c])
• Success: if the last e bits of H3(A) are 00…0
Attach to (m,S,R,d) the successful trial number k and H3(A). Verification: straightforward given (m,S,R,d,k,H3(A))
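The trial loop above can be sketched in Python as follows. This is a toy illustration under several assumptions that are not in the slides: H0 to H3 are modeled by tagged SHA-256, the table T is generated from a seeded pseudorandom generator, and the state A is a 32-byte digest rather than a larger auxiliary table. All function names are hypothetical.

```python
import hashlib
import random

def H(tag, *parts):
    # Stand-in for the idealized hash functions H0..H3 (assumption: tagged SHA-256).
    h = hashlib.sha256(tag)
    for part in parts:
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()

def make_table(n_entries, seed=0):
    # Toy shared table T; a real T must be larger than the spammer's cache.
    rng = random.Random(seed)
    return [rng.getrandbits(256).to_bytes(32, "big") for _ in range(n_entries)]

def walk(T, m, S, R, d, k, L):
    A = H(b"H0", m, S, R, d, k.to_bytes(8, "big"))         # initialization
    for _ in range(L):                                      # main loop
        c = int.from_bytes(H(b"H1", A), "big") % len(T)     # c = H1(A)
        A = H(b"H2", A, T[c])                               # A = H2(A, T[c])
    return H(b"H3", A)

def last_bits_zero(digest, e):
    return int.from_bytes(digest, "big") & ((1 << e) - 1) == 0

def send(T, m, S, R, d, L, e):
    # Repeat trials k = 1, 2, ... until the last e bits of H3(A) are zero.
    k = 1
    while True:
        tag = walk(T, m, S, R, d, k, L)
        if last_bits_zero(tag, e):
            return k, tag
        k += 1

def verify(T, m, S, R, d, L, e, k, tag):
    # The receiver replays the single successful walk.
    return last_bits_zero(tag, e) and walk(T, m, S, R, d, k, L) == tag
```

The sender performs about 2^e walks in expectation, each touching L table entries; the receiver performs exactly one walk.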

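Animated Algorithm: a Single Step in the Loop C = H1(A), then A = H2(A,T[C]). [Figure: H1 maps the state A to an index C into the table T; H2 combines A with the entry T[C] to produce the new A.]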

Full Specification E = (expected) factor by which computation cost exceeds verification = expected number of trials = 2^e, if H3 behaves as a random function. L = length of walk. Want, say, E·L·t = 10 seconds, where t = memory latency = 0.2 μsec. Reasonable choices: E = 24,000, L = 2048. Also need: how large is A? A should not be very small…
Abstract algorithm:
• Initialize: A = H0(m,S,R,d,k)
• Main loop: walk for L steps: c ← H1(A); A ← H2(A,T[c])
• Success if the last e = log E bits of H3(A) are all 0
• Trial repeated for k = 1,2, …
• Proof = (m,S,R,d,k,H3(A))
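The suggested parameters can be checked with a few lines of Python. The sketch takes the memory latency to be t = 0.2 microseconds, an assumption chosen because it is the value for which E = 24,000 and L = 2048 give roughly the 10-second target; the function name is hypothetical.

```python
import math

def pow_costs(E, L, t):
    # Expected sender time: E trials, each a walk of L memory accesses of latency t.
    sender = E * L * t
    # The verifier replays only the single successful walk of L accesses.
    verifier = L * t
    return sender, verifier

E, L, t = 24_000, 2048, 0.2e-6   # t in seconds (assumed 0.2 microseconds)
sender, verifier = pow_costs(E, L, t)
e = math.log2(E)                 # zero bits required when E = 2**e
print(f"sender ~ {sender:.2f} s, verifier ~ {verifier * 1e3:.2f} ms, e ~ {e:.1f}")
```

With these values the sender spends about 9.8 seconds in memory accesses, while the receiver spends well under a millisecond.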

Path-following approach [Dwork-Goldberg-Naor, Crypto 03] [Theorem] Fix any spammer • whose cache size is smaller than |T|/2 • assuming T is truly random • assuming H0,…,H3 are idealized hash functions. Then the amortized number of memory accesses per successful message is Ω(2^e L). [Remarks] • the lower bound holds for a spammer maximizing throughput across any collection of messages and recipients • idealized hash functions are modeled using random oracles • the proof relies on the information-theoretic unpredictability of T

Using a succinct table [DNW 05] GOAL use a table T with a succinct description • easy distribution of software (new users) • fast updates (over slow connections) PROBLEM lose information theoretic unpredictability • spammer can exploit succinct description to avoid memory accesses IDEA generate T using a memory-bound process • Use time-space trade-offs for pebbling • Studied extensively in 1970s User builds the table T once and for all

Choosing the H’s A “theoretical” approach: idealized random functions • Provide a formal analysis showing that the amortized number of memory accesses is high. A concrete approach inspired by the RC4 stream cipher • Very efficient: a few cycles per step • There is no time inside the inner loop to compute a complex function • A is not small and changes gradually

RC4 - Rivest Cipher 4 • Generates a pseudorandom stream of bits • Used in: Secure Sockets Layer (SSL); Wired Equivalent Privacy (WEP); Wi-Fi Protected Access (WPA); Secure Shell (SSH); many other applications… • Biased outputs allow several types of attacks, e.g., recovering the 104-bit RC4 key used in 128-bit WEP in under a minute
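For reference, here is a minimal Python sketch of the standard RC4 key scheduling and output generation. The function names are chosen for this example and it is not the hash construction from the papers. It shows the property the concrete approach borrows: a 256-byte state in which each step swaps only two entries, so the state is not small and changes gradually. RC4 is shown for illustration only and should not be used for new encryption.

```python
def rc4_keystream(key, n):
    # Key scheduling: permute the state S = [0..255] under the key.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Output generation: each step swaps two entries and emits one byte.
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) % 256])
    return bytes(out)

def rc4_crypt(key, data):
    # Encryption and decryption are the same XOR with the keystream.
    stream = rc4_keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, stream))
```

Applying rc4_crypt twice with the same key returns the original data.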

Pebbling a graph GIVEN a directed acyclic graph. RULES: • inputs: a pebble can be placed on an input node at any time • a pebble can be placed on any non-input vertex if all immediate parent nodes have pebbles • pebbles may be removed at any time. GOAL find a strategy to pebble all the outputs while using few pebbles and few moves. [Figure: a DAG with INPUT nodes on one side and OUTPUT nodes on the other.]
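The rules of the game can be captured in a small Python checker that replays a strategy and reports its cost. The representation (a dict from each node to its list of parents, and moves as ("place", v) or ("remove", v) pairs) and the function name are assumptions of this sketch. It also assumes that an output counts as pebbled once it has held a pebble at some point.

```python
def run_pebbling(parents, outputs, moves):
    # Replay a strategy; return (max pebbles in use, number of placements).
    pebbled = set()
    outputs_done = set()
    max_pebbles = 0
    placements = 0
    for action, v in moves:
        if action == "place":
            # Inputs have no parents, so they can be pebbled at any time.
            if not all(p in pebbled for p in parents[v]):
                raise ValueError(f"illegal move: parents of {v!r} are not all pebbled")
            pebbled.add(v)
            placements += 1
            max_pebbles = max(max_pebbles, len(pebbled))
            if v in outputs:
                outputs_done.add(v)
        elif action == "remove":
            pebbled.discard(v)
        else:
            raise ValueError(f"unknown action {action!r}")
    if outputs_done != set(outputs):
        raise ValueError("strategy does not pebble all outputs")
    return max_pebbles, placements

# A binary tree with inputs a, b, c, d and a single output z.
parents = {"a": [], "b": [], "c": [], "d": [],
           "x": ["a", "b"], "y": ["c", "d"], "z": ["x", "y"]}
strategy = [("place", "a"), ("place", "b"), ("place", "x"),
            ("remove", "a"), ("remove", "b"),
            ("place", "c"), ("place", "d"), ("place", "y"),
            ("remove", "c"), ("remove", "d"), ("place", "z")]
cost = run_pebbling(parents, ["z"], strategy)
```

The two numbers returned are the quantities the game trades off: pebbles correspond to space and placements to time.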

Succinctly generating T GIVEN a directed acyclic graph • constant in-degree • input node i labeled H4(i) • non-input node i labeled H4(i, labels of parent nodes) • entries of T = labels of output nodes. OBSERVATION good pebbling strategy ⇒ good spammer strategy. [Figure: a DAG from INPUT to OUTPUT in which node i with parents j and k gets the label Li = H4(i, Lj, Lk).]
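The labeling rule can be sketched in Python as follows. The sketch assumes that H4 is modeled by SHA-256, that nodes are integers numbered in topological order (every parent has a smaller index than its child), and that the graph is given as a dict from node to list of parents; these choices and the function names are not from the slides.

```python
import hashlib

def H4(i, *parent_labels):
    # Stand-in for the idealized hash H4 (assumption: SHA-256 over index and labels).
    h = hashlib.sha256(b"H4" + i.to_bytes(8, "big"))
    for label in parent_labels:
        h.update(label)
    return h.digest()

def build_table(parents, outputs):
    # parents: dict node -> list of parent nodes; nodes are assumed to be
    # numbered so that every parent index is smaller than its child's index.
    labels = {}
    for i in sorted(parents):
        labels[i] = H4(i, *(labels[j] for j in parents[i]))
    # The entries of T are the labels of the output nodes.
    return [labels[i] for i in outputs]
```

This version keeps every label in memory, which corresponds to a pebbling that never removes a pebble; computing the same table with less memory means recomputing labels, which is the time-space trade-off the construction relies on.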

Converting spammer strategy to a pebbling EX POST FACTO PEBBLING computed by offline inspection of the spammer strategy • PLACING A PEBBLE: place a pebble on node i if H4 is used to compute Li = H4(i, Lj, Lk), and Lj, Lk are the correct labels • INITIAL PEBBLES: place an initial pebble on node j if H4 is applied with Lj as an argument, and Lj was not computed via H4 • REMOVING A PEBBLE: remove a pebble as soon as it is not needed anymore. Computing a label using the hash function: lower bound on # moves ⇒ lower bound on # hash function calls. Using cache + memory fetches: lower bound on # pebbles ⇒ lower bound on # memory accesses. IDEA limit the # of pebbles used by the spammer as a function of its cache size and the # of bits it brings from memory

Succinctly generating T Need a graph that is hard to pebble on average, i.e., every large set of outputs requires many initial pebbles (corresponding to reading from fast/slow memory). Example: superconcentrators. [Figure: a graph from INPUT to OUTPUT.]

Open problems WEAKER ASSUMPTIONS • Unconditional result? • Use the red-blue pebbling model? • Use the edge expansion approach? • Other no-amortization proofs, for solving many instances in parallel/serial?

Expansion (3rd approach) [Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81] Let G = (V,E) be a d-regular graph. A is the normalized adjacency matrix, with eigenvalues: 1 = λ1 ≥ λ2 ≥ … ≥ λn. λ ≡ 1 - max{λ2, |λn|}. Thm: [Alon-Milman 84, Dodziuk 84, Alon 86] For small sets:

Expansion (3rd approach) The computation directed acyclic graph. V: vertices (input / output, intermediate value); edges: dependency. S: subset of computation; RS: reads; WS: writes. Communication cost is graph expansion.
