
Database Techniques for fighting SPAM



Presentation Transcript


  1. Database Techniques for fighting SPAM Telvis Calhoun CSc 8710 – Advanced Databases Dr. Yingshu Li

  2. Everybody knows about SPAM • Spam is unsolicited bulk email sent for profit and general mayhem. • BOTNETs = Distributed Network of hijacked IPs. • IPs hard to track • 70 billion emails sent per day. 70% spam

  3. How Does Anti-SPAM Use DBs? • Spam databases collect network layer and application layer data. • IP Blacklisting • Detect a malicious host during the SMTP dialog. • Hard to keep accurate: DHCP reassigns IP addresses, botnets are large, and good IPs may be used to forward spam. • Content Analysis • Detect malicious mail content. • Requires that the MTA complete the SMTP connection. • Arms race between content filter designers and spammers.

  4. Summary of DB Techniques • Grey Space Analysis • Trinity: Peer-to-Peer Database • Behavioral Blacklisting • Progressive Email Scanning • Content filtering using Bayesian Analysis

  5. Grey Space Analysis • Characterize IP Space: Active vs. Grey Space • IP Flow Database • Detect malicious IPs by extracting dominant scanning ports (DSPs) • Find DSPs using relative uncertainty algorithm

  6. Mining Technique: Relative Uncertainty • Determines entropy of IP ports in flows database. • Formula := Entropy of dstPrt distribution ÷ maximum entropy. • p := number of flows with port[i] ÷ total flows • RU close to 1 shows ~even distribution, near 0 shows uneven distribution
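A minimal sketch of the relative uncertainty calculation described on this slide. The function name is hypothetical, and the maximum entropy is assumed here to be log2 of the number of distinct destination ports observed; the slide does not state which normalizer is used.

```python
import math
from collections import Counter


def relative_uncertainty(dst_ports):
    """Return entropy of the port distribution divided by the maximum entropy.

    Assumption: maximum entropy = log2(number of distinct observed ports).
    """
    counts = Counter(dst_ports)
    total = sum(counts.values())
    if len(counts) < 2:
        # Zero or one distinct port: the distribution carries no uncertainty.
        return 0.0
    # p = number of flows with port[i] / total flows
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))


even = relative_uncertainty([80, 443, 25, 22])
skewed = relative_uncertainty([25] * 97 + [80, 443, 22])
print(even, skewed)
```

The evenly spread ports give an RU of 1, while the flow set dominated by port 25 gives an RU close to 0, which is the signal used to pick out a dominant port.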

  7. Grey Space Algorithm • Isolate flows toward grey space • Find dominant scanning ports (DSPs) • Find outside hosts with DSP flows toward grey and active hosts. • Find the inside host footprint for each outside host. • Classify each adversary as a hitter or a scanner.

  8. Focused Hitters vs Bad Scanners • Focused hitters tend to send tens or hundreds of flows to each grey host. • Bad scanners send one or a few flows to each grey host
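The hitter/scanner distinction can be sketched as a count of flows per grey host. The function name and the cut-off of 10 flows per grey host are assumptions for illustration; the slide only says "tens or hundreds" versus "one or a few".

```python
from collections import defaultdict

HITTER_MIN_FLOWS_PER_HOST = 10  # assumed cut-off, not taken from the slides


def classify_outside_hosts(flows, threshold=HITTER_MIN_FLOWS_PER_HOST):
    """flows: iterable of (outside_src_ip, grey_dst_ip) pairs.

    Returns {outside_src_ip: "focused hitter" | "bad scanner"} based on the
    mean number of flows the source sent to each grey host it touched.
    """
    per_pair = defaultdict(int)
    for src, grey_dst in flows:
        per_pair[(src, grey_dst)] += 1

    per_src = defaultdict(list)
    for (src, _dst), count in per_pair.items():
        per_src[src].append(count)

    labels = {}
    for src, counts in per_src.items():
        mean_flows = sum(counts) / len(counts)
        labels[src] = "focused hitter" if mean_flows >= threshold else "bad scanner"
    return labels


flows = [("198.51.100.1", "g1")] * 40 + [("198.51.100.1", "g2")] * 30
flows += [("203.0.113.9", f"g{i}") for i in range(50)]
print(classify_outside_hosts(flows))
```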

  9. Trinity: Distributed IP Reputation Database • Botnets send a large amount of data in a short amount of time. • Trinity uses a distributed in-memory hash table containing IP reputation entries. • Each peer has 10 to 50 megabytes of data (833K – 4.17M entries)

  10. Chord Distributed Hash Table • Distribute data over a large P2P network • Quickly find any given item • Stores key/value pairs • The key value controls which node(s) stores the value • Each node is responsible for some section of the space • Basic operations • Store(key; val) • val = Retrieve(key)

  11. Chord (cont) • Each node chooses an n-bit ID • IDs are arranged in a ring • Each lookup key is also an n-bit ID • i.e., the hash of the real lookup key • Node IDs and keys occupy the same space! • Each node is responsible for storing keys "near" its ID • Replication is usually between the current and previous node • Items can be replicated at multiple successors • No single host contains a large fraction of a particular space, to guard against DDoS.
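The ring placement described on the two Chord slides can be sketched as follows. This is a simplified, single-process illustration: the `Ring` class, peer names, and the 32-bit ID width are assumptions, SHA-1 is used only as a convenient hash, and lookups use a sorted list rather than Chord's finger tables.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 32  # assumed ID width for the sketch


def chord_id(name, bits=ID_BITS):
    """Hash a node name or lookup key into the shared n-bit ID space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class Ring:
    def __init__(self, node_names):
        # Nodes sorted by ID form the ring; the last node wraps to the first.
        self.nodes = sorted((chord_id(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.nodes]

    def _successor_index(self, key):
        # First node whose ID is equal to or follows the key's ID.
        return bisect_left(self.ids, chord_id(key)) % len(self.ids)

    def owner(self, key):
        return self.nodes[self._successor_index(key)][1]

    def replicas(self, key, k=2):
        # The owner plus the next successors on the ring.
        start = self._successor_index(key)
        count = min(k, len(self.nodes))
        return [self.nodes[(start + j) % len(self.nodes)][1] for j in range(count)]


ring = Ring(["peer-a", "peer-b", "peer-c", "peer-d"])
print(ring.owner("192.0.2.7"), ring.replicas("192.0.2.7", k=2))
```

Because node IDs and key IDs share one space, adding or removing a peer only moves the keys between that peer and its neighbor on the ring.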

  12. Database Updates • Compute the number of interval quarters since the last update. Shift and update counters accordingly. • Determine the site responsible for the entry and send the update over UDP. Once received, the owner site forwards the entry to k peers using TCP. • Updates are commutative, so order doesn't matter. Consistency is not required. • Even if a host goes down, the database can be rebuilt in an hour.
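One way to read the "shift and update counters" step is a small sliding window of per-quarter counters. The sketch below is an interpretation, not Trinity's actual data layout: the class name, the four-slot window, and the 150-second quarter length are all assumptions.

```python
QUARTERS = 4           # assumed: one interval is tracked as four quarter slots
QUARTER_SECONDS = 150  # assumed quarter length; not given on the slide


class ReputationEntry:
    """Per-IP email counters; counters[0] is the current quarter."""

    def __init__(self, now):
        self.counters = [0] * QUARTERS
        self.last_quarter = int(now // QUARTER_SECONDS)

    def record(self, now, count=1):
        quarter = int(now // QUARTER_SECONDS)
        elapsed = quarter - self.last_quarter
        if elapsed > 0:
            # Shift old counts toward the end, dropping those that fall off.
            shift = min(elapsed, QUARTERS)
            self.counters = [0] * shift + self.counters[:QUARTERS - shift]
            self.last_quarter = quarter
        # Additions commute: within a quarter, arrival order does not matter.
        self.counters[0] += count

    def total(self):
        return sum(self.counters)


entry = ReputationEntry(now=0)
entry.record(now=10)
entry.record(now=20)
entry.record(now=160)
print(entry.counters, entry.total())
```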

  13. Security • Secure communications for neighbors • Limit updates for nodes that have sent more than 100 emails in 10 minutes. • Falsified source IPs can cause false positives.

  14. Clustering Technique for Behavioral Blacklisting • Identify spammers that attack many domains. • Domain distribution and frequency is the sending pattern • Form clusters of sending patterns • Use clusters to ID new attack

  15. Spectral Clustering • Divide Phase – produces a tree whose leaves are elements of the set. • Merge Phase – Start with each leaf in its own cluster and merge going up the tree.

  16. Vector Generation • Database contains: M(i,j,k) • Total times that IP 'i' sent email to domain 'j' in time slot 'k'. • Find total flows for each IP/domain pair across the entire time axis (M'). • Generate feature vector from M' • IP := <# flows to domain 1, # flows to domain 2, … # flows to domain j>
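A minimal sketch of collapsing M(i,j,k) over the time axis and emitting one feature vector per IP. The sparse record format and the function name are assumptions for illustration.

```python
from collections import defaultdict


def feature_vectors(records, domains):
    """records: iterable of (ip, domain, time_slot, count), a sparse M(i,j,k).

    Returns {ip: [flows to domains[0], flows to domains[1], ...]}, i.e. the
    rows of M' after summing over every time slot.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for ip, domain, _time_slot, count in records:
        totals[ip][domain] += count  # sum across the time axis
    return {
        ip: [per_domain.get(d, 0) for d in domains]
        for ip, per_domain in totals.items()
    }


records = [
    ("192.0.2.1", "a.example", 0, 3),
    ("192.0.2.1", "a.example", 1, 2),
    ("192.0.2.1", "c.example", 1, 7),
    ("192.0.2.2", "b.example", 0, 1),
]
print(feature_vectors(records, ["a.example", "b.example", "c.example"]))
```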

  17. Clustering • Clusters contain IP addresses that send mail to similar sets of domains. • Define the traffic pattern for each cluster by averaging the rows (vector contents) for all IPs in the cluster. • [Figure: IP x IP matrix of related spam senders]

  18. Classification • Input IP vector 'r' := 1 x d vector • Use a similarity algorithm to find the closest cluster • Spam score is the maximum similarity of r with any cluster.
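The scoring step can be sketched as below. The slides do not name the similarity measure, so cosine similarity is an assumption here, as are the function names; cluster patterns are computed by averaging member vectors as described on the Clustering slide.

```python
import math


def cluster_pattern(member_vectors):
    """Average the rows (one vector per IP) of a cluster, column by column."""
    n = len(member_vectors)
    return [sum(column) / n for column in zip(*member_vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def spam_score(r, patterns):
    """Maximum similarity of the 1 x d vector r with any cluster pattern."""
    return max((cosine_similarity(r, p) for p in patterns), default=0.0)


patterns = [
    cluster_pattern([[10, 0, 0], [8, 2, 0]]),
    cluster_pattern([[0, 0, 5], [0, 1, 5]]),
]
print(spam_score([5, 0, 0], patterns))
```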

  19. Progressive Email Scanner • Maintains a Feature Instance (FI) database • An FI is any feature that can discriminate HAM from SPAM. • Dynamic Features - Use any feature that IDs mail (contents, network, etc.) • Paper only uses URL links as FIs

  20. PEC Architecture • FI States • Grey (Ambiguous FI) • Black (Spam FI) • White (HAM FI) • Blacklist Module – Extracts and hashes FIs • Scoreboard Module – Tracks FI occurrences and timestamp (age)

  21. Competitive Aging and Scoring System (CASS) • Transition between states governed by • Score – number of occurrences of the FI • Age – time since last score update. • A score (R) that exceeds the score threshold (S) causes a Grey to Black transition. • An age (A) that exceeds the age threshold (M) triggers a Grey to White transition. • Purge
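The Grey/Black/White transitions can be sketched as a small state machine. The `Scoreboard` class below is hypothetical: it assumes Black and White states are sticky once reached and leaves out the purge step, which the slide does not detail.

```python
class Scoreboard:
    """Tracks feature instances (FIs) and applies the two CASS thresholds."""

    def __init__(self, score_threshold, age_threshold):
        self.score_threshold = score_threshold  # S
        self.age_threshold = age_threshold      # M
        self.grey = {}    # fi -> (score R, time of last score update)
        self.black = set()
        self.white = set()

    def expire(self, now):
        # Age A = time since last score update; A > M moves Grey to White.
        for fi, (_score, updated) in list(self.grey.items()):
            if now - updated > self.age_threshold:
                del self.grey[fi]
                self.white.add(fi)

    def observe(self, fi, now):
        """Record one occurrence of fi and return its state afterwards."""
        self.expire(now)
        if fi in self.black:
            return "black"
        if fi in self.white:
            return "white"
        score, _ = self.grey.get(fi, (0, now))
        score += 1
        if score > self.score_threshold:  # R > S moves Grey to Black
            self.grey.pop(fi, None)
            self.black.add(fi)
            return "black"
        self.grey[fi] = (score, now)
        return "grey"


board = Scoreboard(score_threshold=2, age_threshold=60)
states = [board.observe("http://spam.example/x", t) for t in (0, 10, 20)]
print(states)
```

The two thresholds compete: an FI that recurs quickly reaches the score threshold first and turns Black, while one that stops recurring ages out to White.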

  22. Bayesian Content Filtering • Determine the probability that a message is spam based on contents • Use Bayesian combination of spam probabilities

  23. Bayesian Training • Requires training corpus of HAM/SPAM • Find interesting tokens. • Create HAM/SPAM token tables
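A simplified sketch of building the HAM/SPAM token tables and a per-token spam probability, loosely following Graham's "A Plan for Spam". The tokenizer, function names, and tiny corpus are assumptions; the 0.01/0.99 clamps and the 0.4 default for unseen tokens follow Graham's essay, while his extra weighting of HAM counts is left out.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits, $, ' and -.
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(ham_messages, spam_messages):
    """Count, per token, how many HAM and SPAM messages contain it."""
    ham = Counter(t for m in ham_messages for t in set(tokenize(m)))
    spam = Counter(t for m in spam_messages for t in set(tokenize(m)))
    return ham, spam


def token_spam_probability(token, ham, spam, n_ham, n_spam):
    ham_rate = ham[token] / n_ham
    spam_rate = spam[token] / n_spam
    if ham_rate + spam_rate == 0:
        return 0.4  # unseen token: mildly innocent default
    return min(0.99, max(0.01, spam_rate / (ham_rate + spam_rate)))


ham_corpus = ["meeting at noon", "lunch at noon"]
spam_corpus = ["cheap pills now", "cheap watches now"]
ham_table, spam_table = train(ham_corpus, spam_corpus)
for word in ("cheap", "noon", "prescription"):
    print(word, token_spam_probability(word, ham_table, spam_table, 2, 2))
```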

  24. Classification • Tokenize new message • Calculate spam probability for each token • Derive overall spam probability using Bayes formula. • Non-spam tokens outweigh spam tokens to prevent false positives • Sample Message: "Hi, Just a reminder: don't forget your allergy prescription when you visit New York City today. Mom" • [Figure: Spam Probability Table] • Sample Message = 0.0
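The Bayesian combination step can be sketched as follows, using the naive Bayes formula from Graham's "A Plan for Spam" over the most "interesting" tokens (those farthest from 0.5). The function name and the example probabilities are made up for illustration, and per-token probabilities are assumed to be clamped away from 0 and 1.

```python
import math


def combined_spam_probability(token_probs, n_interesting=15):
    """Combine per-token spam probabilities into one message-level score."""
    # Keep the tokens whose probability is farthest from the neutral 0.5.
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    interesting = interesting[:n_interesting]
    spam_product = math.prod(interesting)
    ham_product = math.prod(1 - p for p in interesting)
    return spam_product / (spam_product + ham_product)


# Hypothetical token probabilities: two strongly innocent tokens outweigh
# one strongly spammy token.
print(combined_spam_probability([0.01, 0.01, 0.99]))
```

With no tokens at all the function returns 0.5, which reads as "no evidence either way".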

  25. Real World Applications Messaging Security Architecture TrustedSource.org

  26. Summary • A variety of database techniques are used in Anti-Spam Technology • IP Blacklisting • Content filtering • Databases can contain: • Network traffic: IP Addresses, Domain, Ports • Message Content: Words, URLs, HTML Text • Challenges: • Scalability – Must handle many connections or messages • Minimize False Positive Rates – Cannot classify a HAM message as SPAM. • Finding useful SPAM features. Using machine learning techniques.

  27. References • Brodsky, et al, A Distributed Content Independent Method for Spam Detection, HotBots 2007 • Jin, et al, Identifying and Tracking Suspicious Activities through IP Gray Space Analysis, MineNet 2007 • Liu, et al, High-Speed Detection of Unsolicited Bulk Emails, ANCS 2007 • Ramachandran, A., Filtering Spam with Behavioral Blacklisting, CCS 2007 • Cheng, et al., A Divide-and-Merge Methodology for Clustering, ACM Transactions on Database Systems, 2006 • Graham P., A Plan for Spam, www.paulgraham.com/spam.html, 2002 • Secure Computing Corporation, http://trustedsource.org, 2008
