
SPAM FILTERING



  1. SPAM FILTERING By Ankur Khator 01005028 Gaurav Sharma 01005029 Arpit Mathur 01D05014

  2. What is Spam Email? • “junk email” or “unsolicited commercial email”. • Spam filtering - a special case of email classification. • Only 2 classes – Spam and Non-spam.

  3. Various Approaches • Bayesian Learning – a probabilistic model for spam filtering using a bag-of-words representation • Ripper algorithm – context-sensitive learning • Boosting algorithm – improving accuracy by combining weaker hypotheses

  4. Term Vectors
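The term-vector figure from this slide is not reproduced in the transcript. As a rough illustration of the bag-of-words representation listed under the Bayesian approach, the sketch below builds binary term vectors over a small vocabulary; the example messages and helper names are invented for illustration.

```python
# Minimal sketch of a binary bag-of-words ("term vector") representation.
# The example messages and vocabulary below are invented for illustration.

def build_vocabulary(messages):
    """Collect every distinct lower-cased token seen in the corpus."""
    return sorted({word for text in messages for word in text.lower().split()})

def term_vector(text, vocab):
    """Xi = 1 if word i occurs at least once in the message, 0 otherwise."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

messages = ["buy cheap meds now", "meeting agenda for monday"]
vocab = build_vocabulary(messages)
print(vocab)
print(term_vector("buy now", vocab))  # 1s only at the positions of "buy" and "now"
```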

  5. Naive Bayes for Spam • Seeking a model for P(Y=1 | X1=x1, X2=x2, .., Xd=xd) • From Bayes' theorem: P(Y=1 | X1=x1, .., Xd=xd) = P(Y=1) · P(X1=x1, .., Xd=xd | Y=1) / P(X1=x1, .., Xd=xd) and P(Y=0 | X1=x1, .., Xd=xd) = P(Y=0) · P(X1=x1, .., Xd=xd | Y=0) / P(X1=x1, .., Xd=xd)
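As a numeric illustration of the two posteriors above, the sketch below plugs placeholder values into Bayes' theorem; the prior and the class-conditional likelihoods are assumed numbers, not figures from the presentation.

```python
# Sketch: posteriors for one message x via Bayes' theorem.
# All numbers below are placeholders chosen for illustration.

p_spam = 0.4                    # P(Y=1), prior probability of spam
p_ham = 1 - p_spam              # P(Y=0)
lik_spam = 2e-6                 # P(X1=x1,..,Xd=xd | Y=1), treated as known here
lik_ham = 5e-7                  # P(X1=x1,..,Xd=xd | Y=0)

evidence = p_spam * lik_spam + p_ham * lik_ham    # P(X1=x1,..,Xd=xd)
posterior_spam = p_spam * lik_spam / evidence     # P(Y=1 | X1=x1,..,Xd=xd)
posterior_ham = p_ham * lik_ham / evidence        # P(Y=0 | X1=x1,..,Xd=xd)
print(posterior_spam, posterior_ham)              # the two posteriors sum to 1
```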

  6. Justification for using Bayes' theorem • Sparseness of data makes the posterior hard to estimate directly • P(B | A) can be estimated easily and accurately compared with estimating P(A | B) directly

  7. Naive Bayes for Spam (contd.) • Assume conditional independence: P(X1=x1, .., Xd=xd | Y=k) = ∏i P(Xi=xi | Y=k) • Also assume binary features: Xi = 1 if word i occurs at least once in the message, Xi = 0 otherwise

  8. Naive Bayes for Spam (contd.) • The per-word log-ratio terms log [P(Xi=xi | Y=1) / P(Xi=xi | Y=0)] are referred to as weights of evidence • There is an inconsistency when some estimated probability is zero • Smooth the estimates by adding a small positive constant to both the numerator and the denominator of each probability estimate
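A minimal sketch of the smoothed estimates and per-word weights of evidence described above; the smoothing constant of 1 and the toy word counts are assumptions for illustration, not values from the slides.

```python
import math

def smoothed_prob(count_with_word, count_class, alpha=1.0):
    """P(Xi=1 | Y=k) with a small positive constant added to numerator and denominator."""
    return (count_with_word + alpha) / (count_class + 2 * alpha)

def weight_of_evidence(counts, n_spam, n_ham, word):
    """log of P(Xi=1 | spam) / P(Xi=1 | ham); positive values count towards spam."""
    spam_count, ham_count = counts.get(word, (0, 0))
    return math.log(smoothed_prob(spam_count, n_spam) / smoothed_prob(ham_count, n_ham))

# word -> (# spam messages containing it, # legitimate messages containing it)
counts = {"free": (40, 5), "meeting": (2, 30)}
print(weight_of_evidence(counts, n_spam=50, n_ham=50, word="free"))     # > 0, spam-like
print(weight_of_evidence(counts, n_spam=50, n_ham=50, word="meeting"))  # < 0, ham-like
print(weight_of_evidence(counts, n_spam=50, n_ham=50, word="rabbit"))   # 0, never seen
```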

  9. Classifying • Assume a new mail with the text “The quick rabbit rests” • Summing the weights of evidence: 0.51 + 0.51 + 0 + 0.51 + 1.10 + 0 = 2.63 • Probability of spam = 0.93
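The 0.93 above is consistent with treating the summed weights of evidence as a log-odds score and passing it through the logistic function; a one-line check (the logistic transform is an assumption inferred from the numbers, not stated on the slide):

```python
import math

score = 0.51 + 0.51 + 0 + 0.51 + 1.10 + 0   # summed weights of evidence = 2.63
probability = 1 / (1 + math.exp(-score))    # logistic transform of the log-odds score
print(round(probability, 2))                # 0.93
```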

  10. Threshold • A lower threshold gives a higher false positive rate (more legitimate mail flagged as spam) • A higher threshold gives a higher false negative rate (more spam let through) • The higher threshold is usually preferred, since false positives are more costly

  11. Non-Linear Classification • A linear classifier ignores the effect of a word's context on its meaning, which is unrealistic • One could build a linear classifier that tests for more complex features such as simultaneous occurrences of words, but the computational cost is high • Non-linear classification is the solution

  12. Ripper • A rule set is a disjunction of different contexts • Each context is a conjunction of simple terms • A context for w1 might be: w2 belongs to the data and w3 belongs to the data, i.e. for the context to be true, w1 must occur together with w2 and w3 • The Ripper algorithm has three components, described on the following slides

  13. Rule Learning • Spam ← “spam” ∈ Subject • Spam ← “Free” ∈ Subject, “Spam” ∈ Subject • Spam ← “Gift!!” ∈ Subject, “Click” ∈ Subject • The rule set is the disjunction of the three rules stated above • There is an initial set of rules too
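A hedged sketch of how such a rule set could be represented and applied: each rule is a conjunction of "word occurs in Subject" tests, and the rule set is their disjunction, so a message is labelled spam as soon as any rule fires. The function name and tokenization are invented for illustration.

```python
# Sketch: a rule set as a disjunction of conjunctions over the Subject line.
# The three rules mirror the slide; names and tokenization are illustrative.

RULES = [
    ["spam"],             # Spam <- "spam" in Subject
    ["free", "spam"],     # Spam <- "Free" in Subject AND "Spam" in Subject
    ["gift!!", "click"],  # Spam <- "Gift!!" in Subject AND "Click" in Subject
]

def classify_subject(subject, rules=RULES):
    """Label as spam if every term of at least one rule occurs in the subject line."""
    words = set(subject.lower().split())
    return any(all(term in words for term in rule) for rule in rules)

print(classify_subject("Free spam offer"))        # True, the first two rules fire
print(classify_subject("Project status update"))  # False, no rule fires
```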

  14. Constructing the Rule Set • The initial rule set is constructed using a greedy strategy • Based on IREP (Incremental Reduced Error Pruning) • To construct a new rule, the dataset is partitioned into two parts, a growing (training) set and a pruning set • A single condition is added to the rule at each step

  15. Simplification and Optimization • At every step, the density of positive examples covered is increased • Conditions are added until the clause covers no negative example or there is no positive gain • After this, pruning, i.e. simplification, is done, again following a greedy strategy at every stage
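A simplified sketch of one grow-then-prune step in the spirit of the greedy strategy above: conditions are added while they increase the density of positive examples covered, then trailing conditions are pruned against the held-out pruning set. The scoring functions here are illustrative stand-ins, not the exact formulas from the slides.

```python
# Simplified grow-then-prune sketch, loosely following IREP.
# Examples are (features, label) pairs with binary features and labels 0/1.

def covers(rule, features):
    return all(features.get(cond, 0) == 1 for cond in rule)

def pos_density(rule, examples):
    covered = [y for x, y in examples if covers(rule, x)]
    return sum(covered) / len(covered) if covered else 0.0

def grow_rule(grow_set, vocabulary):
    """Greedily add single conditions while the density of positives improves."""
    rule = []
    while True:
        covered = [(x, y) for x, y in grow_set if covers(rule, x)]
        if covered and all(y == 1 for _, y in covered):
            break                                # clause covers no negative example
        gains = [(pos_density(rule + [c], grow_set) - pos_density(rule, grow_set), c)
                 for c in vocabulary if c not in rule]
        best_gain, best = max(gains, default=(0.0, None))
        if best is None or best_gain <= 0:       # no positive gain: stop growing
            break
        rule.append(best)
    return rule

def prune_rule(rule, prune_set):
    """Delete trailing conditions while accuracy on the pruning set does not drop."""
    def score(r):
        return sum(1 for x, y in prune_set if covers(r, x) == (y == 1))
    while len(rule) > 1 and score(rule[:-1]) >= score(rule):
        rule = rule[:-1]
    return rule

grow = [({"free": 1, "click": 1}, 1), ({"free": 1}, 1), ({"meeting": 1}, 0), ({"click": 1}, 0)]
print(grow_rule(grow, vocabulary=["free", "click", "meeting"]))  # ["free"]
```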

  16. Reaching Sufficient Rules • At each pruning step, the condition whose deletion maximizes the scoring function over U+ and U−, the sets of positive and negative examples, is deleted • Rule addition continues while the information gain is non-zero, i.e. while each new rule still covers positive examples • But if the data is noisy, the number of rules grows large

  17. MDL • Several heuristics are applied to solve this problem; MDL (Minimum Description Length) is one of them • After the addition of each rule, the total description length of the current rule set plus the examples is calculated • Addition of rules stops when this length is d bits larger than the shortest length found so far

  18. AdaBoost • It is easy to find rules of thumb that are often correct • e.g. if ‘buy now’ occurs in a message, then predict ‘spam’ • It is hard to find one rule that is very accurate • AdaBoost helps here: a general method for converting rough rules of thumb into a highly accurate prediction rule • It works by concentrating on the hard examples

  19. Pictorially

  20. Algorithm • Input: S = {(xi, yi)}, i = 1..m • Initialize D1(i) = 1/m for all i • For t = 1 to T: • ht = WeakLearner(S, Dt) • Choose βt = ½ ln((1 − εt)/εt) (proven to minimize training error for the 2-class case) [2] • Update Dt+1(i) = Dt(i) · exp(−βt yi ht(xi)) and normalize • Final hypothesis: f(x) = ∑t βt ht(x)
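A compact sketch of the procedure above with one-feature decision stumps as the WeakLearner; the stump learner, the ±1 label convention, and the toy data are assumptions made for illustration, and βt uses the standard ½ ln((1−εt)/εt) choice cited in [2].

```python
import math

# AdaBoost sketch with decision stumps over binary features as WeakLearner.
# Labels are +1 (spam) / -1 (legitimate); data and learner are illustrative.

def weak_learner(S, D):
    """Return the single-feature stump with the lowest weighted error under D."""
    best = None
    for j in range(len(S[0][0])):
        for sign in (+1, -1):
            h = lambda x, j=j, s=sign: s if x[j] == 1 else -s
            err = sum(d for (x, y), d in zip(S, D) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best[1], best[0]

def adaboost(S, T=10):
    m = len(S)
    D = [1.0 / m] * m                            # D1(i) = 1/m
    ensemble = []
    for _ in range(T):
        h, eps = weak_learner(S, D)
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # keep the logarithm well defined
        beta = 0.5 * math.log((1 - eps) / eps)   # minimizes the 2-class error bound [2]
        # Dt+1(i) = Dt(i) * exp(-beta * yi * ht(xi)), then normalize
        D = [d * math.exp(-beta * y * h(x)) for (x, y), d in zip(S, D)]
        Z = sum(D)
        D = [d / Z for d in D]
        ensemble.append((beta, h))
    return lambda x: sum(b * h(x) for b, h in ensemble)   # f(x) = sum_t beta_t * ht(x)

# Toy usage: feature 0 = "contains 'free'", feature 1 = "contains 'meeting'".
S = [((1, 0), +1), ((1, 1), +1), ((0, 1), -1), ((0, 0), -1)]
f = adaboost(S, T=5)
print([1 if f(x) > 0 else -1 for x, _ in S])     # recovers the training labels
```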

  21. Example

  22. Example

  23. Accuracy • Weighted accuracy measure: (λ·L− + S+) / (λ·L + S) • λ is a strictness measure • L : # legitimate messages, S : # spam messages • L− : # legitimate messages classified as legitimate • S+ : # spam messages classified as spam • Improving accuracy: • Increase λ • Introduce a threshold θ: an example is classified positive only if f(x) > θ (the default is zero) • Recall: correctly predicted spam out of the number of spam messages in the corpus • Precision: correctly classified spam out of the number predicted as spam
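A short sketch of the weighted accuracy, recall, and precision measures defined above; the counts and the λ value are placeholders, not results from the presentation.

```python
# Sketch: weighted accuracy, recall, and precision from raw counts.
# All counts and lambda below are illustrative placeholders.

def weighted_accuracy(legit_as_legit, spam_as_spam, n_legit, n_spam, lam):
    """(lambda * L- + S+) / (lambda * L + S), weighting legitimate mail by lambda."""
    return (lam * legit_as_legit + spam_as_spam) / (lam * n_legit + n_spam)

n_legit, n_spam = 500, 300                 # L and S
legit_as_legit, spam_as_spam = 495, 270    # L- and S+
predicted_spam = (n_legit - legit_as_legit) + spam_as_spam  # everything flagged as spam

recall = spam_as_spam / n_spam             # correctly predicted spam / spam in corpus
precision = spam_as_spam / predicted_spam  # correctly predicted spam / predicted as spam
print(weighted_accuracy(legit_as_legit, spam_as_spam, n_legit, n_spam, lam=9))
print(recall, precision)
```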

  24. Results on corpus PU1 [1]

  25. Pros and Cons • Pros: • Fast and simple • No parameters to tune • Flexible: can be combined with any learning algorithm • No knowledge of the WeakLearner is needed • Error reduces exponentially • Robust to overfitting • Cons: • Data driven: requires lots of data • Performance depends on the WeakLearner • May fail if the WeakLearner is too weak

  26. Conclusion • RIPPER as a text categorization algorithm works better than Naive Bayes (its advantage is larger when there are more classes) • The two are comparable for spam filtering (2 classes) • Boosting performs better than any weak learner it builds on

  27. References • [1] Xavier Carreras and Lluís Màrquez. Boosting Trees for Anti-Spam Email Filtering. 2001. • [2] Robert E. Schapire. The Boosting Approach to Machine Learning: An Overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002. • [3] David Madigan. Statistics and the War on Spam. 2004. • [4] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An Evaluation of Naive Bayesian Anti-Spam Filtering. In Proc. of the Workshop on Machine Learning in the New Information Age, 2000. http://citeseer.ist.psu.edu/androutsopoulos00evaluation.html • [5] William W. Cohen and Yoram Singer. Context-Sensitive Learning Methods for Text Categorization. SIGIR 1996: 307–315.
