Improving Web Spam Classification using Rank-time Features

Improving Web Spam Classification using Rank-time Features. September 25, 2008. TaeSeob, Yun. KAIST DATABASE & MULTIMEDIA LAB. Contents: Introduction; Support Vector Machine; Data Set; Domain Separation; Rank-time features; Evaluation; Summary. Introduction: World Wide Web (WWW)


Presentation Transcript


  1. Improving Web Spam Classification using Rank-time Features • September 25, 2008 • TaeSeob, Yun • KAIST DATABASE & MULTIMEDIA LAB

  2. Contents • Introduction • Support Vector Machine • Data Set • Domain Separation • Rank-time features • Evaluation • Summary

  3. Introduction • World Wide Web (WWW) • Definition • An information space in which the items of interest, referred to as resources, are identified by global identifiers [IAN04] • Description • Contains too much information • Requires web search engines

  4. Introduction • Web Search Engine • Definition • A search engine designed to search for information on the World Wide Web [WIK08] • Description • Retrieves pages relevant to a user's query • Ranking has become important • Web spam interferes with web search engines

  5. Web Spam (1/2) • Definition • A page that uses improper methods to improve its ranking [KRY07] • Objective • Mislead the search engine's ranking algorithm • Make a profit by increasing the page's traffic • Why web spam should be removed • Users waste time searching for information • Ranking on search engines is critical for making a profit • Spam wastes search engine resources

  6. Web Spam (2/2) • Types of web spam • Link stuffing • Keyword stuffing • Cloaking • Web farming • When to remove web spam • Crawl-time • Index-time • Rank-time • How to remove web spam • By training a classifier: Support Vector Machine (SVM)

  7. Support Vector Machine (1/2) • Definition • A set of related supervised learning methods used for classification and regression [WIK08] • Description • Finds the separating hyperplane with the maximal margin in the feature vector space • [Figure: panels labeled <2 dimensions>, <3 dimensions>, and n dimensions; axes v1 and v2]
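
The maximal-margin idea on this slide is usually written as the optimization problem below. This is the standard hard-margin formulation for linearly separable data, given only as a sketch: the symbols w, b, x_i, y_i, and m do not appear in the slides, and the slides do not say which SVM variant or kernel the paper used.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,\bigl(w^{\top} x_i + b\bigr) \ge 1, \qquad i = 1, \dots, m
```

Here x_i is the feature vector of page i and y_i is +1 or -1 for spam or non-spam. The distance between the two supporting hyperplanes is 2 / ||w||, so minimizing ||w|| maximizes the margin.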

  8. Support Vector Machine (2/2) • Procedure • Collect datasets • Split the datasets into training datasets and a test dataset • Train the machine with the training datasets • Test the machine with the test dataset • Problem • A labeled dataset has to be collected first

  9. Dataset • Definition • A set of labeled sample data for training and testing • Collecting procedure • Collect lists of common queries from the MSN Live search engine • Have human judges label each of the top-10 results as spam, non-spam, or unknown • Split the dataset into training datasets and a test dataset • How the dataset is split • Very important! • We choose domain separation

  10. Domain Separation (1/6) • Definition • A splitting method that partitions the dataset by domain • Procedure (in this paper) • For each URL in the dataset • Compute a hash value from its domain • If the hash value is new, assign it at random to one of 5 files • If the hash value has been seen before, put the URL into the file already assigned to it • Adjust the 5 files to similar sizes • Why should we choose domain separation?
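
The procedure on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the host name stands in for the "domain", MD5 stands in for the unspecified hash function, the function names and the `seed` parameter are hypothetical, and the final "adjust to similar sizes" step is left out because the slides do not say how it is done.

```python
import hashlib
import random
from urllib.parse import urlsplit


def domain_key(url):
    """Hash of the URL's host name (assumption: host name == domain)."""
    host = urlsplit(url).hostname or ""
    return hashlib.md5(host.lower().encode("utf-8")).hexdigest()


def domain_separate(urls, n_folds=5, seed=0):
    """Split URLs into n_folds lists so that a domain never spans two lists."""
    rng = random.Random(seed)   # seeded so the split is reproducible
    assigned = {}               # domain hash -> fold index
    folds = [[] for _ in range(n_folds)]
    for url in urls:
        key = domain_key(url)
        if key not in assigned:
            # New hash value: pick one of the folds at random.
            assigned[key] = rng.randrange(n_folds)
        # Seen before (or just assigned): reuse the remembered fold.
        folds[assigned[key]].append(url)
    return folds
```

Because the fold is chosen per domain rather than per URL, every page of a given domain lands in the same fold, which is the property the following slides rely on.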

  11. Domain Separation (2/6) • Domain-separated vs. randomly separated • Opinion • Domain-separated datasets are better • Results from training on randomly separated datasets are WRONG! • This is a general problem for classification in machine learning • Reason • If the dataset contains subsets that share features, the split has to take those features into account • In fact, some spammers buy a domain just to host spam pages, so it is common for all pages on that domain to be labeled spam • How are domain-separated datasets made?

  12. Domain Separation (3/6) • Five-fold cross validation • Definition • The method used in this paper to train and test the SVM • Procedure • Choose one of the five domain-separated datasets as the test set • Use the other four domain-separated datasets as training datasets • Train the SVM with the 4 training datasets • Test the SVM with the test set • Repeat the steps above until each of the five datasets has served as the test set
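
The cross-validation loop described above can be sketched as follows. The function and parameter names are hypothetical; `train` and `evaluate` are placeholders for whatever SVM training and scoring routines are actually used, which the slides do not specify.

```python
def cross_validate(folds, train, evaluate):
    """Run k-fold cross validation over pre-built (e.g. domain-separated) folds.

    folds    -- list of k lists of examples
    train    -- callable: list of training examples -> model
    evaluate -- callable: (model, list of test examples) -> result
    Returns one result per fold, in fold order.
    """
    results = []
    for i, test_fold in enumerate(folds):
        # Every fold except the i-th one is pooled into the training set.
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        results.append(evaluate(model, test_fold))
    return results
```

With five folds this trains five models, each on four folds and tested on the remaining one, so every example is used for testing exactly once.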

  13. Domain Separation (4/6) • The result of domain separation • 31,300 URLs in total • 3,133 URLs labeled spam (about 10%) • Problem • Learning a mapping from feature vector to subset hash to label may turn out to be wildly and incorrectly optimistic • Left as future work

  14. Domain Separation (5/6) • Description • No domain appears more than once • Dataset consists of 25% spam • Domain information could not be used • Worst-case graph

  15. Domain Separation (6/6) • Description • Additional features are added • Dataset consists of 10% spam • Harder to detect than at 25% spam • Result • Still slightly lower than random separation, but this is the worst case • Note: domain information still could not be used

  16. FEATA (1/2) • Description • Rank-independent features • FEATA includes • Domain-level features • Page-level features • Link information

  17. FEATA (2/2) • Description • Average precision of 60% at 10.8% recall • Dataset consists of 10% spam • Not good enough • We will add rank-time features!
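
A figure such as "precision X at recall Y" comes from sweeping a cutoff over the classifier's scores. The sketch below uses the standard definitions of precision and recall; the function names are hypothetical, labels are assumed to be 1 for spam and 0 for non-spam, and tied scores are not treated specially. It is not taken from the paper.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) after each cutoff, highest score first."""
    total_spam = sum(labels)
    if total_spam == 0:
        raise ValueError("need at least one spam example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_pos = 0
    for flagged, (_, label) in enumerate(ranked, start=1):
        true_pos += label  # label is 1 when the flagged page really is spam
        points.append((true_pos / total_spam, true_pos / flagged))
    return points


def precision_at_recall(points, target_recall):
    """Precision at the first cutoff whose recall reaches target_recall."""
    for recall, precision in points:
        if recall >= target_recall:
            return precision
    return None
```

Reading the same list of points the other way round (the highest recall among cutoffs whose precision stays above a fixed value) gives a "recall at a set precision" figure like the one quoted in the summary slide.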

  18. Rank-time Features • Definition • Features used at rank time • Motivation • Every page has a feature vector • Feature vectors of spam and non-spam pages have different shapes • Spammers cannot guess the distribution of non-spam feature vectors • Consists of • Query-independent features (FEATB) • Query-dependent features (FEATQ)

  19. FEATB • Definition • Query-independent, rank-time features • Description • Page-level features • Domain-level features • Popularity features • Time features

  20. FEATQ • Definition • Query-dependent, rank-time features • Description • Depend on the match between the query and document properties • Examined for each returned result • Future work • Spam is labeled on the URL only, not on the relevance of a URL to a query

  21. Evaluation • Micro-averaged over the five tests
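
Micro-averaging normally means pooling the raw counts from all test folds before computing a metric, rather than averaging the per-fold metrics. The sketch below shows that standard definition; the function name and the `(tp, fp, fn)` tuple layout are assumptions for illustration, and the slides do not show how the paper computed its figures.

```python
def micro_average(fold_counts):
    """Pool per-fold (true_pos, false_pos, false_neg) counts, then compute metrics.

    Returns (precision, recall) over the pooled counts.
    """
    tp = sum(c[0] for c in fold_counts)
    fp = sum(c[1] for c in fold_counts)
    fn = sum(c[2] for c in fold_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Pooling gives larger folds more weight, which matters here because domain-separated folds are only adjusted to similar, not identical, sizes.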

  22. Summary • Classification of web spam is an important problem • We can classify web spam by training an SVM • Building the training datasets as domain-separated datasets is very important • Rank-time features improve classification performance by as much as 25% in recall at a set precision

  23. References • [KRY07] Svore, K. M., Wu, Q., Burges, C. J. C., Raman, A., “Improving Web Spam Classification using Rank-time Features”, AIRWeb ’07, May 8, 2007 • [IAN04] Jacobs, I., “Architecture of the World Wide Web, Volume One”, W3C Recommendation, Dec 15, 2004 • [WIK08] “Web Search Engine”, “Support Vector Machine”, http://wikipedia.org, Sep 25, 2008

  24. [Appendix A] Receiver Operating Characteristic
