
What is WEB SPAM

Many slides from a lecture by Marc Najork, Microsoft: “Detecting Spam Web Pages”.




Presentation Transcript


  1. What is WEB SPAM • Many slides from a lecture by Marc Najork, Microsoft: “Detecting Spam Web Pages”

  2. [Diagram of the search engine servers. Labels: THE WEB; Index the documents; Inverted Index; Query; Get indices for relevant documents; Document IDs; Retrieve full text of relevant documents; Rank Result; Display results on a web page] • What do Web Spammers do? • Web Spammers target the last step
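The pipeline named on this slide (index the documents, look up document IDs for a query, retrieve the text, rank the result) can be sketched in a few lines. This is a toy illustration, not the architecture of any real engine: the function names, the sample documents, and the ranking rule (count of query-term occurrences) are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, docs, query):
    """Return IDs of documents containing every query term, best first."""
    terms = query.lower().split()
    if not terms:
        return []
    # Get the IDs of documents that contain all query terms.
    candidate_ids = set.intersection(*(index.get(t, set()) for t in terms))

    # Retrieve the full text and rank by raw query-term count
    # (an assumed, deliberately naive ranking rule).
    def score(doc_id):
        words = docs[doc_id].lower().split()
        return sum(words.count(t) for t in terms)

    return sorted(candidate_ids, key=lambda d: (-score(d), d))

# Hypothetical three-document "web".
docs = {
    1: "cheap flights to paris",
    2: "paris travel guide and paris hotels",
    3: "paris paris paris paris cheap cheap cheap",
}
index = build_inverted_index(docs)
results = search(index, docs, "paris")
```

With this naive ranking rule, document 3, which merely repeats the query term, comes out on top. That is the weakness the following slides describe.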

  3. Web spam (you know it when you see it)

  4. Defining web spam • A spam web page is… • A page created for the sole purpose of attracting search engine referrals (to this page or some other “target” page) • Ultimately a judgment call • Some web pages are borderline useless • Some pages look fine in isolation, but in context are clearly “spam”

  5. Spamming Techniques • Boosting Rank: • Term Spamming • Link Spamming • Hiding Techniques: • Content Hiding • Cloaking • Redirecting

  6. Boosting by Term Spamming • Editing the textual content • The search engine looks for relevant terms in various fields • Different fields are weighted differently
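To make the point about fields concrete, here is a minimal sketch of a field-weighted term score. The field names and weights are hypothetical values chosen for illustration; the slides do not say which fields or weights any search engine uses.

```python
# Hypothetical per-field weights: a match in the title counts more
# than a match in anchor text, which counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}

def field_score(page, query_term):
    """Sum of weight * term count over the fields of a page (a dict)."""
    term = query_term.lower()
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = page.get(field, "").lower().split()
        score += weight * words.count(term)
    return score

# An ordinary page versus one whose fields were edited to repeat the term.
honest = {
    "title": "Paris travel guide",
    "body": "A short guide to visiting Paris in spring",
}
stuffed = {
    "title": "Paris Paris Paris hotels",
    "body": "paris " * 20,
}
```

Under these assumed weights, `field_score(honest, "paris")` is 4.0 and `field_score(stuffed, "paris")` is 29.0, which shows why a term spammer edits the highest-weighted fields first.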

  7. Keyword stuffing • Search engines return pages that contain query terms • (Certain caveats and provisos apply …) • One way to get more SE referrals: Create pages containing popular query terms (“keyword stuffing”) • Three variants: • Hand-crafted pages • Completely synthetic pages • Assembling pages from “repurposed” content
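One simple signal that separates stuffed pages from ordinary prose is how much of the page is taken up by its few most frequent words. The sketch below computes that share. It is an illustrative heuristic under assumed inputs, not a method taken from the slides, and a real detector would need far more than this.

```python
from collections import Counter

def top_term_share(text, k=3):
    """Fraction of all words accounted for by the k most frequent words."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    top = sum(count for _, count in counts.most_common(k))
    return top / len(words)

# Hypothetical sample texts.
natural = "we spent a week in paris and visited the louvre"
stuffed = ("cheap paris hotels cheap paris hotels "
           "cheap paris hotels book now")

natural_share = top_term_share(natural)
stuffed_share = top_term_share(stuffed)
```

Here the three most frequent words cover 3 of 10 words in the natural sentence but 9 of 11 words in the stuffed one.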

  8. Examples of synthetic content [screenshot; callouts: Monetization; Random words; Well-formed sentences stitched together; Links to keep crawlers going]

  9. Examples of synthetic content [screenshot; callout: Someone’s wedding site!]

  10. Really good synthetic content [screenshot; callouts: “Nigritude Ultramarine”: An SEO competition; Links to keep crawlers going; Grammatically well-formed but meaningless sentences]

  11. Link Spamming • Link structure importance • Outgoing links • Incoming links • Use Directories • Link exchange and spam farms

  12. Link spam • Link spam: Inflating the rank of a page by creating nepotistic links to it • From own sites: Link farms • From partner sites: Link exchanges • From unaffiliated sites (e.g. blogs, guest books, web forums, etc.) • The more links, the better • Generate links automatically • Use scripts to post to blogs • Synthesize entire web sites • Synthesize many web sites (DNS spam) • The more important the linking page, the better • Buy expired highly-ranked domains • Post links to high-quality blogs
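The effect of a link farm can be illustrated with a small link-based ranking computation. The sketch below uses a textbook power-iteration PageRank as a stand-in for "rank of a page"; the slides do not name a specific ranking algorithm, so that choice, the damping factor, and the tiny graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: "target" starts with no incoming links.
base = {"a": ["b"], "b": ["a"], "target": ["a"]}

# Add a farm of ten synthesized pages that all link to "target".
farm = dict(base)
farm.update({f"farm{i}": ["target"] for i in range(10)})

before = pagerank(base)["target"]
after = pagerank(farm)["target"]
```

In this toy graph the target's share of the total rank rises once the farm pages point at it, even though none of the farm pages has any incoming links of its own.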

  13. Link farms and link exchanges

  14. The trade in expired domains

  15. Web forum and blog spam

  16. Hiding Techniques • Invisible content • Cloaking: serve a different page to a crawler than to a browser • Techniques: • Recognize that a page request comes from a search engine (based on “user-agent” info or on IP address) • Make some text invisible (e.g. black on black) • Use CSS to hide text • Use JavaScript to rewrite page (dynamically created) • Use “meta-refresh” to redirect user to other page
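Since cloaking means the crawler and the browser receive different pages, one way to look for it is to compare the two versions of the same URL. The sketch below assumes both HTML strings have already been fetched (for example, once with a crawler user-agent and once with a browser user-agent) and flags the pair when their word sets barely overlap. The function names, the crude tag stripping, and the 0.5 threshold are assumptions for illustration, not a method from the slides.

```python
import re

def visible_words(html):
    """Crude text extraction: drop tags, lowercase, return the word set."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def looks_cloaked(html_for_crawler, html_for_browser, threshold=0.5):
    """Flag a page whose two versions share too few words (assumed cutoff)."""
    similarity = jaccard(visible_words(html_for_crawler),
                         visible_words(html_for_browser))
    return similarity < threshold

# Hypothetical responses for the same URL.
crawler_view = ("<html><body><h1>Cheap Paris hotels</h1>"
                "<p>paris hotel deals flights</p></body></html>")
browser_view = ("<html><body><h1>Win a prize</h1>"
                "<p>Click here now</p></body></html>")
```

For these two samples `looks_cloaked(crawler_view, browser_view)` returns `True`, while comparing a page with itself returns `False`. Pages with legitimately dynamic content would also differ between fetches, so a threshold like this can only be a first filter.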

  17. Why should we care? • We depend on search engines and trust them • Web spam undermines the reputation of a trusted information source
