Block-based Web Search: Challenges and Solutions

Block-based Web Search
Deng Cai*1, Shipeng Yu*2, Ji-Rong Wen* and Wei-Ying Ma*
*Microsoft Research Asia, 1Tsinghua University, 2University of Munich

Problems in Traditional IR
• Term-document irrelevance problem
  • Noisy terms
  • Multiple topics
• Variant document length problem
  • Length normalization is important
• Passage retrieval in traditional IR
  • Partitions the document into several passages
  • Solves these problems to some extent
  • Three types of passages: discourse, semantic, window
  • Fixed-window passages have been shown to be robust

Problems in Web IR
• Noisy information
  • Navigation
  • Decoration
  • Interaction
  • …
• Multiple topics
  • May contain text as well as images or links
[Figure labels: Noisy Information, Multiple Topics]

Problems in Web IR (Cont.)
• Variant document length problem
Conclusion: in web IR, all the problems of traditional IR remain and are more severe!

Challenges in Web IR
• New characteristics of web pages
  • Two-dimensional logical structure
  • Visual layout presentation
• Page segmentation is therefore feasible
  • Obtain blocks from web pages
• Block-based web search is possible
[Figure labels: Font Size, Color, Space, Font Style, Separator]

Outline
• Motivation
• Page segmentation approaches
• Web search using page segmentation
  • Block Retrieval
  • Block-level Query Expansion
• Experiments and Discussions
• Conclusion

Web Page Segmentation Approaches
• Fixed-length approach (FixedPS)
  • Traditional window-based passage retrieval
• DOM-based approach (DomPS)
  • Like the natural paragraph in traditional passage retrieval
• Vision-based Web Page Segmentation (VIPS)
  • Achieves a semantic partition to some extent
• Combined approach (CombPS)
  • Combines VIPS and the fixed-length approach

Fixed-length Page Segmentation (FixedPS)
• A block contains a fixed number of words
  • Traditional window-based methods can be applied
• Approaches
  • Overlapped windows (e.g. Callan, SIGIR’94)
  • Arbitrary passages of varying length (e.g. Kaszkiel et al, SIGIR’97)
• Results
  • A simple but robust approach
  • Does not consider semantic information
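The overlapped-window idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `fixed_windows` is hypothetical, the 200-word size follows the experimental setting given later in the deck, and the half-window step is an assumption because the slides do not state the amount of overlap.

```python
from typing import List


def fixed_windows(text: str, size: int = 200, step: int = 100) -> List[List[str]]:
    """Split text into overlapping windows of `size` words, advancing `step` words.

    The step of size/2 is an assumption; the slides only say the windows overlap.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    words = text.split()
    if not words:
        return []
    windows = []
    start = 0
    while True:
        windows.append(words[start:start + size])
        # Stop once the current window has reached the end of the text.
        if start + size >= len(words):
            break
        start += step
    return windows
```

Every word falls into at least one window, and a page shorter than the window size yields a single block.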

DOM-based Page Segmentation (DomPS)
• Relies on the DOM structure to partition the page
  • DOM: Document Object Model
• Current approaches
  • Based only on tags (e.g. Crivellari et al, TREC 9)
  • Combine tags with contents and links (e.g. Chakrabarti et al, SIGIR’01)
• Results
  • Similar to discourse passages in passage retrieval
  • The DOM represents only part of the semantic structure
  • Imprecise content structure
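A tag-only DOM segmentation can be approximated with the standard-library HTML parser. The sketch below is illustrative only: the set `BLOCK_TAGS` is a hypothetical choice (the slides do not list the structural tags used), and script or style content is not filtered out.

```python
from html.parser import HTMLParser
from typing import List

# Hypothetical set of structural tags; the slides do not say which ones were used.
BLOCK_TAGS = {"p", "div", "td", "li", "table", "tr", "ul", "ol", "h1", "h2", "h3"}


class DomBlockSplitter(HTMLParser):
    """Close the current block at every opening or closing structural tag."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[str] = []
        self._buffer: List[str] = []

    def flush_block(self) -> None:
        # Normalize whitespace and keep the block only if it has text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush_block()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush_block()

    def handle_data(self, data):
        self._buffer.append(data)


def dom_blocks(html: str) -> List[str]:
    """Return the text blocks delimited by structural tags, in page order."""
    parser = DomBlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush_block()  # Emit any trailing free text.
    return parser.blocks
```

Free text that sits between two structural tags becomes a block of its own, which mirrors the "special block" treatment described in the experimental methodology.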

VIPS Algorithm
• Motivation
  • In many cases, topics can be distinguished by visual cues
  • Utilizes the two-dimensional structure of web pages
• Goal
  • Extract the semantic structure of a web page to some extent, based on its visual presentation
• Procedure
  • Partition the web page top-down, based on the separators
• Result
  • A tree structure in which each node corresponds to a block in the page
  • Each node is assigned a value (Degree of Coherence) indicating how coherent the content of the block is, based on visual perception
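The resulting block tree can be pictured with a small data structure. This is only a sketch of the output side of VIPS, not of the visual analysis itself. The names `VisualBlock` and `extract_blocks` are hypothetical, the Degree of Coherence is assumed to lie in [0, 1] (consistent with the 0.6 threshold quoted in the experiments), and it is assumed that partitioning stops at a node once its coherence reaches the permitted value.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualBlock:
    """One node of the block tree (hypothetical representation)."""
    text: str
    doc: float  # Degree of Coherence, assumed to be in [0, 1]
    children: List["VisualBlock"] = field(default_factory=list)


def extract_blocks(node: VisualBlock, permitted_doc: float = 0.6) -> List[VisualBlock]:
    """Descend until a block is coherent enough or cannot be divided further."""
    if node.doc >= permitted_doc or not node.children:
        return [node]
    blocks: List[VisualBlock] = []
    for child in node.children:
        blocks.extend(extract_blocks(child, permitted_doc))
    return blocks
```

Raising `permitted_doc` yields finer blocks and lowering it yields coarser ones, so the threshold acts as a granularity control.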

VIPS: An Example
Microsoft Technical Report MSR-TR-2003-79

Combined Approach (CombPS)
• VIPS addresses the problems of noisy information and multiple topics
• FixedPS can deal with the variant document length problem
• Combine the two:
  • Partition the web page using VIPS
  • Divide the blocks that contain more words than the pre-defined window length
[Figure: Block length after segmenting 50,000 pages, chosen from the WT10g data set, using VIPS]
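The second step of the combination can be sketched as follows, taking the text of the visual blocks as given. The function names are hypothetical, the 200-word window follows the experimental setting, and the half-window overlap is an assumption carried over from overlapped-window passage retrieval.

```python
from typing import List


def split_long_block(words: List[str], window: int = 200) -> List[List[str]]:
    """Keep a short block whole; cut a long one into half-overlapping windows."""
    if len(words) <= window:
        return [words]
    step = max(1, window // 2)  # Overlap amount is an assumption.
    pieces = []
    start = 0
    while True:
        pieces.append(words[start:start + window])
        if start + window >= len(words):
            break
        start += step
    return pieces


def comb_ps(visual_blocks: List[str], window: int = 200) -> List[str]:
    """Apply fixed-length splitting only to blocks longer than `window` words."""
    result = []
    for block in visual_blocks:
        for piece in split_long_block(block.split(), window):
            if piece:  # Skip blocks that contain no words.
                result.append(" ".join(piece))
    return result
```

Short visual blocks pass through unchanged, so the semantic partition is preserved wherever block length is not a problem.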

Web Page Segmentation Summary
• Fixed-length approach (FixedPS)
  • Traditional passage retrieval
• DOM-based approach (DomPS)
  • Like the natural paragraph in traditional passage retrieval
• Vision-based Web Page Segmentation (VIPS)
  • Achieves a semantic partition to some extent
• Combined approach (CombPS)
  • Combines VIPS and the fixed-length approach

Block Retrieval
• Similar to traditional passage retrieval
  • Retrieve blocks instead of full documents
  • Combine the relevance of blocks with the relevance of documents
• Goal:
  • Verify whether page segmentation can deal with both the length normalization and multiple-topic problems

Block-level Query Expansion
• Similar to passage-level pseudo-relevance feedback
  • Expansion terms are selected from the top blocks instead of the top documents
• Goal:
  • Test whether page segmentation can improve the selection of query terms by increasing term correlations within a block, and thus improve the final performance

Experiments
• Methodology
  • Fixed-length window approach (FixedPS)
    • Overlapped windows with a size of 200 words
  • DOM-based approach (DomPS)
    • Traverse the DOM tree looking for certain structural tags
    • A block is constructed from, and identified by, each such leaf tag
    • Free text between two tags is treated as a special block
  • Vision-based approach (VIPS)
    • The permitted degree of coherence is set to 0.6
    • All the leaf nodes are extracted as visual blocks
  • The combined approach (CombPS)
    • VIPS, then FixedPS
  • Full document approach (FullDoc)
    • No segmentation is performed

Experiments (Cont.)
• Dataset
  • TREC 2001 Web Track
    • WT10g corpus (1.69 million pages), crawled in 1997
    • 50 queries (topics 501-550)
  • TREC 2002 Web Track
    • .GOV corpus (1.25 million pages), crawled in 2002
    • 49 queries (topics 551-600)
• Retrieval system
  • Okapi, with weighting function BM2500
• Preprocessing
  • Standard stop-word list
  • No stemming or phrase information is used
• Parameters in BM2500 are tuned to achieve the best baselines
• Evaluation criterion: P@10
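For readers unfamiliar with the weighting function, the sketch below assumes that BM2500 takes the usual Okapi BM25 form with parameters k1, b and k3. The function name and the default parameter values are placeholders, not the tuned values used in the experiments, which the slides do not report.

```python
import math
from collections import Counter
from typing import Dict, List


def bm2500_score(query: List[str], doc: List[str], df: Dict[str, int],
                 n_docs: int, avg_len: float,
                 k1: float = 1.2, b: float = 0.75, k3: float = 1000.0) -> float:
    """Score one document (or block) for a query with an Okapi BM25-style function.

    df maps a term to the number of documents containing it; the defaults for
    k1, b and k3 are illustrative placeholders only.
    """
    doc_tf = Counter(doc)
    query_tf = Counter(query)
    # Length normalization: longer-than-average texts get a larger K.
    K = k1 * ((1 - b) + b * len(doc) / avg_len)
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        n = df.get(term, 0)
        w1 = math.log((n_docs - n + 0.5) / (n + 0.5))  # Robertson/Sparck Jones weight
        score += w1 * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))
    return score
```

The same function can score a block instead of a document by passing the block's words as `doc` and the average block length as `avg_len`.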

Experiments on Block Retrieval
• Steps:
  1. Do original document retrieval
     • Obtain a document rank DR
  2. Analyze the top N (1000 here) documents to get a block set
  3. Do block retrieval on the block set (same as Step 1, but with blocks in place of documents)
     • Obtain a block rank BR
     • Documents are re-ranked by the single best block in each document
  4. Combine BR and DR to get a new document rank
     • A tuning parameter controls the combination
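The re-ranking and combination steps can be sketched as below. The combination formula is not preserved in this transcript, so the linear interpolation of rank positions with a weight `alpha` is an assumption made for illustration, and both function names are hypothetical.

```python
from typing import Dict, List, Tuple


def best_block_rank(block_scores: List[Tuple[str, float]]) -> List[str]:
    """Rank documents by their single best block.

    block_scores holds one (doc_id, block_score) pair per block.
    """
    best: Dict[str, float] = {}
    for doc_id, score in block_scores:
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    # Higher score first; ties broken by doc_id for determinism.
    return sorted(best, key=lambda d: (-best[d], d))


def combine_ranks(dr: List[str], br: List[str], alpha: float = 0.5) -> List[str]:
    """Merge document rank DR and block rank BR by interpolating rank positions.

    This interpolation is an assumed form; lower combined value means better.
    """
    dr_pos = {d: i + 1 for i, d in enumerate(dr)}
    br_pos = {d: i + 1 for i, d in enumerate(br)}
    worst = len(dr) + 1  # Position given to documents with no ranked block.

    def combined(d: str) -> float:
        return alpha * dr_pos[d] + (1 - alpha) * br_pos.get(d, worst)

    # Ties fall back to the original document rank.
    return sorted(dr, key=lambda d: (combined(d), dr_pos[d]))
```

With `alpha = 1.0` the original document rank is returned unchanged, and with `alpha = 0.0` the documents are ordered purely by their best block.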

Block Retrieval on TREC 2001 and TREC 2002 (P@10)
[Figures: Result on TREC 2001 (P@10); Result on TREC 2002 (P@10)]

Experiments on Block-level Query Expansion
• Steps:
  • Same steps as block retrieval
    • Do original document retrieval to get DR
    • Analyze the top N (1000 here) documents to get a block set
    • Do block retrieval on the block set to get BR
  • Select expansion terms based on the top blocks
    • 10 expansion terms in our experiments
    • The number of top blocks is a tuning parameter
  • Document retrieval with the expanded query
    • Modify the term weights before final retrieval
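The term-selection step can be illustrated with a deliberately simple scoring rule. The slides do not state the selection formula, so this sketch substitutes plain block frequency (the number of top blocks that contain a term); the function name and arguments are hypothetical.

```python
from collections import Counter
from typing import List, Set


def select_expansion_terms(query: List[str], top_blocks: List[List[str]],
                           stop_words: Set[str], n_terms: int = 10) -> List[str]:
    """Pick expansion terms that occur in the most top-ranked blocks.

    Block frequency is a stand-in scoring rule, not the one used in the experiments.
    """
    query_set = set(query)
    block_freq: Counter = Counter()
    for block in top_blocks:
        # Count each term once per block, skipping query terms and stop words.
        for term in set(block):
            if term not in query_set and term not in stop_words:
                block_freq[term] += 1
    ranked = sorted(block_freq, key=lambda t: (-block_freq[t], t))
    return ranked[:n_terms]
```

Because candidates come from individual blocks rather than whole pages, terms from navigation bars or unrelated topics on the same page are less likely to be selected.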

Query Expansion on TREC 2001 and TREC 2002 (P@10)
[Figures: Result on TREC 2001 (P@10); Result on TREC 2002 (P@10)]

Discussions
• FullDoc obtains only low and insignificant results
  • The baseline is low, so many top-ranked documents are actually irrelevant
• DomPS performs poorly and is very unstable
  • The segmentation is too fine-grained
  • Semantic blocks can hardly be detected, and the expansion terms are poor
• FixedPS is stable and performs well
  • Similar to the results in traditional IR
  • A window may miss the real semantic blocks
• VIPS performs very well
  • Top blocks usually have very good quality
  • Length normalization is still a problem
• CombPS is the best, or nearly the best, method in all experiments
  • More than just a tradeoff

Conclusion
• Page segmentation is effective for improving web search
  • Block Retrieval
  • Block-level Query Expansion
• Plain-text retrieval → fixed-window partition
• Web information retrieval → semantic partition (VIPS)
• Integrating both semantic and fixed-length properties (CombPS) could deal with all the problems and achieve the best performance
• We believe that block-based web search can be very useful in real search engines, and that it can also be combined very easily with block-level link analysis

Thanks!
