Query-Driven Indexing for Scalable P2P Text Retrieval

Infoscale’07, June 6-8, 2007, Suzhou, China. Alvis. Query-Driven Indexing for Scalable P2P Text Retrieval. Gleb Skobeltsyn, EPFL, Switzerland, June 6, 2007. Joint work with Toan Luu, Ivana Podnar Žarko, Martin Rajman, and Karl Aberer.

Presentation Transcript


  1. Query-Driven Indexing for Scalable P2P Text Retrieval
     • Infoscale’07, June 6-8, 2007, Suzhou, China
     • Alvis
     • Gleb Skobeltsyn, EPFL, Switzerland, June 6, 2007
     • Joint work with: Toan Luu, Ivana Podnar Žarko, Martin Rajman, Karl Aberer
     • Slide footer: G.Skobeltsyn | Query-Driven Indexing for Scalable P2P Text Retrieval

  2. Goal
     • Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs)
     • Each peer:
       • Provides resources (bandwidth, storage)
       • Searches the whole network
       • Publishes its own documents

  3. Naïve (single-term) approach
     • ... is to distribute the global inverted index in a DHT
     • [Figure: peers drawn as K/I tables; posting lists h(“gleb”)-{d2,d3}, h(“epfl”)-{d1,d2}, h(t’)-{d4,d5}; query “epfl & gleb” with the labels {d1,d2} and {d2}]
     • This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
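A minimal sketch of the naïve approach, for illustration only. `ToyDHT`, `publish`, `lookup`, and `and_query` are hypothetical names, and the "DHT" here is just a list of in-memory dictionaries selected by hashing the term; a real DHT routes the lookup across the network.

```python
import hashlib


class ToyDHT:
    """Toy stand-in for a DHT: a fixed list of peers, each holding a local dict."""

    def __init__(self, num_peers):
        self.peers = [dict() for _ in range(num_peers)]

    def _peer_for(self, term):
        # The peer responsible for a term is chosen by hashing the term.
        digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
        return self.peers[int(digest, 16) % len(self.peers)]

    def publish(self, term, doc_id):
        # Add doc_id to the posting list stored at the responsible peer.
        self._peer_for(term).setdefault(term, set()).add(doc_id)

    def lookup(self, term):
        # Return a copy of the full posting list for the term.
        return set(self._peer_for(term).get(term, set()))


def and_query(dht, terms):
    """Fetch one full posting list per term and intersect them."""
    lists = [dht.lookup(t) for t in terms]
    return set.intersection(*lists) if lists else set()


dht = ToyDHT(num_peers=4)
for term, docs in {"gleb": ["d2", "d3"], "epfl": ["d1", "d2"]}.items():
    for doc in docs:
        dht.publish(term, doc)

print(sorted(and_query(dht, ["epfl", "gleb"])))  # ['d2']
```

The cost that motivates the rest of the talk is visible in `and_query`: every term's complete posting list has to be transferred before the intersection can be computed.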

  4. Indexing with Highly Discriminative Keys
     • [1] I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer: Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys, in ICDE’07, Istanbul, Turkey

  5. Indexing with HDKs: main properties
     • Distributed index contains {key, PL} pairs:
       • Each key corresponds to a term or a set of terms
       • Each key is assigned to a posting list
       • Each posting list stores at most DFmax top-ranked document references
     • Data-driven key generation:
       • Each time a new document is indexed, the posting list for some key k can reach the maximum size DFmax
       • This triggers the generation of new keys (k + other frequent keys)
     • Proximity filter: a document qualifies for a key t1&t2 if t1 is close to t2 (within a window of size w)
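The two mechanical pieces of this slide, truncation at DFmax and the proximity filter, can be sketched as follows. This is an illustration under assumptions, not the HDK implementation: the function names are hypothetical, postings are modeled as `(doc_id, score)` pairs, and "close" is taken to mean that the two term positions differ by less than `w`.

```python
import heapq


def truncate_posting_list(postings, df_max):
    """Keep at most df_max (doc_id, score) pairs, highest score first."""
    return heapq.nlargest(df_max, postings, key=lambda p: p[1])


def qualifies(doc_terms, t1, t2, w):
    """Proximity filter (assumed definition): some occurrence of t1 lies
    fewer than w positions away from some occurrence of t2."""
    pos1 = [i for i, t in enumerate(doc_terms) if t == t1]
    pos2 = [i for i, t in enumerate(doc_terms) if t == t2]
    return any(abs(i - j) < w for i in pos1 for j in pos2)


doc = "query driven indexing for scalable p2p text retrieval".split()
near = qualifies(doc, "query", "indexing", w=3)   # positions 0 and 2
far = qualifies(doc, "query", "retrieval", w=3)   # positions 0 and 7
top = truncate_posting_list([("d1", 0.2), ("d2", 0.9), ("d3", 0.5)], df_max=2)
print(near, far, top)  # True False [('d2', 0.9), ('d3', 0.5)]
```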

  6. HDK: exhaustive data-driven indexing
     • Pros:
       • The ICDE’07 paper proves that the number of keys grows linearly
       • Elegant key generation mechanism
       • Low bandwidth during query processing (PLs of limited size)
     • Cons:
       • In practice the number of keys is LARGE: 68M for 0.6M docs
       • High bandwidth consumption at indexing
     • Problem:
       • Too many keys are superfluous (almost never used)

  7. Query-Driven Indexing
     • Let's index only what is queried!

  8. Contents
     • Introduction
     • HDK approach for indexing
     • Query-driven approach for indexing/retrieval
       • Indexing structure
       • Example
       • ONM
       • Scalability
     • Evaluation
     • Conclusion

  9. Query-Driven Index (QDI)
     • The Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:
       • Avoids maintenance of superfluous keys
       • Generates only those keys that are requested by users
       • Utilizes the query log to discover such keys
     • Problems:
       • Indexing a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key
         • Opportunistic Notification Mechanism (smart broadcast)
       • An incomplete index degrades the quality of query results
         • We show that the degradation is low

  10. Which keys to index?
     • Each single term found in the document collection has to be indexed.
       • We call all single-term keys a basic single-term index.
       • The posting lists are truncated at DFmax.
     • A key k is non-superfluous and can be activated iff:
       • k is popular: QF(k) ≥ QFmin, where QF(k) is the popularity of the key k derived from the available query log and QFmin is a parameter of our model (popularity filter).
       • k contains from 2 to smax terms: 2 ≤ |k| ≤ smax, where smax is a parameter of our model (size filter).
       • all immediate sub-keys of k (of size |k|-1) are indexed and their associated posting lists are truncated (redundancy filter).
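The three filters can be written as one predicate. This is a sketch under assumptions: keys are modeled as `frozenset`s of terms, `qf` and `index` are plain dictionaries, and a posting list is treated as "truncated" when it holds at least `df_max` entries. All names are hypothetical.

```python
from itertools import combinations


def can_activate(key, qf, index, qf_min, s_max, df_max):
    """Return True if the multi-term key passes all three filters.

    key:   frozenset of terms
    qf:    dict mapping key -> query frequency from the query log
    index: dict mapping already indexed key -> posting list
    """
    # Size filter: 2 <= |k| <= s_max.
    if not 2 <= len(key) <= s_max:
        return False
    # Popularity filter: QF(k) >= QF_min.
    if qf.get(key, 0) < qf_min:
        return False
    # Redundancy filter: every sub-key of size |k|-1 is indexed and truncated.
    for sub in combinations(sorted(key), len(key) - 1):
        posting_list = index.get(frozenset(sub))
        if posting_list is None or len(posting_list) < df_max:
            return False
    return True


DF_MAX = 3
index = {
    frozenset("a"): ["d1", "d2", "d3"],  # reached DF_MAX: truncated
    frozenset("b"): ["d4"],              # below DF_MAX: complete list
    frozenset("c"): ["d5", "d6", "d7"],  # reached DF_MAX: truncated
}
qf = {frozenset("ac"): 5, frozenset("ab"): 5, frozenset("abc"): 5}

ac_ok = can_activate(frozenset("ac"), qf, index, qf_min=2, s_max=3, df_max=DF_MAX)
ab_ok = can_activate(frozenset("ab"), qf, index, qf_min=2, s_max=3, df_max=DF_MAX)
print(ac_ok, ab_ok)  # True False
```

The key `ab` is rejected because the posting list of `b` is already complete, so a query containing `b` can be answered from that list and a combined key would add nothing.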

  11. QDI: Retrieval
     • Single-term index is generated
     • Process abc
     • Probe Pabc
     • Probe Pab, Pbc and Pac
     • Probe Pa, Pb and Pc
     • Obtain top-DFmax results for a, b and c (ranked w.r.t. a, b and c respectively)
     • Contact peers in the list, re-rank the obtained results w.r.t. abc
     • Output top-10
     • Inc. the QF for ab, bc and ac
     • Activate (index) ac
     • [Figure: a peer submits the query ?abc to the key lattice abc / ab, ac, bc / a, b, c; annotations: “nothing”, “+1”, “popular”, “DFmax”]

  12. QDI: Retrieval 2
     • Assume the frequency of b is below DFmax
     • Note how the redundancy filter would simplify the lattice in such a case (grayed nodes cannot be activated)
     • [Figure: key lattice abc / ab, bc, ac / a, b, c with the nodes abc, ab and bc marked; annotation: “DFmax”]

  13. QDI: Retrieval 3
     • Single-term index is generated and ac is indexed
     • Process abc
     • Probe Pabc
     • Probe Pab, Pbc and Pac: obtain the result for ac
     • Probe Pb and obtain the result for b
     • Contact all peers in the list to re-rank the obtained results w.r.t. abc
     • Output top-10
     • Inc. the QF for ab, bc and ac
     • [Figure: a peer submits the query ?abc to the key lattice abc / ab, ac, bc / a, b, c; annotations: “nothing”, “+1”]
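The probing order in the retrieval walk-throughs can be sketched as a top-down walk over the lattice of term subsets. This is one reading of the slides rather than the authors' code: it assumes that a sub-key is skipped once a larger indexed key containing it has been found, which reproduces the "probe Pb only" step above. `probe_lattice` and the dictionary-based `index` are hypothetical.

```python
from itertools import combinations


def probe_lattice(query_terms, index):
    """Probe keys from the full query down to single terms.

    Returns a dict of the indexed keys that were hit, each mapped to its
    posting list. A key is not probed if it is a proper subset of a hit.
    """
    found = {}
    terms = sorted(set(query_terms))
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key < hit for hit in found):
                continue  # already covered by a larger indexed key
            if key in index:
                found[key] = index[key]
    return found


def label(keys):
    """Render a collection of frozenset keys as sorted strings like 'ac'."""
    return sorted("".join(sorted(k)) for k in keys)


single_terms = {frozenset(t): ["docs for " + t] for t in "abc"}

# Only the basic single-term index exists.
first = probe_lattice("abc", single_terms)

# The key ac has been activated in the meantime.
with_ac = dict(single_terms)
with_ac[frozenset("ac")] = ["docs for ac"]
second = probe_lattice("abc", with_ac)

print(label(first), label(second))  # ['a', 'b', 'c'] ['ac', 'b']
```

The union of the returned posting lists is the candidate set that the slides then re-rank with respect to the full query.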

  14. Opportunistic Notification Mechanism
     • ONM is used to activate a new multi-term key
     • ONM is a “smart” broadcast with the following features:
       • It is based on the shower multicast [2]: each peer within a specified range is contacted only once
       • Notifications are small and low-priority => piggybacking
       • Broadcast is split into several multicast sessions, each time pruning low-score documents
       • It uses the high-performance DHT layer [3]
     • [2] A. Datta, M. Hauswirth, R. Schmidt, R. John, K. Aberer: Range Queries in Tree-Structured Overlays, in P2P’05
     • [3] F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer: Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS’07

  15. Scalability
     • The retrieval traffic is bounded by a constant due to truncated posting lists (the constant depends on DFmax and the query size)
     • The indexing traffic depends on the number of keys to be activated
     • The number of keys in the HDK approach (UPPER BOUND) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents
     • The number of keys does not depend on the document collection size but only on the size of the query log
     • We can use the QFmin parameter to adjust the tradeoff: indexing traffic <-> retrieval quality

  16. Contents
     • Introduction
     • HDK approach for indexing
     • Query-driven approach for indexing/retrieval
       • Indexing structure
       • Example
       • ONM
       • Scalability
     • Evaluation
     • Conclusion

  17. Overlap experiment
     • Use the Wikipedia query log (9M queries, 9-10.2004) to build the index
     • Randomly choose 3K test queries
     • Answer each test query with Google and compare the result to the union of the top-DFmax Google results for each of its term combinations that are indexed according to the logs
     • This mimics our P2P IR system if Google’s ranking is used
     • Example: overlap@5 = 3/5 = 60% [Figure: the original query and its non-superfluous (indexed) combinations, with matching results marked X]
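The overlap measure used in this experiment can be sketched in a few lines. The function name, argument layout, and the result identifiers `u1` ... `u9` are illustrative assumptions; the numbers are chosen only to reproduce the 3/5 example on the slide.

```python
def overlap_at_k(reference, candidate_lists, k):
    """Fraction of the top-k reference results that also appear in the
    union of the candidate result lists."""
    top = reference[:k]
    if not top:
        return 0.0
    pool = set().union(*candidate_lists)
    return sum(1 for doc in top if doc in pool) / len(top)


# Reference ranking for the original query (hypothetical result ids).
reference = ["u1", "u2", "u3", "u4", "u5"]
# Truncated result lists of the indexed term combinations of that query.
candidates = [["u1", "u9"], ["u3", "u4", "u8"]]

score = overlap_at_k(reference, candidates, k=5)
print(score)  # 0.6
```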

  18. Overlap example
     • Cut-n-paste from the simulation log:
       >id=481,q=“what did babe ruth do in the 1920”
       “1920 babe ruth”, qf=0---->Ov@100=100%
       “1920 babe”, qf=0--------->Ov@100= 9%
       +++“1920 ruth”, qf=1--------->Ov@100=33%
       +++“babe ruth”, qf=495 ------->Ov@100= 69%
       ---“1920”, qf=716 ------------>Ov@100= 1%
       ---“babe”, qf=3196 ----------->Ov@100= 2%
       ---“ruth”, qf=1653 ----------->Ov@100= 7%
       Size: 192, Keys used: 2, Overlap@100: 94%

  19. Overlap with Google

  20. Overlap with Yahoo

  21. Overlap with Google (no/partial/full overlap)

  22. P2P Index Simulations
     • The number of keys depends only on the query log size and QFmin!
     • It does not depend on the collection size!
     • The number of keys is much smaller than for the HDK approach: 68M keys for 650K docs

  23. Real query logs?
     • Wikipedia queries are unrealistic (too skewed) as users know what they want.
     • Real web queries might perform worse?
     • Large-scale experiments with real web queries and the TREC collection in [4]
     • [4] G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer: Web Text Retrieval with a P2P Query-Driven Index, to appear in SIGIR’07

  24. Conclusions
     • We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:
       • Stores posting lists in a DHT for terms and term combinations
       • Stores at most DFmax top document references in a posting list
       • Efficiently collects the query statistics in a distributed fashion
       • Based on these statistics, activates (indexes) only popular keys
       • Computes the result of a multi-term query based only on the index entries available at the moment: no costly intersections
     • We also showed that:
       • With real query logs our approach achieves good retrieval quality
       • The QFmin parameter adjusts the traffic/quality tradeoff

  25. Last slide
     • Thank you for your attention! Questions?
