The CompleteSearch Engine: Interactive, Efficient, and Towards IR & DB integration

The CompleteSearch Engine: Interactive, Efficient, and Towards IR & DB integration. Talk at the University of Trier, 13 February 2007. Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany

Presentation Transcript


  1. The CompleteSearch Engine: Interactive, Efficient, and Towards IR & DB integration. Talk at the University of Trier, 13 February 2007. Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany. Joint work with Alexandru Chitea, Deb Majumdar, Christian Mortensen, Fabian Suchanek, Markus Tetzlaff, Thomas Warken, Ingmar Weber, …

  2. IR versus DB (simplified view)
     • IR system (search engine): single data structure and query algorithm, optimized for ranked retrieval on textual data
       • highly compressible and high locality of access
       • ranking is an integral part
       • can't do even simple selects, joins, etc.
       • in short: scales very well, but special-purpose
     • DB system (relational): variety of indices and query algorithms, to suit all sorts of complex queries on structured data
       • space overhead and limited locality of access
       • no integrated ranked retrieval
       • can do complex selects, joins, … (SQL)
       • in short: general-purpose, but slow on large data

  3. Our work (in a nutshell)
     • The CompleteSearch engine: novel data structure and query algorithm for context-sensitive prefix search and completion
       • highly compressible and high locality of access
       • IR-style ranked retrieval
       • DB-style selects and joins
       • natural blend of the two
       • subsecond query times for up to a terabyte on a single machine
       • no transactions, recovery, etc.
       • for low dynamics (few insertions/deletions)
       • other open issues at the end of the talk …
       • in short: fairly general-purpose, and scales very well

  4. Context-Sensitive Autocompletion
     • Complete to words that would lead to a hit
       • saves typing, avoids overspecification of the query, reveals the formulations used, helps with error correction, etc.
     • Complete to phrases
       • for the phrase uni trier, add the word uni_trier to the index
     • Complete to subwords
       • for the compound word eigenproblem, add the word problem to the index
     • Complete to arbitrary substrings
       • there are standard techniques
       • but usually not worth it (in text search)

  5. Semantic Completion
     • Complete to instances of categories
       • for the author Henning Fernau, add henning:fernau::author and fernau::henning:author
     • Complete to names of categories
       • for the author Henning Fernau, add author:henning_fernau
     • Refine search result by category (faceted search)
       • add ct:conference:stacs
       • add ct:author:henning_fernau
       • add ct:year:2005
       • proactively launch query with ct: appended
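To make the word patterns on this slide concrete, here is a minimal Python sketch that builds the artificial index words for one category instance and completes a typed prefix against them. The function names `semantic_index_words` and `complete` are hypothetical, and the delimiter patterns are copied from the slide as shown; how the engine actually generates these words is not stated in the talk.

```python
def semantic_index_words(first_name, last_name, category):
    """Build the artificial index words the slide lists for one category instance."""
    first, last = first_name.lower(), last_name.lower()
    return [
        f"{first}:{last}::{category}",    # reachable by typing the first name
        f"{last}::{first}:{category}",    # reachable by typing the last name
        f"{category}:{first}_{last}",     # reachable by typing the category name
        f"ct:{category}:{first}_{last}",  # facet word used to refine by category
    ]


def complete(prefix, vocabulary):
    """Return all vocabulary words that start with the typed prefix, sorted."""
    return sorted(word for word in vocabulary if word.startswith(prefix))


vocabulary = semantic_index_words("Henning", "Fernau", "author")
print(complete("fern", vocabulary))
print(complete("ct:", vocabulary))
```

Because every annotation is just another word in the index, typing `fern` or `ct:` is handled by the same prefix completion as ordinary text.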

  6. DB-style joins
     • Find authors who have published at both SIGIR and SIGMOD
       • must collect information from several documents
       • no way to do this with standard keyword search
       • with our context-sensitive prefix completion, we can launch the two queries conference:sigir author:* and conference:sigmod author:*
       • and intersect the lists of completions (not documents)
     • In this way we can realize any kind of join
       • note that adding conference:stacs, author:henning_fernau, year:2005, etc. effectively creates a table with schema (conference, author, year, publication)
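The join described on this slide can be sketched in a few lines of Python. The document collection, the author names, and the helper `completions` are made up for illustration; the sketch scans every document, whereas the engine answers the same question from its index.

```python
def completions(documents, context_word, prefix):
    """Words starting with `prefix` in documents that contain `context_word`."""
    return {
        word
        for words in documents.values()
        if context_word in words
        for word in words
        if word.startswith(prefix)
    }


# Hypothetical collection: one document per publication, annotated with
# conference:, author:, and year: words as on the slide.
DOCUMENTS = {
    1: {"conference:sigir", "author:alice", "author:bob", "year:2005"},
    2: {"conference:sigmod", "author:alice", "author:carol", "year:2006"},
    3: {"conference:stacs", "author:bob", "year:2005"},
    4: {"conference:sigmod", "author:dave", "year:2004"},
}

sigir_authors = completions(DOCUMENTS, "conference:sigir", "author:")
sigmod_authors = completions(DOCUMENTS, "conference:sigmod", "author:")

# The join intersects the completions, not the documents: no single
# document mentions both conferences.
both = sorted(sigir_authors & sigmod_authors)
print(both)
```

Intersecting documents instead would return nothing here, which is why the slide stresses that the lists of completions are intersected.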

  7. Incorporating Ontologies (ongoing work)
     • Consider an entity like John Lennon, who we know was a
       • singer, songwriter, person, human being, organism, guitarist, pacifist, vegetarian, entertainer, musician, …
     • We cannot add all the annotations to every occurrence of John Lennon
       • index size would explode
       • better to keep annotations separately
     • But we can
       • add entity:john_lennon for every occurrence
       • in a special document about him, add entity:john_lennon along with class:songwriter, class:musician, class:person, …
     • And then intersect the completions of, for example,
       • beatles entity: and class:musician entity:

  8. Related Engines: suggests whole queries from a precompiled list

  9. Related Engines: similar to Google Suggest, plus proactively snaps to one query and shows the result

  10. Context-Sensitive Prefix Search
     • Data is given as
       • documents containing words
       • documents have ids (D1, D2, …)
       • words have ids (A, B, C, …)
     • Query
       • given a sorted list of doc ids
       • and a range of word ids
     [Figure: example documents, each shown as a doc id followed by its word ids, for instance D13: A O E W H, D17: B W U K A, D88: P A E G Q; the query is the doc id list D13, D17, D88, … with the word range C D E F G H.]

  11. Context-Sensitive Prefix Search
     • Data is given as
       • documents containing words
       • documents have ids (D1, D2, …)
       • words have ids (A, B, C, …)
     • Query
       • given a sorted list of doc ids
       • and a range of word ids
     • Answer
       • all matching word-in-doc pairs
       • with scores
     [Figure: the same example documents and query, with doc id list D13, D17, D88, … and word range C D E F G H.]
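As a reference point for the query and answer defined above, here is a brute-force Python sketch. The sample documents are read off the slide's figure as far as the transcript allows (splitting `WU` into the two word ids W and U is an assumption), the function name `prefix_search` is hypothetical, and scores are left out.

```python
def prefix_search(index, doc_ids, first_word, last_word):
    """Return all (doc id, word id) pairs with the doc in `doc_ids` and the
    word in the range [first_word, last_word], sorted by doc id."""
    matches = []
    for doc_id in doc_ids:  # doc_ids is assumed to be sorted already
        for word in sorted(index.get(doc_id, ())):
            if first_word <= word <= last_word:
                matches.append((doc_id, word))
    return matches


# A few documents from the figure: doc id -> set of word ids.
INDEX = {
    13: {"A", "O", "E", "W", "H"},
    17: {"B", "W", "U", "K", "A"},
    43: {"D", "Q"},
    88: {"P", "A", "E", "G", "Q"},
}

# Query from the figure: doc ids D13, D17, D88 and word range C..H.
result = prefix_search(INDEX, [13, 17, 88], "C", "H")
print(result)
```

Document 43 contains a word in the range but is not in the doc id list, and document 17 is in the list but has no word in the range, so neither contributes to the answer.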

  12. Solution via an Inverted Index (INV)
     • For example, db*
       • given the sorted list of all document ids
       • given the range W of word ids matching db*
     • Iterate over all words from W
       • Word 781 (dbms): Doc. 16, Doc. 53, Doc. 591, ...
       • Word 782 (db2): Doc. 3, Doc. 66, Doc. 765, ...
       • Word 783 (dbase): Doc. 25, Doc. 98, Doc. 221, ...
       • Word 784 (dbis): Doc. 67, Doc. 189, Doc. 221, ...
       • Word 785 (dblp): Doc. 16, Doc. 110, Doc. 141, ...
     • Have to merge the lists
       • doc ids: Doc. 3, Doc. 16, Doc. 16, Doc. 25, …
       • word ids: Word 782, Word 781, Word 785, Word 783, …
     • query time = output size ∙ log(size of W)
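The merge on this slide can be sketched with Python's standard `heapq.merge`, which performs a heap-based k-way merge and so mirrors the stated cost of output size times log(size of W). The lists below contain only the three doc ids per word that the slide shows (the slide elides the rest), and `merge_word_range` is a hypothetical name.

```python
import heapq

# Inverted lists from the slide: word id -> sorted doc ids (truncated).
INVERTED_LISTS = {
    781: [16, 53, 591],   # dbms
    782: [3, 66, 765],    # db2
    783: [25, 98, 221],   # dbase
    784: [67, 189, 221],  # dbis
    785: [16, 110, 141],  # dblp
}


def merge_word_range(inverted_lists, first_word, last_word):
    """Merge the lists of all words in [first_word, last_word] into one
    list of (doc id, word id) pairs sorted by doc id."""
    tagged = [
        [(doc_id, word_id) for doc_id in doc_ids]
        for word_id, doc_ids in inverted_lists.items()
        if first_word <= word_id <= last_word
    ]
    # heapq.merge keeps one entry per input list in a heap, so each output
    # element costs O(log k) for k lists.
    return list(heapq.merge(*tagged))


merged = merge_word_range(INVERTED_LISTS, 781, 785)
print(merged[:4])
```

The first four pairs reproduce the two rows shown on the slide: docs 3, 16, 16, 25 with words 782, 781, 785, 783.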

  13. Solution via an Inverted Index (INV)
     • For example, db* uni*
       • given the doc id list D: Doc. 3, Doc. 16, Doc. 18, Doc. 25, … (hits for db*)
       • given the range W of word ids matching uni*
     • Iterate over all words from W
       • Word 578 (uniform): Doc. 8, Doc. 23, Doc. 291, ...
       • Word 579 (unit): Doc. 24, Doc. 36, Doc. 165, ...
       • Word 580 (uni trier): Doc. 3, Doc. 18, Doc. 66, ...
       • Word 581 (unique): Doc. 56, Doc. 129, Doc. 251, ...
       • Word 582 (university): Doc. 18, Doc. 21, Doc. 25, ...
     • Intersect each list with D, then merge
       • doc ids: Doc. 3, Doc. 18, Doc. 18, Doc. 25, …
       • word ids: Word 580, Word 580, Word 582, Word 582, …
     • query time = size of D ∙ size of W + merging
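The intersect-then-merge procedure can be sketched as follows, using the doc ids that are visible on the slide (the elided tails are simply omitted). The names `intersect_sorted` and `prefix_search_inv` are hypothetical; the point of the sketch is that D is scanned once per word in W, which is where the size of D times size of W term comes from.

```python
import heapq


def intersect_sorted(left, right):
    """Intersect two sorted lists of doc ids with a linear two-pointer scan."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


def prefix_search_inv(inverted_lists, doc_ids, first_word, last_word):
    """Intersect the list of every word in the range with `doc_ids`,
    then merge the per-word results by doc id."""
    per_word = []
    for word_id in range(first_word, last_word + 1):
        hits = intersect_sorted(doc_ids, inverted_lists.get(word_id, []))
        per_word.append([(doc_id, word_id) for doc_id in hits])
    return list(heapq.merge(*per_word))


# Inverted lists from the slide: word id -> sorted doc ids (truncated).
INVERTED_LISTS = {
    578: [8, 23, 291],     # uniform
    579: [24, 36, 165],    # unit
    580: [3, 18, 66],      # uni trier
    581: [56, 129, 251],   # unique
    582: [18, 21, 25],     # university
}
D = [3, 16, 18, 25]  # hits for db*

result = prefix_search_inv(INVERTED_LISTS, D, 578, 582)
print(result)
```

On this data the result matches the slide: docs 3, 18, 18, 25 paired with words 580, 580, 582, 582.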

  14. The Inverted Index (INV): Problems
     • Asymptotic time complexity is bad (for our problem)
       • with INV we either have to merge/sort a lot
       • or intersect the same list over and over again
     • Still a tough baseline to beat in practice
       • highly compressible: half the space on disk means half the time to read it
       • very good locality of access: the ratio of random access time to sequential access time is 50,000 for disk, and still up to 100 for main memory
       • simple code: good for instruction cache, branch prediction, etc.

  15. A Tree-Based Index (AutoTree), SPIRE 2006
     • Output-sensitive behaviour
       • query time = size of result list
       • anytime algorithm: produces a result element in every step
     • Beats the inverted index by a factor of 5
       • but only in main memory
       • heavy use of bit rank data structures (to compute the number of set bits before a given position in constant time)
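The bit rank operation mentioned on this slide can be illustrated with a small two-level sketch: cumulative counts are stored per block, and a query adds the count inside one block. This is a generic textbook layout, not the AutoTree implementation; the class name `BitRank` and the block size of 64 are assumptions, and a production version would replace the in-block loop with a hardware popcount on a machine word.

```python
class BitRank:
    """Answers rank(i) = number of set bits strictly before position i."""

    BLOCK = 64  # assumed block size; one machine word in a real implementation

    def __init__(self, bits):
        self.bits = list(bits)
        # prefix[b] = number of set bits in all blocks before block b
        self.prefix = [0]
        for start in range(0, len(self.bits), self.BLOCK):
            block_count = sum(self.bits[start:start + self.BLOCK])
            self.prefix.append(self.prefix[-1] + block_count)

    def rank(self, position):
        block, offset = divmod(position, self.BLOCK)
        start = block * self.BLOCK
        # At most BLOCK - 1 bits are inspected, so the work per query is
        # bounded by a constant independent of the bit vector's length.
        return self.prefix[block] + sum(self.bits[start:start + offset])


bits = [1, 0, 1, 1, 0] * 40  # 200 bits, 3 set bits in every group of 5
ranker = BitRank(bits)
print(ranker.rank(5), ranker.rank(64), ranker.rank(200))
```

The stored prefix counts trade a small amount of extra space for query time that does not grow with the length of the bit vector.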

  16. A Hybrid Index (HYB), SIGIR 2006
     • HYB has a block for each word range, conceptually: [figure not preserved]
     • Replace doc ids by gaps and words by frequency ranks: [figure not preserved]
     • Encode both gaps and ranks such that x → log2 x bits
       • gaps: +0 → 0, +1 → 10, +2 → 110
       • ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110
     • An actual block of HYB: [figure not preserved]
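The gap and rank replacement can be tried out on a toy block. The two code tables below are copied from the slide; the block contents, the function names, and the alphabetical tie-breaking of equal frequencies are assumptions for illustration, and the sample gaps are kept at 2 or less so that no codeword beyond the slide's table is needed. This is not the actual HYB block layout, which the transcript does not preserve.

```python
from collections import Counter

# Toy prefix-free code tables as shown on the slide.
GAP_CODE = {0: "0", 1: "10", 2: "110"}
RANK_CODE = {1: "0", 2: "10", 3: "111", 4: "110"}


def encode_block(pairs):
    """Encode (doc id, word) pairs, sorted by doc id, as gap bits and rank bits."""
    counts = Counter(word for _, word in pairs)
    # Most frequent word gets rank 1 (ties broken alphabetically, an assumption).
    words_by_rank = sorted(counts, key=lambda w: (-counts[w], w))
    rank_of = {word: rank for rank, word in enumerate(words_by_rank, start=1)}
    gap_bits, rank_bits, previous_doc = [], [], 0
    for doc_id, word in pairs:
        gap_bits.append(GAP_CODE[doc_id - previous_doc])
        rank_bits.append(RANK_CODE[rank_of[word]])
        previous_doc = doc_id
    return "".join(gap_bits), "".join(rank_bits), words_by_rank


def decode_prefix_code(bits, table):
    """Decode a bit string written with a prefix-free code table."""
    inverse = {code: value for value, code in table.items()}
    values, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            values.append(inverse[current])
            current = ""
    return values


def decode_block(gap_bits, rank_bits, words_by_rank):
    """Invert encode_block: rebuild the (doc id, word) pairs."""
    gaps = decode_prefix_code(gap_bits, GAP_CODE)
    ranks = decode_prefix_code(rank_bits, RANK_CODE)
    pairs, doc_id = [], 0
    for gap, rank in zip(gaps, ranks):
        doc_id += gap
        pairs.append((doc_id, words_by_rank[rank - 1]))
    return pairs


# Hypothetical block for the word range A..D, sorted by doc id.
BLOCK = [(1, "A"), (1, "D"), (2, "A"), (2, "C"), (4, "B"),
         (4, "C"), (5, "A"), (6, "A"), (6, "D"), (8, "C")]

gap_bits, rank_bits, words_by_rank = encode_block(BLOCK)
print(words_by_rank, gap_bits, rank_bits)
```

Small gaps and frequent words get the shortest codewords, which is what makes a block compress well, and decoding is a single sequential scan over the two bit strings.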

  17. INV vs. HYB: Space Consumption
     • Theorem: The empirical entropy of INV is $\sum_i n_i \cdot \left(1/\ln 2 + \log_2(n/n_i)\right)$
     • Theorem: The empirical entropy of HYB with block size $\varepsilon \cdot n$ is $\sum_i n_i \cdot \left((1+\varepsilon)/\ln 2 + \log_2(n/n_i)\right)$
     • $n_i$ = number of documents containing the $i$-th word, $n$ = number of documents
     • Nice match of theory and practice
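To get a feel for the two formulas, the following Python sketch evaluates them exactly as stated on the slide. The document frequencies, the collection size, and the value of ε are invented numbers, not measurements from the talk, and the function names are hypothetical.

```python
import math


def inv_entropy_bits(doc_freqs, n):
    """Sum over words of n_i * (1/ln 2 + log2(n / n_i)), as on the slide."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


def hyb_entropy_bits(doc_freqs, n, epsilon):
    """Same sum with (1 + epsilon)/ln 2 in place of 1/ln 2."""
    return sum(
        ni * ((1 + epsilon) / math.log(2) + math.log2(n / ni))
        for ni in doc_freqs
    )


# Hypothetical collection: 10,000 documents and three words that occur
# in 1,000, 100 and 10 of them.
DOC_FREQS = [1000, 100, 10]
N = 10_000
EPSILON = 0.1

inv_bits = inv_entropy_bits(DOC_FREQS, N)
hyb_bits = hyb_entropy_bits(DOC_FREQS, N, EPSILON)
print(round(inv_bits), round(hyb_bits), round(hyb_bits - inv_bits))
```

The two expressions differ only by ε/ln 2 bits per word-in-document occurrence, so for small ε the stated space bound for HYB stays close to that of INV.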

  18. INV vs. HYB: Query Time
     • Experiment: type ordinary queries from left to right: db, dbl, dblp, dblp un, dblp uni, dblp univ, dblp unive, ...
     • [Results chart comparing INV and HYB not preserved]
     • HYB beats INV by an order of magnitude

  19. Engineering
     • With HYB, every query is essentially one block scan
       • perfect locality of access, no sorting or merging, etc.
       • balanced ratio of read, decompression, processing, etc.
     • Careful implementation in C++
       • Experiment: sum over an array of 10 million 4-byte integers (on a Linux PC with approx. 2 GB/sec memory bandwidth)

  24. System Design: High Level View
     • Compute Server (C++)
     • Web Server (PHP)
     • User Client (JavaScript)
     • Debugging such an application is hell!

  25. Conclusions
     • Summary
       • central mechanism for context-sensitive range search
       • very efficient in space and time, scales very well
       • combines IR-style ranked retrieval with DB-style selects and joins
       • support for interactive / semantic / faceted / ontology search
     • On our TODO list
       • achieve both output-sensitivity and locality of access
       • integrate top-k query processing
       • find out which SQL queries can be supported efficiently
       • deal with high dynamics (many insertions/deletions)
     • Thank you!

  26. Basic Problem Definition
     • Definition: Context-sensitive prefix search and completion
     • Given a query consisting of
       • a sorted list D of doc ids: Doc15, Doc183, Doc185, Doc17351, …
       • a range W of word ids: Word1893 – Word7329
     • Compute as a result
       • all pairs (w, d) with w ∈ W, d ∈ D, where w occurs in d, sorted by doc id
       • doc ids: Doc15, Doc15, Doc17351, ...
       • word ids: Word7014, Word5112, Word2011, …
     • Refinements
       • positions: Pos12, Pos73, Pos44, ...
       • scores: 0.7, 0.3, 0.5, ...

  28. Basic Problem Definition
     • For example, dblp uni
       • set D = document ids from the result for dblp
       • range W = word ids of all words starting with uni
       • → a multi-dimensional query is processed as a sequence of 1½-dimensional queries
     • For example, intersect the completions of the results for conf:sigir author: and conf:sigmod author:
       • → efficient, because the completions are from a small range
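The reduction of a multi-word query to a sequence of basic queries can be sketched as follows. The tiny collection and the names `prefix_query` and `process_query` are hypothetical, and words are matched by string prefix here rather than by a range of word ids, which is equivalent when word ids follow alphabetical order.

```python
def prefix_query(index, doc_ids, prefix):
    """Basic query: all (doc id, word) pairs with the doc in `doc_ids`
    and the word starting with `prefix`, sorted by doc id."""
    return [
        (doc_id, word)
        for doc_id in doc_ids
        for word in sorted(index[doc_id])
        if word.startswith(prefix)
    ]


def process_query(index, prefixes):
    """Process a multi-word query as a sequence of basic queries: the doc ids
    matched by one prefix become the set D for the next prefix."""
    doc_ids = sorted(index)  # start with all documents
    matches = []
    for prefix in prefixes:
        matches = prefix_query(index, doc_ids, prefix)
        doc_ids = sorted({doc_id for doc_id, _ in matches})
    return doc_ids, matches


# Hypothetical collection: doc id -> set of words.
INDEX = {
    1: {"dblp", "university", "trier"},
    2: {"dblp", "uniform", "bound"},
    3: {"dbms", "university"},
    4: {"dblp", "index"},
}

hits, completions = process_query(INDEX, ["dblp", "uni"])
print(hits, completions)
```

Document 3 contains a word starting with uni but is filtered out by the first step, so the completions offered for uni depend on the context dblp.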

  29. Conclusions
     • Context-sensitive prefix search and completion
       • is a fundamental operation
       • supports autocompletion search, semantic search, faceted search, DB-style selects and joins, ontology search, …
       • efficient support via the HYB index
       • very good compression properties
       • perfect locality of access
     • Some open issues
       • integrate top-k query processing
       • what else can we do with it?
       • very short prefixes