
Modern Information Retrieval

Modern Information Retrieval. Chapter 9: Parallel and Distributed IR Section 9.1: Introduction Section 9.2.2.: MIMD Architectures Inverted Files November 5, 1999. Summary. Introduction Review of parallel computing and parallel program performance measures

jens

Presentation Transcript


  1. Modern Information Retrieval Chapter 9: Parallel and Distributed IR Section 9.1: Introduction Section 9.2.2.: MIMD Architectures Inverted Files November 5, 1999

  2. Summary • Introduction • Review of parallel computing and parallel program performance measures • Exploration of techniques for implementing inverted files on MIMD parallel architectures • Conclusion

  3. Introduction • The volume of electronic text available online today is staggering. • The WWW contains over 800 million pages of text, comprising nearly 6 terabytes of data (Nature, Vol. 400, 8 July 1999, www.nature.com). • As document collections grow larger, they become more expensive to manage with an information retrieval system. • To support the demanding requirements of modern search environments, we must turn to alternative architectures and algorithms.

  4. Parallel Computing • Parallel computing is the simultaneous application of multiple processors to solve a single problem. • Flynn’s Taxonomy: SISD (single instruction, single data); SIMD (single instruction, multiple data); MISD (multiple instruction, single data); MIMD (multiple instruction, multiple data).

  5. Parallel Program Performance Measures • Speedup: $S = \frac{\text{running time of best available sequential algorithm}}{\text{running time of parallel algorithm}}$ • Amdahl’s Law: $S \le \frac{1}{f + (1 - f)/N}$, where f is the fraction of the problem that must be computed sequentially and N is the number of processors.
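
To make Amdahl’s Law concrete, the following is a minimal sketch that evaluates the standard bound 1 / (f + (1 - f) / N) for a few processor counts. The function name `amdahl_speedup_bound` and the sample value f = 0.1 are illustrative assumptions, not taken from the slides.

```python
def amdahl_speedup_bound(f: float, n: int) -> float:
    """Upper bound on speedup when a fraction f of the work is sequential.

    f is the sequential fraction (0 <= f <= 1); n is the number of processors.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (f + (1.0 - f) / n)


# Hypothetical workload: 10% of the running time cannot be parallelized.
for processors in (1, 4, 16, 64):
    bound = amdahl_speedup_bound(0.1, processors)
    print(f"N={processors:3d}  speedup bound = {bound:.2f}")
```

However many processors are added, the bound for this workload stays below 1 / f = 10, which is why the sequential fraction dominates at large N.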

  6. Parallel Program Performance Measures • Efficiency: $\text{efficiency} = \frac{S}{N}$, where S is speedup and N is the number of processors.
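
As a small worked example of the two measures, the sketch below computes speedup as sequential time over parallel time and efficiency as speedup over the number of processors. The timings (120 s sequential, 20 s on 8 processors) are made-up numbers chosen only for illustration.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup S: best sequential running time divided by parallel running time."""
    return t_sequential / t_parallel


def efficiency(s: float, n: int) -> float:
    """Efficiency: speedup S divided by the number of processors N."""
    return s / n


# Hypothetical measurements, not from the slides.
s = speedup(120.0, 20.0)   # 6.0
e = efficiency(s, 8)       # 0.75
print(f"speedup = {s}, efficiency = {e}")
```

An efficiency of 1.0 would mean every processor is fully used; values below 1.0 reflect sequential work, communication, or load imbalance.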

  7. MIMD Architectures • MIMD architectures offer a great deal of flexibility in how parallelism is defined and exploited to solve a problem. • There are two ways in which a retrieval system can exploit a MIMD machine: • Parallel multitasking; • Partitioned parallel processing.

  8. MIMD Architectures • Figure: Parallel multitasking on a MIMD machine. Components shown: user queries, a broker, several search engines, and the results returned to the users.

  9. MIMD Architectures • Figure: Partitioned parallel processing on a MIMD machine. Components shown: a user query, a broker, several search processes exchanging subqueries and results with the broker, and the final result.

  10. MIMD Architectures • Basic data elements processed by a search algorithm (rows are documents, columns are indexing items):

                      Indexing Items
             k1     k2     ...    ki     ...    kt
      d1     w1,1   w2,1   ...    wi,1   ...    wt,1
      d2     w1,2   w2,2   ...    wi,2   ...    wt,2
      ...    ...    ...    ...    ...    ...    ...
      dj     w1,j   w2,j   ...    wi,j   ...    wt,j
      ...    ...    ...    ...    ...    ...    ...
      dN     w1,N   w2,N   ...    wi,N   ...    wt,N

  11. MIMD Architectures • There are two possible methods for partitioning the data: • Document partitioning: the N documents are distributed across the P processors; each parallel process evaluates the query on the subcollection of N/P documents assigned to it; • Term partitioning: the t indexing items are distributed across the P processors; the evaluation process for each document is spread over multiple processors.
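
The two partitioning schemes can be sketched as two ways of slicing the same document-by-term matrix: by rows (documents) or by columns (indexing items). The helper `partition` and its contiguous, near-equal chunking are assumptions made for illustration; the slides do not prescribe how items are assigned to processors.

```python
def partition(items, p):
    """Split items into p contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), p)
    chunks, start = [], 0
    for i in range(p):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


documents = ["d1", "d2", "d3", "d4", "d5", "d6"]              # N = 6
terms = ["k1", "k2", "k3", "k4", "k5", "k6", "k7", "k8"]      # t = 8
P = 3

# Document partitioning: each processor scores N/P documents over all terms.
print(partition(documents, P))
# Term partitioning: each processor holds the weights for roughly t/P terms,
# so scoring one document involves several processors.
print(partition(terms, P))
```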

  12. Inverted Files: Logical Document Partitioning • Data Partitioning • The data partitioning is done logically using essentially the same basic underlying inverted file index as in the original sequential algorithm; • The inverted file is extended to give each parallel process direct access to that portion of the index related to the processor’s subcollection of documents.

  13. Inverted Files: Logical Document Partitioning • Figure: Extended dictionary entry for document partitioning. The dictionary entry for term i holds pointers P1, P2, P3 and P4 into the inverted list for term i, one per processor.

  14. Inverted Files: Logical Document Partitioning • Query Evaluation • The broker initiates P parallel processes to evaluate the query; • Each process executes the same document scoring algorithm on its document subcollection; • The search processes record document scores in a single shared array of document score accumulators; • The broker produces the final ranked list of documents.
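
The query evaluation steps above can be sketched as follows, using the “pease porridge” example collection from the later slides. This is only an illustration under stated assumptions: threads stand in for the P parallel processes, the per-processor start offsets model the extended dictionary entry, and the score is a plain sum of term frequencies rather than any particular ranking formula.

```python
from concurrent.futures import ThreadPoolExecutor

P = 3          # number of parallel search processes (assumed)
NUM_DOCS = 6   # documents 1..6; processor i owns documents 2i+1 and 2i+2

# term -> (inverted list of (doc_id, tf), start offset of each processor's slice)
index = {
    "pease": ([(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)], [0, 2, 4]),
    "hot": ([(1, 1), (4, 1), (5, 1), (6, 1)], [0, 1, 2]),
}


def search(worker, query, accumulators):
    """Score only the slice of each inverted list that belongs to this worker."""
    for term in query:
        if term not in index:
            continue
        postings, offsets = index[term]
        start = offsets[worker]
        end = offsets[worker + 1] if worker + 1 < P else len(postings)
        for doc_id, tf in postings[start:end]:
            # Workers own disjoint documents, so no two workers touch one slot.
            accumulators[doc_id] += tf


def evaluate(query):
    """Broker: start P workers on a shared accumulator array, then rank."""
    accumulators = [0] * (NUM_DOCS + 1)   # slot 0 is unused
    with ThreadPoolExecutor(max_workers=P) as pool:
        list(pool.map(lambda w: search(w, query, accumulators), range(P)))
    hits = [(doc_id, score) for doc_id, score in enumerate(accumulators) if score > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


print(evaluate(["pease", "hot"]))
```

Because each document belongs to exactly one subcollection, the shared accumulator array needs no locking in this sketch; a scheme where workers could update the same slot would need synchronization.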

  15. Inverted Files: Logical Document Partitioning • Inverted File Construction • The indexer partitions the documents among the processors; • Each indexing process generates a batch of inverted lists, sorted by indexing item; • A merge step is performed to create the final inverted file.
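
A minimal sketch of this batch-then-merge construction, applied to the six-document example collection from the later slides. The function names, the regular-expression tokenizer, and the fixed two-documents-per-processor split are assumptions; the batches are also built one after another here, whereas in the scheme described each would be produced by its own indexing process.

```python
import heapq
import re
from collections import Counter

collection = {
    1: "Pease porridge hot",
    2: "Pease porridge cold",
    3: "Pease porridge in the pot",
    4: "Pease porridge hot, pease porridge not cold",
    5: "Pease porridge cold, pease porridge not hot",
    6: "Pease porridge hot in the pot",
}


def index_batch(docs):
    """One indexing process: emit (term, doc_id, tf) tuples sorted by term."""
    batch = []
    for doc_id, text in docs.items():
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        batch.extend((term, doc_id, tf) for term, tf in counts.items())
    batch.sort()
    return batch


def merge_batches(batches):
    """Merge step: combine the sorted batches into one inverted file."""
    inverted = {}
    for term, doc_id, tf in heapq.merge(*batches):
        inverted.setdefault(term, []).append((doc_id, tf))
    return inverted


# Partition the documents among three processors: {1,2}, {3,4}, {5,6}.
parts = [{d: collection[d] for d in ids} for ids in ([1, 2], [3, 4], [5, 6])]
inverted_file = merge_batches([index_batch(part) for part in parts])
print(inverted_file["pease"])
```

Because every batch is already sorted by indexing item, the merge is a single sequential pass over the batches.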

  16. Inverted Files: Physical Document Partitioning • Data Partitioning • The documents are physically partitioned into separate subcollections, one for each parallel processor; • Each subcollection has its own inverted file.

  17. Inverted Files: Physical Document Partitioning • Query Evaluation • The broker distributes the query to all of the parallel search processes; • Each parallel search process evaluates the query on its portion of the document collection, producing an intermediate hit-list; • The broker collects the intermediate hit-lists from all of the parallel search processes and merges them into a final hit-list.
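
The broker's collect-and-merge step can be sketched as a k-way merge of hit-lists that are already sorted by score. The partition contents below are a subset of the example inverted files from the later slides; the scoring (sum of term frequencies), the tie-break by document id, and the names `local_search` and `broker` are assumptions for illustration.

```python
import heapq
import itertools

# One inverted file per processor (only three terms shown per partition).
partitions = [
    {"cold": [(2, 1)], "hot": [(1, 1)]},
    {"cold": [(4, 1)], "hot": [(4, 1)], "not": [(4, 1)]},
    {"cold": [(5, 1)], "hot": [(5, 1), (6, 1)], "not": [(5, 1)]},
]


def rank_key(hit):
    """Order hits by descending score, then ascending document id."""
    return (-hit[1], hit[0])


def local_search(inverted_file, query):
    """One search process: build an intermediate hit-list for its subcollection."""
    scores = {}
    for term in query:
        for doc_id, tf in inverted_file.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return sorted(scores.items(), key=rank_key)


def broker(query, k):
    """Send the query to every partition and merge the sorted hit-lists."""
    hit_lists = [local_search(part, query) for part in partitions]
    merged = heapq.merge(*hit_lists, key=rank_key)
    return list(itertools.islice(merged, k))


print(broker(["hot", "cold"], 3))
```

Since each intermediate hit-list arrives sorted, the broker only needs to look at the head of each list to produce the top k results.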

  18. Inverted Files: Physical Document Partitioning • Inverted File Construction • Each processor creates, in parallel, its own complete index corresponding to its document partition; • A merge step is performed to accumulate the global statistics for all of the partitions and distribute them to each of the partition dictionaries.
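
The slide does not say which global statistics are accumulated. As one plausible example, the sketch below sums per-partition document frequencies into global document frequencies and writes them back into each partition's dictionary. The choice of document frequency, the field names `local_df` and `global_df`, and the function names are all assumptions; the counts come from the example collection in the later slides.

```python
from collections import Counter

# Per-partition dictionaries: term -> statistics for that subcollection.
dictionaries = [
    {t: {"local_df": n} for t, n in
     {"cold": 1, "hot": 1, "pease": 2, "porridge": 2}.items()},
    {t: {"local_df": n} for t, n in
     {"cold": 1, "hot": 1, "in": 1, "not": 1,
      "pease": 2, "porridge": 2, "pot": 1, "the": 1}.items()},
    {t: {"local_df": n} for t, n in
     {"cold": 1, "hot": 2, "in": 1, "not": 1,
      "pease": 2, "porridge": 2, "pot": 1, "the": 1}.items()},
]


def accumulate_global_df(dicts):
    """Merge step: sum the local document frequencies over all partitions."""
    total = Counter()
    for dictionary in dicts:
        total.update({term: entry["local_df"] for term, entry in dictionary.items()})
    return total


def distribute(global_df, dicts):
    """Write the global value back into every partition dictionary."""
    for dictionary in dicts:
        for term, entry in dictionary.items():
            entry["global_df"] = global_df[term]


global_df = accumulate_global_df(dictionaries)
distribute(global_df, dictionaries)
print(dictionaries[0]["hot"])
```

With global values available locally, each partition can compute collection-wide term weights without contacting the others at query time.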

  19. Inverted Files: Term Partitioning • Data Partitioning • Inverted lists are spread across the processors.

  20. Inverted Files: Term Partitioning • Query Evaluation • The query is decomposed into indexing items and each indexing item is sent to the processor that holds the corresponding inverted list; • The processors create hit-lists with partial document scores and return them to the broker; • The broker combines the hit-lists.
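
A minimal sketch of term-partitioned query evaluation, using the term-to-processor assignment from the example slides (P1: cold, hot, in; P2: not, pease, porridge; P3: pot, the). The scoring by summed term frequency and the names `partial_scores` and `broker` are illustrative assumptions.

```python
from collections import Counter

# Each processor holds complete inverted lists for its own terms.
processors = [
    {"cold": [(2, 1), (4, 1), (5, 1)],
     "hot": [(1, 1), (4, 1), (5, 1), (6, 1)],
     "in": [(3, 1), (6, 1)]},
    {"not": [(4, 1), (5, 1)],
     "pease": [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)],
     "porridge": [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)]},
    {"pot": [(3, 1), (6, 1)],
     "the": [(3, 1), (6, 1)]},
]
term_owner = {term: i for i, lists in enumerate(processors) for term in lists}


def partial_scores(processor, terms):
    """One processor: partial document scores for the terms it holds."""
    scores = Counter()
    for term in terms:
        for doc_id, tf in processors[processor][term]:
            scores[doc_id] += tf
    return scores


def broker(query):
    """Route each term to its owner, then add up the partial scores."""
    routed = {}
    for term in query:
        if term in term_owner:
            routed.setdefault(term_owner[term], []).append(term)
    total = Counter()
    for processor, terms in routed.items():
        total.update(partial_scores(processor, terms))
    return sorted(total.items(), key=lambda hit: (-hit[1], hit[0]))


print(broker(["hot", "not"]))
```

Only the processors that own a query term do any work, so a query whose terms all live on one processor leaves the others idle.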

  21. Inverted Files: Term Partitioning • Inverted File Construction • The inverted file is created using the parallel construction technique described for logical document partitioning.

  22. Example: Document Collection

      Document  Text
      1         Pease porridge hot
      2         Pease porridge cold
      3         Pease porridge in the pot
      4         Pease porridge hot, pease porridge not cold
      5         Pease porridge cold, pease porridge not hot
      6         Pease porridge hot in the pot

  23. Example: Inverted File

      Dictionary  Inverted Lists
      cold        <2,1> <4,1> <5,1>
      hot         <1,1> <4,1> <5,1> <6,1>
      in          <3,1> <6,1>
      not         <4,1> <5,1>
      pease       <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
      porridge    <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
      pot         <3,1> <6,1>
      the         <3,1> <6,1>

  24. Example: Logical Document Partitioning

      Dictionary: cold, hot, in, not, pease, porridge, pot, the

      Inverted list for the term “pease”, with one dictionary pointer per processor:
      P1  <1,1> <2,1>
      P2  <3,1> <4,2>
      P3  <5,2> <6,1>

  25. Example: Physical Document Partitioning

      P1
      cold        <2,1>
      hot         <1,1>
      pease       <1,1> <2,1>
      porridge    <1,1> <2,1>

      P2
      cold        <4,1>
      hot         <4,1>
      in          <3,1>
      not         <4,1>
      pease       <3,1> <4,2>
      porridge    <3,1> <4,2>
      pot         <3,1>
      the         <3,1>

      P3
      cold        <5,1>
      hot         <5,1> <6,1>
      in          <6,1>
      not         <5,1>
      pease       <5,2> <6,1>
      porridge    <5,2> <6,1>
      pot         <6,1>
      the         <6,1>

  26. Example: Term Partitioning

      P1
      cold        <2,1> <4,1> <5,1>
      hot         <1,1> <4,1> <5,1> <6,1>
      in          <3,1> <6,1>

      P2
      not         <4,1> <5,1>
      pease       <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
      porridge    <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>

      P3
      pot         <3,1> <6,1>
      the         <3,1> <6,1>

  27. Conclusion • The task of indexing and searching in very large text collections is costly; • Faster indexing and searching algorithms are always desirable and the use of parallel hardware is an obvious alternative; • We discussed two possible organizations for the document collection index on a MIMD parallel architecture: • Document partitioning; • Term partitioning.

  28. Conclusion • Document partitioning affords simpler inverted index construction and maintenance than term partitioning; • When term distributions in the documents and queries are more skewed, document partitioning performs better; • When terms are uniformly distributed in user queries, term partitioning performs better.

  29. Additional References • Lawrence, S., Giles, C.L. 1999. Accessibility of Information on the Web. Nature, Vol. 400, pp. 107-109. • Ribeiro-Neto, B.A., Barbosa, R.A. 1998. Query Performance for Tightly Coupled Distributed Digital Libraries. Digital Libraries 98, pp. 182-190. • Ribeiro-Neto, B.A., Moura, E.S., Neubert, M.S., Ziviani, N. 1999. Efficient Distributed Algorithms to Build Inverted Files. SIGIR’99, pp. 105-112.
