Modern Information Retrieval

Modern Information Retrieval. Chapter 9: Parallel and Distributed IR. Section 9.1: Introduction. Section 9.2.2: MIMD Architectures, Inverted Files. November 5, 1999

Summary • Introduction • Review of parallel computing and parallel program performance measures • Exploration of techniques for implementing inverted files on a MIMD parallel architecture • Conclusion

Introduction • The volume of electronic text available online today is staggering. • The WWW contains over 800 million pages of text, comprising nearly 6 terabytes of data (Nature, Vol. 400, 8 July 1999, www.nature.com). • As document collections grow larger, they become more expensive to manage with an information retrieval system. • To support the demanding requirements of modern search environments, we must turn to alternative architectures and algorithms.

Parallel Computing • Parallel computing is the simultaneous application of multiple processors to solve a single problem. • Flynn’s Taxonomy: • SISD: single instruction, single data • SIMD: single instruction, multiple data • MISD: multiple instruction, single data • MIMD: multiple instruction, multiple data

Parallel Program Performance Measures • Speedup: $S = \dfrac{\text{running time of best available sequential algorithm}}{\text{running time of parallel algorithm}}$ • Amdahl’s Law: $S \le \dfrac{1}{f + (1 - f)/N}$, where f is the fraction of the problem that must be computed sequentially and N is the number of processors.

Parallel Program Performance Measures • Efficiency $= \dfrac{S}{N}$, where S is the speedup and N is the number of processors.
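The performance measures can be checked numerically. The following is a minimal sketch using the standard definitions of speedup, Amdahl's bound, and efficiency; the function names and the sample values (f = 0.1, N = 8) are illustrative assumptions, not taken from the slides.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """S = running time of best sequential algorithm / running time of parallel algorithm."""
    return t_sequential / t_parallel


def amdahl_bound(f: float, n: int) -> float:
    """Upper bound on speedup when a fraction f of the work must run sequentially."""
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(s: float, n: int) -> float:
    """Efficiency = S / N."""
    return s / n


# Hypothetical scenario: 10% of the work is sequential, 8 processors.
bound = amdahl_bound(0.1, 8)
print(round(bound, 3))                 # 4.706
print(round(efficiency(bound, 8), 3))  # 0.588
```

Even with only 10% sequential work, eight processors cannot deliver more than about 4.7 times the sequential performance, so efficiency falls below 60%.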

MIMD Architectures • MIMD architectures offer a great deal of flexibility in how parallelism is defined and exploited to solve a problem. • There are two ways in which a retrieval system can exploit a MIMD machine: • Parallel multitasking; • Partitioned parallel processing.

MIMD Architectures • [Figure: Parallel multitasking on a MIMD machine. Labels: User Query, Broker, Search Engine (several instances), Result.]

MIMD Architectures • [Figure: Partitioned parallel processing on a MIMD machine. Labels: User Query, Broker, Subquery/Results, Search Process (several instances), Result.]

MIMD Architectures • Basic data elements processed by a search algorithm (rows are documents, columns are indexing items):

                  Indexing Items
             k1     k2     ...   ki     ...   kt
Documents
    d1       w1,1   w2,1   ...   wi,1   ...   wt,1
    d2       w1,2   w2,2   ...   wi,2   ...   wt,2
    ...      ...    ...    ...   ...    ...   ...
    dj       w1,j   w2,j   ...   wi,j   ...   wt,j
    ...      ...    ...    ...   ...    ...   ...
    dN       w1,N   w2,N   ...   wi,N   ...   wt,N

MIMD Architectures • There are two possible methods for partitioning the data: • Document partitioning: the N documents are distributed across the P processors; each parallel process evaluates the query on the subcollection of N/P documents assigned to it; • Term partitioning: the t indexing items are distributed across the P processors; the evaluation process for each document is spread over multiple processors.
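Both partitioning methods amount to splitting one axis of the document/indexing-item matrix across P processors. The sketch below assumes a contiguous split with chunk sizes differing by at most one; the slides do not prescribe an assignment rule, and the `partition` helper is hypothetical. The document ids and terms come from the example collection used later in these slides.

```python
def partition(items, p):
    """Split items into p contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), p)
    chunks, start = [], 0
    for i in range(p):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(items[start:start + size])
        start += size
    return chunks


doc_ids = [1, 2, 3, 4, 5, 6]
terms = ["cold", "hot", "in", "not", "pease", "porridge", "pot", "the"]

# Document partitioning: each processor owns N/P documents.
print(partition(doc_ids, 3))  # [[1, 2], [3, 4], [5, 6]]

# Term partitioning: each processor owns a subset of the t indexing items.
print(partition(terms, 3))
```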

Inverted Files Logical Document Partitioning • Data Partitioning • The data partitioning is done logically using essentially the same basic underlying inverted file index as in the original sequential algorithm; • The inverted file is extended to give each parallel process direct access to that portion of the index related to the processor’s subcollection of documents.

Inverted Files Logical Document Partitioning • [Figure: Extended dictionary entry for document partitioning. The dictionary entry for item i holds pointers P1, P2, P3, and P4 into the inverted list for term i.]

Inverted Files Logical Document Partitioning • Query Evaluation • The broker initiates P parallel processes to evaluate the query; • Each process executes the same document scoring algorithm on its document subcollection; • The search processes record document scores in a single shared array of document score accumulators; • The broker produces the final ranked list of documents.
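The query evaluation steps for logical document partitioning can be sketched as follows. This is a toy illustration under several assumptions: threads stand in for the parallel processes, the score is a plain sum of term frequencies, the document ranges per processor are contiguous, and a range filter stands in for the per-processor pointers of the extended dictionary. None of these choices come from the slides.

```python
from threading import Thread

# Single global inverted file (a subset of the example): term -> [(doc_id, tf)].
INDEX = {
    "hot": [(1, 1), (4, 1), (5, 1), (6, 1)],
    "cold": [(2, 1), (4, 1), (5, 1)],
    "pot": [(3, 1), (6, 1)],
}
RANGES = [(1, 2), (3, 4), (5, 6)]  # assumed subcollection of each processor
NUM_DOCS = 6


def search_process(query, lo, hi, accumulators):
    """Score only documents lo..hi, so processes write to disjoint slots."""
    for term in query:
        for doc_id, tf in INDEX.get(term, []):
            if lo <= doc_id <= hi:
                accumulators[doc_id] += tf  # toy score: summed term frequency


def broker(query):
    accumulators = [0] * (NUM_DOCS + 1)  # shared array; index 0 is unused
    workers = [
        Thread(target=search_process, args=(query, lo, hi, accumulators))
        for lo, hi in RANGES
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    hits = [(d, s) for d, s in enumerate(accumulators) if s > 0]
    # Final ranking: highest score first, ties broken by document id.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


print(broker(["hot", "pot"]))  # [(6, 2), (1, 1), (3, 1), (4, 1), (5, 1)]
```

Because each process only touches the accumulator slots of its own documents, the shared array needs no locking in this sketch.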

Inverted Files Logical Document Partitioning • Inverted File Construction • The indexer partitions the documents among the processors; • Each indexing process generates a batch of inverted lists, sorted by indexing item; • A merge step is performed to create the final inverted file.
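The construction steps above can be sketched with the example collection from these slides. This is an illustration only: the batches are built one after another rather than in parallel, tokenization is a plain whitespace split on lowercased text with punctuation removed, and the helper names are hypothetical.

```python
import heapq
from collections import Counter

DOCS = {
    1: "pease porridge hot",
    2: "pease porridge cold",
    3: "pease porridge in the pot",
    4: "pease porridge hot pease porridge not cold",
    5: "pease porridge cold pease porridge not hot",
    6: "pease porridge hot in the pot",
}


def index_batch(doc_ids):
    """One indexing process: emit (term, doc_id, tf) triples sorted by term."""
    batch = []
    for doc_id in doc_ids:
        for term, tf in Counter(DOCS[doc_id].split()).items():
            batch.append((term, doc_id, tf))
    return sorted(batch)


def merge_batches(batches):
    """Merge step: combine the sorted batches into the final inverted file."""
    inverted = {}
    for term, doc_id, tf in heapq.merge(*batches):
        inverted.setdefault(term, []).append((doc_id, tf))
    return inverted


batches = [index_batch(part) for part in ([1, 2], [3, 4], [5, 6])]
inverted_file = merge_batches(batches)
print(inverted_file["pease"])
# [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)]
```

Since every batch is already sorted by indexing item, the merge is a single linear pass over all batches.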

Inverted Files Physical Document Partitioning • Data Partitioning • The documents are physically partitioned into separate subcollections, one for each parallel processor; • Each subcollection has its own inverted file.

Inverted Files Physical Document Partitioning • Query Evaluation • The broker distributes the query to all of the parallel search processes; • Each parallel search process evaluates the query on its portion of the document collection, producing an intermediate hit-list; • The broker collects the intermediate hit-lists from all of the parallel search processes and merges them into a final hit-list.
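A minimal sketch of this broker and search-process interaction follows. The three partition indexes are taken from the physical document partitioning example in these slides; the thread pool, the summed term-frequency score, and the `top_k` parameter are assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# One complete inverted file per processor: term -> [(doc_id, tf)].
PARTITIONS = [
    {"cold": [(2, 1)], "hot": [(1, 1)],
     "pease": [(1, 1), (2, 1)], "porridge": [(1, 1), (2, 1)]},
    {"cold": [(4, 1)], "hot": [(4, 1)], "in": [(3, 1)], "not": [(4, 1)],
     "pease": [(3, 1), (4, 2)], "porridge": [(3, 1), (4, 2)],
     "pot": [(3, 1)], "the": [(3, 1)]},
    {"cold": [(5, 1)], "hot": [(5, 1), (6, 1)], "in": [(6, 1)], "not": [(5, 1)],
     "pease": [(5, 2), (6, 1)], "porridge": [(5, 2), (6, 1)],
     "pot": [(6, 1)], "the": [(6, 1)]},
]


def rank_key(hit):
    """Order hits by descending score, then ascending document id."""
    doc_id, score = hit
    return (-score, doc_id)


def search_partition(index, query):
    """Evaluate the query on one subcollection; return a sorted hit-list."""
    scores = {}
    for term in query:
        for doc_id, tf in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return sorted(scores.items(), key=rank_key)


def broker(query, top_k=3):
    """Send the query to every partition, then merge the hit-lists."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        hit_lists = list(pool.map(lambda idx: search_partition(idx, query),
                                  PARTITIONS))
    merged = heapq.merge(*hit_lists, key=rank_key)
    return list(merged)[:top_k]


print(broker(["pease", "cold"]))  # [(4, 3), (5, 3), (2, 2)]
```

The merge works on already sorted hit-lists, so the broker never has to re-score documents.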

Inverted Files Physical Document Partitioning • Inverted File Construction • Each processor creates, in parallel, its own complete index corresponding to its document partition; • A merge step is performed to accumulate the global statistics for all of the partitions and distribute them to each of the partition dictionaries.
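The statistics merge can be sketched as below. The slides do not say which global statistics are accumulated; this sketch assumes document frequency (the number of documents containing each term), with local counts derived from the example partitions. The `merge_statistics` helper and the dictionary layout are hypothetical.

```python
from collections import Counter

# Local document frequencies per partition (from the example collection).
LOCAL_DF = [
    {"cold": 1, "hot": 1, "pease": 2, "porridge": 2},
    {"cold": 1, "hot": 1, "in": 1, "not": 1,
     "pease": 2, "porridge": 2, "pot": 1, "the": 1},
    {"cold": 1, "hot": 2, "in": 1, "not": 1,
     "pease": 2, "porridge": 2, "pot": 1, "the": 1},
]


def merge_statistics(local_dfs):
    """Accumulate global document frequencies, then attach them to each
    partition dictionary next to the local value."""
    global_df = Counter()
    for local in local_dfs:
        global_df.update(local)  # adds the counts term by term
    return [
        {term: {"local_df": df, "global_df": global_df[term]}
         for term, df in local.items()}
        for local in local_dfs
    ]


dictionaries = merge_statistics(LOCAL_DF)
print(dictionaries[0]["hot"])  # {'local_df': 1, 'global_df': 4}
```

With global values in every partition dictionary, each search process can compute collection-wide term weights without contacting the other partitions.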

Inverted Files Term Partitioning • Data Partitioning • Inverted lists are spread across the processors.

Inverted Files Term Partitioning • Query Evaluation • The query is decomposed into indexing items and each indexing item is sent to the processor that holds the corresponding inverted list; • The processors create hit-lists with partial document scores and return them to the broker; • The broker combines the hit-lists.
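The term-partitioned evaluation can be sketched as follows. The processors are simulated sequentially, the score is a plain sum of term frequencies, and the assignment of terms to the three processors is an assumption based on the example collection in these slides; the helper names are hypothetical.

```python
from collections import defaultdict

# Each processor holds complete inverted lists for its own terms only.
PROCESSORS = [
    {"cold": [(2, 1), (4, 1), (5, 1)],
     "hot": [(1, 1), (4, 1), (5, 1), (6, 1)],
     "in": [(3, 1), (6, 1)]},
    {"not": [(4, 1), (5, 1)],
     "pease": [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)],
     "porridge": [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)]},
    {"pot": [(3, 1), (6, 1)], "the": [(3, 1), (6, 1)]},
]
TERM_TO_PROC = {term: p for p, lists in enumerate(PROCESSORS) for term in lists}


def partial_scores(proc_id, terms):
    """One processor: partial document scores for the terms it holds."""
    scores = defaultdict(int)
    for term in terms:
        for doc_id, tf in PROCESSORS[proc_id][term]:
            scores[doc_id] += tf
    return dict(scores)


def broker(query):
    # Decompose the query and route each term to the processor that owns it.
    routed = defaultdict(list)
    for term in query:
        if term in TERM_TO_PROC:
            routed[TERM_TO_PROC[term]].append(term)
    # Combine the partial hit-lists by summing scores per document.
    total = defaultdict(int)
    for proc_id, terms in routed.items():
        for doc_id, score in partial_scores(proc_id, terms).items():
            total[doc_id] += score
    return sorted(total.items(), key=lambda hit: (-hit[1], hit[0]))


print(broker(["hot", "not", "pot"]))
# [(4, 2), (5, 2), (6, 2), (1, 1), (3, 1)]
```

Unlike document partitioning, the broker here must add scores for the same document arriving from different processors, not just interleave disjoint hit-lists.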

Inverted Files Term Partitioning • Inverted File Construction • The inverted file is created using the parallel construction technique described for logical document partitioning.

Example • Document collection

Document   Text
1          Pease porridge hot
2          Pease porridge cold
3          Pease porridge in the pot
4          Pease porridge hot, pease porridge not cold
5          Pease porridge cold, pease porridge not hot
6          Pease porridge hot in the pot

Example • Inverted File

Dictionary   Inverted Lists
cold         <2,1> <4,1> <5,1>
hot          <1,1> <4,1> <5,1> <6,1>
in           <3,1> <6,1>
not          <4,1> <5,1>
pease        <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
porridge     <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
pot          <3,1> <6,1>
the          <3,1> <6,1>

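Example • Logical Document Partitioning • [Figure: the dictionary (cold, hot, in, not, pease, porridge, pot, the) and the inverted list for term “pease”: <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>. The dictionary entry for “pease” holds pointers P1, P2, and P3 into this list.]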

Example • Physical Document Partitioning

P1
cold         <2,1>
hot          <1,1>
pease        <1,1> <2,1>
porridge     <1,1> <2,1>

P2
cold         <4,1>
hot          <4,1>
in           <3,1>
not          <4,1>
pease        <3,1> <4,2>
porridge     <3,1> <4,2>
pot          <3,1>
the          <3,1>

P3
cold         <5,1>
hot          <5,1> <6,1>
in           <6,1>
not          <5,1>
pease        <5,2> <6,1>
porridge     <5,2> <6,1>
pot          <6,1>
the          <6,1>

Example • Term Partitioning

Processor   Term         Inverted List
P1          cold         <2,1> <4,1> <5,1>
P1          hot          <1,1> <4,1> <5,1> <6,1>
P1          in           <3,1> <6,1>
P2          not          <4,1> <5,1>
P2          pease        <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
P2          porridge     <1,1> <2,1> <3,1> <4,2> <5,2> <6,1>
P3          pot          <3,1> <6,1>
P3          the          <3,1> <6,1>

Conclusion • The task of indexing and searching in very large text collections is costly; • Faster indexing and searching algorithms are always desirable, and the use of parallel hardware is an obvious alternative; • We discussed two possible organizations for the document collection index on a MIMD parallel architecture: • Document partitioning; • Term partitioning.

Conclusion • Document partitioning affords simpler inverted index construction and maintenance than term partitioning; • When term distributions in the documents and queries are more skewed, document partitioning performs better; • When terms are uniformly distributed in user queries, term partitioning performs better.

Additional References
Lawrence, S., Giles, C.L. 1999. Accessibility of Information on the Web. Nature, Vol. 400, pp. 107-109.
Ribeiro-Neto, B.A., Barbosa, R.A. 1998. Query Performance for Tightly Coupled Distributed Digital Libraries. Digital Libraries 98, pp. 182-190.
Ribeiro-Neto, B.A., Moura, E.S., Neubert, M.S., Ziviani, N. 1999. Efficient Distributed Algorithms to Build Inverted Files. SIGIR’99, pp. 105-112.
