Indexing CSCI 572: Information Retrieval and Search Engines Summer 2010
Outline • Building your search corpus • Differences from RDBMS • The Document/Field Model • The Flattening Process • Understanding Field Types • Challenges
Building an index • Once you have content in the form of metadata and extracted text, you need to persist that content • For querying • For retrieval and display • How should we persist the content?
Some considerations • Extracted metadata is typically unstructured • It’s not something that necessarily maps to a set of Entities (Tables) with rows and consistent columns • Documents have different, sometimes non-overlapping, metadata models • Dublin Core • Word • Climate Forecast • The write/access patterns are a bit different • Think crawling strategies…
Databases versus Search Indices • Databases are optimized for • Write often • Read often • Transactional properties in the face of the above • Atomic – operations should occur atomically, or be rolled back • Consistent – writes, etc., should be propagated in a consistent fashion, leaving the data in a valid state • Isolated – transactions and modifications are limited to the entities that they modify and do not interfere with one another • Durable – committed writes are expected to survive failures, including catastrophic ones
Databases versus Search Indices • Search Indices are optimized for • Write infrequently • Read very frequently • Based on a loose, unstructured Document model • Limited transactional properties • ACID not necessary • Onus to produce results quickly • Rollback is usually not supported • Subject to corruption • Extremely efficient in terms of query/read times by exploiting the above
The Document Field Model • A method of dealing with unstructured data and its persistence to an index • Treat each indexable content item as a “Document” • Each Document has 1…N named Fields • Each Field has 1…N values • Values can be: • Text • Numerical • Hierarchical (made up of other fields) • Complex (Geospatial, etc.) • [Diagram: Field1: v…vn; Field2: v…vn]
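The model on this slide can be sketched as a small data structure. This is a minimal illustration only, not the API of any particular search library; the `Document` class and its `add` and `get` methods are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# A value may be text, a number, or another Document (a hierarchical value).
Value = Union[str, int, float, "Document"]


@dataclass
class Document:
    """An indexable content item: named fields, each holding 1..N values."""
    fields: Dict[str, List[Value]] = field(default_factory=dict)

    def add(self, name: str, *values: Value) -> "Document":
        # Every field must carry at least one value.
        if not values:
            raise ValueError("a field needs at least one value")
        self.fields.setdefault(name, []).extend(values)
        return self

    def get(self, name: str) -> List[Value]:
        # Documents need not share fields, so a missing field is just empty.
        return self.fields.get(name, [])


doc = (
    Document()
    .add("title", "CS572 Web Page")
    .add("length", 10000)
    .add("author", "Chris Mattmann", "Univ. of Southern California")
)
print(doc.get("author"))
```

Keeping every field multi-valued, even when it holds a single value, avoids a special case when a second author or address shows up later.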
Example: two web pages • Document 1 • Field [title], Value(s): “Chris Mattmann’s Web Page” • Type: string (text) • Field [length], Value(s): 3026 • Type: int (assumed to be bytes) • Field [author], Value(s): Chris Mattmann • Document 2 • Field [title], Value(s): “CS572 Web Page” • Type: string (text) • Field [length], Value(s): 10000 • Type: int (assumed to be bytes) • Field [author], Value(s): Chris Mattmann, Univ. of Southern California
Example: a word document • Document 3 • Field [title], Value(s): “My CS572 Final Project” • Type: string (text) • Field [length], Value(s): 30012 • Type: int (assumed to be bytes) • Field [wordcount], Value(s): 2912 • Type: int • Field [mswordversion], Value(s): 2008, Mac • Type: string (text)
Apples to Oranges • Whether it’s an HTML page, a Word document, a PDF file, etc. • We can still use the Document/Field model to represent the content as it is indexed • The Document Field model works for metadata, but also for extracted text • Define a custom text field containing all extracted, searchable text
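As a rough sketch of the catch-all text field idea, the three example documents can be stored side by side even though their fields differ. The field name `text`, the helper functions, and the tiny inverted index are assumptions made for this illustration. The examples carry no extracted body text, so the string-valued metadata stands in for it here.

```python
import re
from collections import defaultdict

# The three example documents, each with its own set of fields.
docs = {
    1: {"title": ["Chris Mattmann's Web Page"], "length": [3026],
        "author": ["Chris Mattmann"]},
    2: {"title": ["CS572 Web Page"], "length": [10000],
        "author": ["Chris Mattmann", "Univ. of Southern California"]},
    3: {"title": ["My CS572 Final Project"], "length": [30012],
        "wordcount": [2912], "mswordversion": ["2008, Mac"]},
}


def searchable_text(doc):
    """Join every string value into one catch-all text value."""
    return " ".join(v for values in doc.values() for v in values
                    if isinstance(v, str))


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Map each term to the ids of the documents whose text field contains it.
index = defaultdict(set)
for doc_id, doc in docs.items():
    doc["text"] = [searchable_text(doc)]
    for term in tokenize(doc["text"][0]):
        index[term].add(doc_id)


def search(term):
    """Return the sorted ids of documents containing the term."""
    return sorted(index.get(term.lower(), set()))


print(search("CS572"))
print(search("mattmann"))
```

A web page and a Word document end up searchable through the same field, which is the point of the slide: the index does not care where the text came from.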
What about structure? • For example, let’s say we are extracting Person records from an RDBMS to index • We’ve got 2 tables • Person • Attribute: id, int PK UNIQUE AUTO INCREMENT • Attribute: first_name, VARCHAR(255) • Attribute: last_name, VARCHAR(255) • PersonAddress • Attribute: person_id, FK to Person.id • Attribute: address_txt, VARCHAR(255) • Attribute: zipcode, int
What about structure? • Example records • Person: • id, first_name, last_name • 1, Chris, Mattmann • 2, Homer, Simpson • PersonAddress: • person_id, address_txt, zipcode • 1, 1234 Joe Lane, 91354 • 2, 6344 Evergreen Terrace, Springfield, IL, 60999
What about structure? • How to get the aforementioned rows into the Document Field model? • Flatten the structure • Document 1 • Field [first_name], Value(s): Chris • Type: string (text) • Field [last_name], Value(s): Mattmann • Type: string (text) • Field [id], Value(s): 1 • Type: int • Field [address_txt], Value(s): 1234 Joe Lane • Type: string (text) • Field [zipcode], Value(s): 91354 • Type: int • Document 2 • Field [first_name], Value(s): Homer • Type: string (text) • Field [last_name], Value(s): Simpson • Type: string (text) • Field [id], Value(s): 2 • Type: int • Field [address_txt], Value(s): 6344 Evergreen Terrace, Springfield, IL • Type: string (text) • Field [zipcode], Value(s): 60999 • Type: int
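One way to carry out the flattening is to join the two tables and emit one document per person. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the RDBMS; the `flatten` function is a hypothetical helper, and the column types are SQLite approximations of the schema on the earlier slide.

```python
import sqlite3

# An in-memory SQLite database stands in for the source RDBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        first_name VARCHAR(255),
        last_name VARCHAR(255)
    );
    CREATE TABLE PersonAddress (
        person_id INTEGER REFERENCES Person(id),
        address_txt VARCHAR(255),
        zipcode INTEGER
    );
    INSERT INTO Person VALUES (1, 'Chris', 'Mattmann'), (2, 'Homer', 'Simpson');
    INSERT INTO PersonAddress VALUES
        (1, '1234 Joe Lane', 91354),
        (2, '6344 Evergreen Terrace, Springfield, IL', 60999);
""")


def flatten(conn):
    """Join Person with PersonAddress and emit one flat document per person."""
    rows = conn.execute("""
        SELECT p.id, p.first_name, p.last_name, a.address_txt, a.zipcode
        FROM Person p
        LEFT JOIN PersonAddress a ON a.person_id = p.id
        ORDER BY p.id
    """)
    docs = {}
    for pid, first_name, last_name, address_txt, zipcode in rows:
        doc = docs.setdefault(pid, {
            "id": [pid],
            "first_name": [first_name],
            "last_name": [last_name],
        })
        # A person with several addresses gets multi-valued fields;
        # a person with none simply lacks the address fields.
        if address_txt is not None:
            doc.setdefault("address_txt", []).append(address_txt)
            doc.setdefault("zipcode", []).append(zipcode)
    return list(docs.values())


flat_docs = flatten(conn)
for d in flat_docs:
    print(d)
```

Note that the foreign key disappears in the output: the relationship is folded into the document itself, which is what makes each document independent of the others.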
Benefits of the Document Field model • Documents are independent, wholly contained entities • Reduces ACID dependencies • Increases the ability to become eventually consistent • Fields can be indexed and stored in different ways • Reformatted on entry into the index, and reformatted on the way out • Geohash is a great example of this • Analyzers – implications on query model • Tokenizers – implications on query model
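To illustrate why analyzers and tokenizers have implications for the query model, here is a small sketch in which a field value is reformatted on its way into the index. The `analyze` function, the stopword list, and the `matches` helper are all assumptions for this example and do not reflect any specific library's analyzer.

```python
import re

# A tiny, assumed stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "of", "my"}


def analyze(text):
    """Lowercase, split into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


# The title is reformatted on entry: only the analyzed terms are indexed.
indexed_terms = set(analyze("My CS572 Final Project"))


def matches(query, analyze_query=True):
    """Check whether every query term is among the indexed terms."""
    terms = analyze(query) if analyze_query else query.split()
    return bool(terms) and all(t in indexed_terms for t in terms)


print(matches("CS572 Project"))                       # query analyzed like the index
print(matches("CS572 Project", analyze_query=False))  # raw query terms
```

The second call fails to match because the raw query terms keep their capitalization while the indexed terms were lowercased. Whatever transformation is applied at index time generally has to be applied to queries as well.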
Challenges • Reducing structured data to unstructured, flattened data isn’t as easy as the cooked-up example suggests • Imagine having to encode values to preserve ordering in some fashion • Requires a deep understanding of the data, and methodologies for naming fields and ordering values • Loss of ACID properties makes it difficult to leverage the index structure for search directly in transactional systems • Have to stand up search as a separate service outside of the data management system • Determining the right tuning parameters for the index • Max Buffer Size, When to Optimize, When to Merge, etc.
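The ordering challenge can be made concrete with a small sketch. If numeric values are stored as plain strings, they sort lexicographically rather than numerically. One simple remedy, shown here as an assumption rather than as the method any given index uses, is to zero-pad values to a fixed width; `WIDTH` and `encode_int` are hypothetical names.

```python
WIDTH = 10  # assumed maximum number of digits for any value in the field


def encode_int(value, width=WIDTH):
    """Zero-pad a non-negative int so string order matches numeric order."""
    if value < 0 or value >= 10 ** width:
        raise ValueError("value does not fit in the fixed width")
    return str(value).zfill(width)


# Byte lengths from the three example documents.
lengths = [3026, 10000, 30012]

# Plain strings sort character by character, so 3026 lands last.
naive = sorted(str(n) for n in lengths)

# Fixed-width encodings sort in the same order as the numbers themselves.
encoded = sorted(encode_int(n) for n in lengths)

print(naive)
print([int(s) for s in encoded])
```

This handles only non-negative integers below a known bound. Negative numbers, floating-point values, and dates each need their own encoding, which is part of why the slide calls for a deep understanding of the data.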
Wrapup • Introduction to the Document Field indexing model • Differences between traditional RDBMS models and search indices • Know when and where to use each • Search indices are optimized for frequent reads and infrequent writes