Architecture of Information Retrieval Systems

CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems

Course Administration
CS 490 and CS 790 Independent Research Projects: Next semester, the National Science Digital Library (NSDL) will have several projects that are suitable for independent research. Topics will include selective web crawling, annotation services and information discovery. If you are interested, send email to wya@cs.cornell.edu.
Assignment 4 due date: If you want an extension to 5 p.m. on Monday, send email to cs430@cs.cornell.edu.

Cornell SIGCHI - Student Chapter
Interested in user interface design? Cornell has a chapter of the ACM's special interest group for computer-human interaction (SIGCHI).
Meeting tonight in the Information Science Building (301 College Ave.) at 5 p.m. For info: email ajf15 or google: cornell sigchi
Benefits of membership include but are not limited to: access to a network of students and faculty interested in HCI, help with securing professional or graduate positions in HCI, access to information about courses and research on campus, access to the resources of the Cornell HCI lab, regular meetings to share ideas and have fun, and a cool magazine if you join the national organization.

Basic Architecture 1: Single Homogeneous Collection
• Documents and indexes are held on a single computer system (may be several computers).
• The user interface and search methods are selected for the specific service.
[Diagram: Index, Documents]
Examples: Medline (medical information), Westlaw (legal information)

Basic Architecture 2: Several Similar Collections -- One Computer System
• Several more or less similar collections are held on a single computer system.
• Each collection is indexed separately using the same software, procedures, algorithms, etc. (with minor differences, e.g., stoplists).
• The user interface is the same (or very similar) for each service.
Examples: OCLC's FirstSearch

Distributed Architecture 1: Standard Search Protocols
Strict adherence to standards allows any user interface to search any conforming search service.
[Diagram: "Find x" queries]

Distributed Architecture 1: Standard Search Protocols
Example: Z 39.50 Family of Standards for Searching Library Catalogs
Content: Anglo American Cataloging Rules
Structure of Content: MARC
Encoding Rules: Basic Encoding Rules (character sets, separators, etc.)
Message Passing Protocol: Z 39.50
Query Format: Bib 1 (Boolean), Type 102 (full text)
In addition, there are the underlying network standards, e.g., the Internet suite of protocols.

Distributed Architecture 1: Standard Search Protocols
Example: Z 39.50 Family of Standards for Searching Library Catalogs
The Z 39.50 family of standards has proved successful in a tightly knit community, where:
• There is a strong tradition of standardization, with many professionally trained people.
• The categories of material change gradually, allowing a slow-moving standardization process.
The standardization approach has failed where these two criteria are not met.
Historic note: WAIS was based on an early version of Z 39.50.

Distributed Architecture 2: Broadcast Search
An interface server broadcasts a query to each collection, combines the results and returns them to the user.
[Diagram: Interface Service, "Find x" query]
Examples: Dienst (digital library protocol), Web metasearch services

Distributed Architecture 2: Broadcast Search
Interface Service: Can be a separate server (e.g., CGI), or run on the user's computer (e.g., applet).
Protocols: In the simple version, each collection must support the same standards and protocols (e.g., Z 39.50, HTTP, etc.).

Distributed Architecture 2: Broadcast Search
Problems with Broadcast Search
• Performance: If any collection does not respond, the Interface Server waits for a time out.
• Recall: If any collection does not respond, documents in that collection are not found.
• Ranking and duplicates: There are great difficulties in reconciling ranked lists from different collections.
Broadcast searching is as bad as its weakest link!
Conclusion: Broadcast search does not scale beyond about five or ten collections, even with strict standardization.
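The three problems above can be seen in a minimal Python sketch of an interface server. Everything here is an assumption made for illustration: the collection names, the search functions, the scores, and the 0.2 second time out are hypothetical, and the merge step (keep the best raw score per document) is only a naive stand-in, since scores from different collections are not really comparable.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def broadcast_search(query, collections, timeout=0.2):
    """Send the query to every collection in parallel and merge what arrives in time."""
    pool = ThreadPoolExecutor(max_workers=len(collections))
    futures = {pool.submit(search, query): name for name, search in collections.items()}
    done, _ = wait(futures, timeout=timeout)   # performance: we wait for the time out
    pool.shutdown(wait=False)                  # do not block on slow collections

    best = {}      # doc_id -> best raw score seen (naive duplicate handling)
    missing = []   # recall: collections whose documents will not be found
    for future, name in futures.items():
        if future not in done or future.exception() is not None:
            missing.append(name)
            continue
        for doc_id, score in future.result():
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Naive ranking: raw scores from different collections are not comparable.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked, missing


# Hypothetical collections: two respond quickly, one is too slow.
def fast_a(query):
    return [("doc1", 0.9), ("doc2", 0.4)]


def fast_b(query):
    return [("doc2", 0.7), ("doc3", 0.5)]


def slow_c(query):
    time.sleep(1.0)
    return [("doc4", 1.0)]


results, missing = broadcast_search("x", {"A": fast_a, "B": fast_b, "C": slow_c})
print(results)   # merged list without doc4
print(missing)   # collections that did not answer in time
```

In this sketch the slow collection costs the user the full time out and its best document never appears, which is the "weakest link" behaviour described above.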

Distributed Architecture 3: Centralized Search Services
Batch indexing: Metadata about all items is accumulated in a central system.
Real-time searching: The user (a) searches the central system, and (b) retrieves items from collections.
[Diagram: Search Service; "Find x" query; search, retrieve]
Examples: Union catalogs, Web search services
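The split between batch indexing and real-time searching can be sketched in Python as follows. This is a toy illustration under stated assumptions: the collections, item identifiers, titles, and content strings are all hypothetical, and the "central system" is just an in-memory inverted index over titles.

```python
from collections import defaultdict

# Hypothetical collections: collection name -> item id -> metadata and content.
COLLECTIONS = {
    "geology": {
        "g1": {"title": "Plate tectonics animation", "content": "<animation data>"},
        "g2": {"title": "Volcano image archive", "content": "<image data>"},
    },
    "physics": {
        "p1": {"title": "Pendulum simulation", "content": "<applet data>"},
        "p2": {"title": "Wave animation", "content": "<movie data>"},
    },
}


def batch_index(collections):
    """Batch step: accumulate metadata from every collection in one central index."""
    index = defaultdict(set)
    for collection_name, items in collections.items():
        for item_id, item in items.items():
            for term in item["title"].lower().split():
                index[term].add((collection_name, item_id))
    return index


def search(index, term):
    """Real-time step (a): search only the central system; returns item locators."""
    return sorted(index.get(term.lower(), set()))


def retrieve(collections, locator):
    """Real-time step (b): fetch the item itself from its home collection."""
    collection_name, item_id = locator
    return collections[collection_name][item_id]["content"]


central_index = batch_index(COLLECTIONS)
hits = search(central_index, "animation")
print(hits)
print([retrieve(COLLECTIONS, hit) for hit in hits])
```

Unlike broadcast search, no collection is contacted at query time until the user asks for a specific item.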

Distributed Architecture 3: Centralized Search Services
Gathering by Web Crawling
• Entirely automatic, low cost. Highly efficient at gathering very large amounts of material.
but ...
• Can only gather openly accessible materials.
• Cannot gather material in databases unless explicit URLs are known.
• Cannot easily make use of metadata provided by collections.
Examples: Web search services.
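A minimal Python sketch of gathering by crawling, assuming a tiny in-memory "web" in place of real HTTP fetches. The URLs and page contents are hypothetical. The database record at the bottom is never gathered because no page links to its explicit URL, which illustrates the second limitation above.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical web: URL -> HTML. A real crawler would fetch pages over HTTP.
WEB = {
    "http://example.org/": '<a href="http://example.org/a">A</a> '
                           '<a href="http://example.org/b">B</a>',
    "http://example.org/a": '<a href="http://example.org/">home</a>',
    "http://example.org/b": '<form action="http://example.org/db"></form>',
    "http://example.org/db?id=7": "record 7, reachable only through the form",
}


def crawl(seed, fetch):
    """Breadth-first crawl from a seed URL; fetch(url) returns HTML or None."""
    seen = {seed}
    queue = deque([seed])
    gathered = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # not openly accessible: skip it
            continue
        gathered[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return gathered


pages = crawl("http://example.org/", WEB.get)
print(list(pages))
```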

Distributed Architecture 3: Centralized Search Services
Harvesting
• Each collection makes a copy of its metadata available from a server associated with the collection.
• A search service harvests metadata from all collections on a regular cycle and builds a central search system.
Advantages ...
• Can index material from databases without explicit URLs.
• Allows authentication and selection of material.
but ...
• Requires that collections have metadata and support a harvesting protocol (e.g., Open Archives Initiative Protocol for Metadata Harvesting).
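As an illustration of one harvesting cycle, the Python sketch below builds an OAI-PMH ListRecords request and extracts identifiers and Dublin Core titles from a response. The base URL, record identifiers, and titles are made up, and the response is an abbreviated sample embedded in the code rather than the output of a real server. Only the verb, the metadataPrefix and from arguments, and the XML namespaces come from the OAI-PMH specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, since=None):
    """Build a ListRecords request; 'since' limits the harvest to recent changes."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if since:
        params["from"] = since
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI + "record"):
        identifier = record.findtext(OAI + "header/" + OAI + "identifier")
        title = record.findtext(".//" + DC + "title")
        records.append({"identifier": identifier, "title": title})
    return records


# Abbreviated, made-up response from a hypothetical collection server.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Plate Tectonics Animation</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Pendulum Simulation</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

request_url = list_records_url("http://example.org/oai", since="2002-11-01")
harvested = parse_records(SAMPLE_RESPONSE)
print(request_url)
print(harvested)
```

A search service would repeat this for every registered collection on its regular cycle and feed the harvested records into the central index.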

The National Science Digital Library (NSDL) The Integration Task is to provide a coherent set of collections and services across great diversity (all digital collections relevant to science education). http://nsdl.org/

Interoperability in the NSDL
The Problem
Conventional approaches require partners to support agreements (technical, content, and business).
But NSDL needs thousands of very different partners, most of whom are not directly part of the NSDL program.
The challenge is to create incentives for independent digital libraries to adopt agreements.

The NSDL Search Service
Full Text or Metadata?
Full text indexing is excellent, but is not possible for all materials (non-textual, no access for indexing).
Comprehensive metadata is available for very few of the materials.
What Architecture to Use?
Few collections support an established search protocol (e.g., Z 39.50).

Architecture for Searching
Basic Assumptions
• The integration team will not manage any collections.
• The integration team will not create any metadata.

Options for Effective Information Retrieval
• Comprehensive metadata with Boolean retrieval (e.g., monograph catalog). Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.
• Full text indexing with ranked retrieval (e.g., news articles). Excellent for relatively homogeneous textual material.
• Full text indexing with contextual information and ranked retrieval (e.g., Google). Excellent for mixed textual information with rich structure.
• Contextual information about non-textual materials and ranked retrieval (e.g., Google image retrieval). Promising, but still experimental.
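The difference between Boolean and ranked retrieval can be sketched in a few lines of Python. This is a simplification under stated assumptions: the four documents are made up, both methods run over the same short texts (rather than curated metadata versus full text), and the ranking uses a plain tf-idf sum, which is one common weighting and not necessarily the one used by any service named above.

```python
import math
from collections import Counter

# Hypothetical documents.
DOCS = {
    "d1": "digital library search protocols",
    "d2": "library catalog search",
    "d3": "web crawling for digital collections",
    "d4": "search engines",
}


def tokenize(text):
    return text.lower().split()


def boolean_and(query, docs):
    """Boolean retrieval: return documents containing every query term, unranked."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))


def ranked(query, docs):
    """Ranked retrieval: score documents by summed tf-idf over the query terms."""
    n = len(docs)
    counts = {d: Counter(tokenize(text)) for d, text in docs.items()}
    scores = {}
    for term in tokenize(query):
        df = sum(1 for c in counts.values() if term in c)
        if df == 0:
            continue
        idf = math.log(n / df)   # rarer terms count for more
        for d, c in counts.items():
            if c[term]:
                scores[d] = scores.get(d, 0.0) + c[term] * idf
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


print(boolean_and("digital search", DOCS))
print(ranked("digital search", DOCS))
```

The Boolean query returns only the one document that matches both terms, while the ranked query also returns partial matches in order of estimated relevance.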

The Spectrum of Interoperability
Level | Agreements | Example
Federation | Strict use of standards (syntax, semantic, and business) | AACR, MARC, Z 39.50
Harvesting | Digital libraries expose metadata; simple protocol and registry | Open Archives metadata harvesting
Gathering | Digital libraries do not cooperate; services must seek out information | Web crawlers and search engines

The Metadata Repository
The metadata repository is a resource for service providers. It holds information about every collection and item known to the NSDL, including contextual information.
[Diagram: Services, Metadata repository, Users, Collections]

Search Service
[Diagram: Metadata Repository, Harvest, Crawl, Collections, SDLIP, Search and Discovery Service, Portals]

Acknowledgements
The NSDL is a program of the National Science Foundation's Directorate for Education and Human Resources, Division of Undergraduate Education.
The NSDL Core Integration is a collaboration between the University Corporation for Atmospheric Research (Dave Fulker), Columbia University (Kate Wittenberg) and Cornell University (Bill Arms). The Technical Director is Carl Lagoze (Cornell University).
The Search Service is being developed by James Allan and colleagues at the University of Massachusetts, Amherst.

CS 430 Information Discovery The End!
