
Document Indexing and Scoring in Lucene and Nutch

Learn how to index documents with Lucene, create a Lucene index, apply Lucene's scoring function, and search efficiently within the Nutch architecture.



Presentation Transcript


  1. Document Indexing and Scoring in Lucene and Nutch • IST 441, Spring 2009 • Instructor: Dr. C. Lee Giles • Presenter: Saurabh Kataria

  2. Outline • Architecture of Lucene and Nutch • Indexing in Lucene • Searching in Lucene • Lucene’s scoring function

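  3. Lucene’s Open Architecture
     [Figure: pipeline with four stages: Crawling, Parsing, Indexing, Searching]
     • Crawling: File System (FS Crawler), WWW (Larm), IMAP Server
     • Parsing: TXT parser, PDF parser, HTML parser (input formats: PDF, HTML, DOC, TXT, …), producing Lucene Documents
     • Indexing: indexers build the Index using an analyzer (Stop Analyzer, Standard Analyzer, CN/DE/… Analyzer)
     • Searching: searchers read the Index
     Spring 2008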

  4. Nutch’s architecture • [Figure: Nutch architecture diagram, courtesy of Doug Cutting’s presentation slides at WWW 2004]

  5. Nutch’s architecture
     • Searcher: Given a query, it must quickly find a small relevant subset of a corpus of documents and present it. Finding a large relevant subset is normally done with an inverted index of the corpus; the searcher then ranks the documents within that set to produce the most relevant ones, which must be summarized for display.
     • Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene to store the indexes.
     • Web DB: Stores the document contents for indexing and for later summarization by the searcher, along with information such as the link structure of the document space and the time each document was last fetched.
     • Fetcher: Requests web pages, parses them, and extracts links from them. Nutch’s robot has been written entirely from scratch.

  6. Lucene’s index (conceptual) [Figure: an Index contains Documents; each Document contains Fields; each Field has a Name and a Value.]

  7. Create a Lucene index (step 1)
     • Create a Lucene document and add fields

     import org.apache.lucene.document.Document;
     import org.apache.lucene.document.Field;

     // Builds a document with an indexed-only "text" field and a stored, indexed "title" field.
     public Document createDoc(String title, String body) {
         Document doc = new Document();
         doc.add(new Field("text", body, Field.Store.NO, Field.Index.TOKENIZED));
         doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
         return doc;
     }

  8. Create a Lucene index (step 2)
     • Create an Analyzer
     • Options
       • WhitespaceAnalyzer
         • divides text at whitespace
       • SimpleAnalyzer
         • divides text at non-letters
         • converts to lower case
       • StopAnalyzer
         • behaves like SimpleAnalyzer
         • removes stop words
       • StandardAnalyzer
         • good for most European languages
         • removes stop words
         • converts to lower case

  9. Create a Lucene index (step 2) • An example of analyzing a document
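
The figure for this slide did not survive in the transcript. As a rough stand-in, the sketch below imitates in plain Java what the first three analyzer options are described as doing. It does not call Lucene; the class name `AnalyzerSketch`, its method names, and the tiny stop-word list are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class AnalyzerSketch {
    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "to", "of"));

    // Imitates WhitespaceAnalyzer: split at whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.trim().split("\\s+")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }

    // Imitates SimpleAnalyzer: split at non-letters and convert to lower case.
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }

    // Imitates StopAnalyzer: SimpleAnalyzer behaviour plus stop-word removal.
    static List<String> stop(String text) {
        List<String> out = simple(text);
        out.removeAll(STOP_WORDS);
        return out;
    }

    public static void main(String[] args) {
        String text = "The players of Penn State";
        System.out.println(whitespace(text)); // [The, players, of, Penn, State]
        System.out.println(simple(text));     // [the, players, of, penn, state]
        System.out.println(stop(text));       // [players, penn, state]
    }
}
```

The same input yields a different token stream under each option, which is why the analyzer used at indexing time should normally match the one used at query time.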

  10. Create a Lucene index (step 3)
     • Create an index writer and add the Lucene document to the index

     import java.io.IOException;
     import org.apache.lucene.analysis.SimpleAnalyzer;
     import org.apache.lucene.document.Document;
     import org.apache.lucene.index.IndexWriter;
     import org.apache.lucene.store.FSDirectory;

     public void writeDoc(Document doc, String idxPath) {
         try {
             // The final "true" creates a new index, replacing any existing one at idxPath.
             IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(idxPath, true), new SimpleAnalyzer(), true);
             writer.addDocument(doc);
             writer.close();
         } catch (IOException exp) {
             System.out.println("I/O Error!");
         }
     }

  11. Lucene Index: Behind the Scenes • Inverted Index (Inverted File) [Figure: Doc 1 (“Penn State Football …”) and Doc 2 (“Football players … State”) feed a posting table keyed by term, e.g. “football”.]
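
A minimal sketch of the inverted-index idea on this slide, in plain Java rather than Lucene: each term maps to the list of documents that contain it. The class `InvertedIndexSketch` and its methods are hypothetical, and the two sample documents are shortened stand-ins for the slide's Doc 1 and Doc 2 (the elided words are simply left out).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {
    // term -> ids of the documents containing the term, in insertion order
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Tokenizes at non-letters, lower-cases, and records docId once per term.
    // Assumes documents are added with increasing ids.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (term.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    // Returns the ids of documents containing the term (empty if none).
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new ArrayList<>());
    }

    // Document frequency: the number of documents containing the term.
    int documentFrequency(String term) {
        return lookup(term).size();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Penn State Football");
        index.add(2, "Football players State");
        System.out.println(index.lookup("football"));        // [1, 2]
        System.out.println(index.documentFrequency("penn")); // 1
    }
}
```

Looking up a query term costs one map access, regardless of how many documents do not contain it; this is the property the searcher relies on.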

  12. Posting table
     • The posting table is a fast look-up mechanism
       • Key: word
       • Value: posting id, satellite data (#df, offset, …)
     • Lucene implements the posting table with Java’s hash table
       • Objectified from java.util.Hashtable
       • Hash function depends on the JVM
         • hc2 = hc1 * 31 + nextChar
     • Posting table usage
       • Indexing: insertion (new terms), update (existing terms)
       • Searching: lookup, and construct document vector
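
The recurrence hc2 = hc1 * 31 + nextChar quoted above is the same one specified for `java.lang.String.hashCode()`. The sketch below applies it by hand and maps the result to a bucket; the class `TermHashSketch` and the bucket count are hypothetical and only illustrate how a hash table would place a term.

```java
final class TermHashSketch {
    // Applies hc2 = hc1 * 31 + nextChar over the characters of the term.
    // Integer overflow wraps around, exactly as in String.hashCode().
    static int hash(String term) {
        int hc = 0;
        for (int i = 0; i < term.length(); i++) {
            hc = hc * 31 + term.charAt(i);
        }
        return hc;
    }

    // Maps a term to a bucket index in [0, buckets); floorMod handles negative hashes.
    static int bucket(String term, int buckets) {
        return Math.floorMod(hash(term), buckets);
    }

    public static void main(String[] args) {
        for (String term : new String[] {"football", "penn", "players", "state"}) {
            System.out.println(term + " -> hash " + hash(term)
                    + ", bucket " + bucket(term, 16)
                    + ", matches String.hashCode(): " + (hash(term) == term.hashCode()));
        }
    }
}
```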

  13. Lucene Index Files: Field infos file (.fnm) 1, <content, 0x01>

  14. Lucene Index Files: Term Dictionary file (.tis) 4, <<0,football,1>,2> <<0,penn,1>,1> <<1,layers,1>,1> <<0,state,1>,2> • Document frequency can be obtained from this file.
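
The entry <<1,layers,1>,1> appears to be a prefix-coded form of "players": it shares a one-character prefix with the preceding term "penn", so only the suffix "layers" is written. That reading is an assumption drawn from the slide's data. The sketch below reproduces just the <prefix length, suffix> part for a sorted term list; the class `PrefixCodingSketch` is hypothetical and omits the field number and document frequency shown on the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PrefixCodingSketch {
    // Encodes each term as "<sharedPrefixLength,suffix>" relative to the previous term.
    // The input is assumed to be sorted, as in a term dictionary.
    static List<String> encode(List<String> sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int shared = 0;
            int max = Math.min(prev.length(), term.length());
            while (shared < max && prev.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add("<" + shared + "," + term.substring(shared) + ">");
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("football", "penn", "players", "state");
        System.out.println(encode(terms));
        // [<0,football>, <0,penn>, <1,layers>, <0,state>]
    }
}
```

Because neighbouring terms in a sorted dictionary often share long prefixes, storing only the differing suffix keeps the file small.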

  15. Lucene Index Files: Term Info index (.tii) 4, <football,1> <penn,3> <layers,2> <state,1>

  16. Lucene Index Files: Frequency file (.frq) <<2, 2, 3> <3> <5> <3, 3>> • Term frequency can be obtained from this file.

  17. Lucene Index Files: Position file (.prx) <<3, 64> <1>> <<1> <0>> <<0> <2>> <<2> <13>>

  18. Query Process in Lucene [Figure: a query is resolved through the Field info (in memory), the Term Info Index (in memory), the Term Dictionary (random file access), the Frequency File (random file access), and the Position File (random file access); each step is labeled “constant time”.]

  19. Search Lucene’s index (step 1)
     • Construct a query (automatic)

     import org.apache.lucene.analysis.standard.StandardAnalyzer;
     import org.apache.lucene.queryParser.ParseException;
     import org.apache.lucene.queryParser.QueryParser;
     import org.apache.lucene.search.Query;

     // field is the default field searched when the query string does not name one.
     public Query formQuery(String field, String querystring) throws ParseException {
         QueryParser qp = new QueryParser(field, new StandardAnalyzer());
         return qp.parse(querystring);
     }

  20. Search Lucene’s index (step 1)
     • Types of query:
       • Boolean: [IST441 Giles] [IST441 OR Giles] [java AND NOT SUN]
       • wildcard: [nu?ch] [nutc*]
       • phrase: [“JAVA TOMCAT”]
       • proximity: [“lucene nutch” ~10]
       • fuzzy: [roam~] matches roams and foam
       • date range
       • …
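
Fuzzy matching in Lucene is based on edit (Levenshtein) distance, which is why [roam~] can match both roams and foam: each is one edit away from roam. The sketch below computes that distance in plain Java. The class `EditDistanceSketch` is hypothetical and is not Lucene's implementation; it only shows the measure.

```java
final class EditDistanceSketch {
    // Levenshtein distance: the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("roam", "roams")); // 1 (one insertion)
        System.out.println(distance("roam", "foam"));  // 1 (one substitution)
    }
}
```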

  21. Search Lucene’s index (step 2)
     • Search the index

     import java.io.IOException;
     import org.apache.lucene.document.Document;
     import org.apache.lucene.search.*;
     import org.apache.lucene.store.*;

     public Hits searchIdx(String idxPath, Query query) throws IOException {
         // false: open the existing index instead of creating a new one
         Directory fsDir = FSDirectory.getDirectory(idxPath, false);
         IndexSearcher is = new IndexSearcher(fsDir);
         return is.search(query);
     }

  22. Search Lucene’s index (step 3)
     • Display the results

     for (int i = 0; i < hits.length(); i++) {
         Document doc = hits.doc(i);
         // show your results; assumes each document has a stored field named "id"
         System.out.println("id" + doc.get("id"));
     }

  23. Default Scoring Function
     • Similarity
       $$\mathrm{score}(Q,D) = \mathrm{coord}(Q,D) \cdot \mathrm{queryNorm}(Q) \cdot \sum_{t \in Q} \Big( \mathrm{tf}(t \text{ in } D) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(D) \Big)$$
     • Question: What type of IR model does Lucene use?
     • Factors
       • Term-based factors
         • tf(t in D): term frequency of term t in document D
           • default implementation: $\mathrm{tf}(t \text{ in } D) = \mathrm{frequency}^{1/2}$
         • idf(t): inverse document frequency of term t in the entire corpus
           • default implementation: $\mathrm{idf}(t) = \ln\big(N/(n_j+1)\big) + 1$

  24. Default Scoring Function
     • coord(Q,D) = overlap between Q and D / maximum overlap
       • Maximum overlap is the maximum possible length of overlap between Q and D
     • $\mathrm{queryNorm}(Q) = 1/(\text{sum of square weight})^{1/2}$
       • $\text{sum of square weight} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in Q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$
       • If t.getBoost() = 1 and q.getBoost() = 1,
       • then $\text{sum of square weight} = \sum_{t \in Q} \mathrm{idf}(t)^2$
       • thus $\mathrm{queryNorm}(Q) = 1/\big(\sum_{t \in Q} \mathrm{idf}(t)^2\big)^{1/2}$
     • $\mathrm{norm}(D) = 1/(\text{number of terms})^{1/2}$ (This normalizes by the total number of terms in a document. Number of terms is the total number of terms that appear in document D.)

  25. Example
     • D1: hello, please say hello to him.
     • D2: say goodbye
     • Q: you say hello
     • coord(Q, D) = overlap between Q and D / maximum overlap
       • coord(Q, D1) = 2/3, coord(Q, D2) = 1/2
     • $\mathrm{queryNorm}(Q) = 1/(\text{sum of square weight})^{1/2}$
       • $\text{sum of square weight} = q.\mathrm{getBoost}()^2 \cdot \sum_{t \in Q} \big( \mathrm{idf}(t) \cdot t.\mathrm{getBoost}() \big)^2$
       • t.getBoost() = 1, q.getBoost() = 1
       • $\text{sum of square weight} = \sum_{t \in Q} \mathrm{idf}(t)^2$
       • $\mathrm{queryNorm}(Q) = 1/(0.5945^2 + 1^2)^{1/2} = 0.8596$
     • $\mathrm{tf}(t \text{ in } D) = \mathrm{frequency}^{1/2}$
       • tf(you, D1) = 0, tf(say, D1) = 1, tf(hello, D1) = $2^{1/2}$ = 1.4142
       • tf(you, D2) = 0, tf(say, D2) = 1, tf(hello, D2) = 0
     • $\mathrm{idf}(t) = \ln\big(N/(n_j+1)\big) + 1$
       • idf(you) = 0, idf(say) = ln(2/(2+1)) + 1 = 0.5945, idf(hello) = ln(2/(1+1)) + 1 = 1
     • $\mathrm{norm}(D) = 1/(\text{number of terms})^{1/2}$
       • norm(D1) = $1/6^{1/2}$ = 0.4082, norm(D2) = $1/2^{1/2}$ = 0.7071
     • $\mathrm{Score}(Q, D1) = 2/3 \cdot 0.8596 \cdot (1 \cdot 0.5945^2 + 1.4142 \cdot 1^2) \cdot 0.4082 = 0.4135$
     • $\mathrm{Score}(Q, D2) = 1/2 \cdot 0.8596 \cdot (1 \cdot 0.5945^2) \cdot 0.7071 = 0.1074$
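
The arithmetic of this example can be checked with a short plain-Java sketch. It follows the slide's definitions rather than calling Lucene, and it rests on several assumptions: all boosts are 1, a query term that occurs in no document gets idf 0 (as the slide does for "you"), and "maximum overlap" is taken as the smaller of the number of distinct query terms and distinct document terms, which reproduces the slide's 2/3 and 1/2. The class `ScoringSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ScoringSketch {
    // Splits text at non-letters and lower-cases it.
    static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // idf(t) = ln(N / (n_t + 1)) + 1; assumption from the slide: idf is 0 when n_t is 0.
    static double idf(String term, List<List<String>> corpus) {
        int df = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) df++;
        }
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / (df + 1)) + 1.0;
    }

    // score = coord * queryNorm * sum over query terms of (tf * idf^2) * norm, all boosts = 1.
    static double score(String query, String doc, List<String> corpusTexts) {
        List<List<String>> corpus = new ArrayList<>();
        for (String text : corpusTexts) corpus.add(terms(text));
        List<String> docTerms = terms(doc);
        Set<String> queryTerms = new LinkedHashSet<>(terms(query));

        double sumOfSquares = 0.0; // sum of idf(t)^2 over the query terms
        double weighted = 0.0;     // sum of tf(t in D) * idf(t)^2 over the query terms
        int overlap = 0;           // number of query terms found in the document
        for (String t : queryTerms) {
            double idf = idf(t, corpus);
            int freq = Collections.frequency(docTerms, t);
            if (freq > 0) overlap++;
            sumOfSquares += idf * idf;
            weighted += Math.sqrt(freq) * idf * idf;
        }
        // Assumption: maximum overlap = min(distinct query terms, distinct document terms).
        int maxOverlap = Math.min(queryTerms.size(), new HashSet<>(docTerms).size());
        double coord = (double) overlap / maxOverlap;
        double queryNorm = 1.0 / Math.sqrt(sumOfSquares);
        double norm = 1.0 / Math.sqrt(docTerms.size());
        return coord * queryNorm * weighted * norm;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList("hello, please say hello to him.", "say goodbye");
        String q = "you say hello";
        System.out.println(String.format(Locale.ROOT, "%.4f", score(q, corpus.get(0), corpus))); // 0.4135
        System.out.println(String.format(Locale.ROOT, "%.4f", score(q, corpus.get(1), corpus))); // 0.1074
    }
}
```

D1 scores higher than D2 because it matches two of the three query terms and contains "hello" twice, even though its length normalization is smaller.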
