Analyzing Document Retrievability in Patent Retrieval Settings. Shariq Bashir, and Andreas Rauber DEXA 2009, Linz, Austria, 31 August – 4 September. Department of Software Technology and Interactive Systems Vienna University of Technology, Austria {bashir, rauber}@ifs.tuwien.ac.at.
Motivation
Patent retrieval is an emerging and challenging area. Patents are legal documents used to protect inventions.
• Patents are complex
• Patents are long documents.
• They contain complex vocabulary.
• They contain complex structure and technical content.
• Patent writers often intentionally use vague words and expressions so that their patents pass examination.
• This creates serious word mismatch problems.
• Relevant patents may not be findable with their relevant queries.
• Users (attorneys, patent examiners) mostly use hundreds of queries for
• Patent retrieval is different from web retrieval
• Patent retrieval is a recall-oriented domain.
• Finding all relevant patents is considered more important than finding only a small set of top relevant patents.
• Example: a single prior-art patent can invalidate a new patent application, but can we find such a patent with a given retrieval model?
Motivation
• Role of the retrieval system in accessing information
• There is always debate about the quality of user queries.
• Rather than arguing about the quality of user queries, in this paper we examine the role of retrieval systems in accessing information.
• Can we access all information using a given retrieval model?
• How much does a retrieval system's bias restrict our access to information?
• Are there subsets of a given collection that cannot be found?
• How easily can we find information in a given retrieval system?
Document Retrievability (aka Findability)
• We measure the effectiveness of retrieval systems using a findability measure.
• Findability measure
• Measures how easily a retrieval model can find all documents.
• Findability is measured over the top c results (e.g. c = 35, c = 80).
• It can show which retrieval system is better for finding patents.
• It can identify high- and low-findable subsets of the collection.
• It can identify non-findable subsets of the collection.
Computing Findability Measure
Given a collection of documents D and a large set of queries Q, the findability r(d) of a document d ∈ D is the number of queries in Q that return d in their top-c results.
Example: if document d1 is findable in the top-c results of query q1, then q1 contributes 1 to the findability score r(d1).
k_dq is the rank of d ∈ D in the result list of query q ∈ Q.
f(k_dq, c) returns 1 if k_dq ≤ c, and 0 otherwise.
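Read as a formula, the definition above amounts to r(d) = Σ_{q ∈ Q} f(k_dq, c), assuming every query counts equally. The following is a minimal sketch of that count, not the authors' implementation. The input format (a dict from query to ranked document ids), the function name `findability`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def findability(rankings, c):
    """Count, for each document, how many queries return it in the top c.

    rankings: dict mapping a query to its ranked list of document ids,
              best first, with no duplicates (assumed input format).
    c:        cut-off rank.
    """
    r = Counter()
    for ranked_docs in rankings.values():
        # f(k_dq, c) is 1 exactly for the documents at ranks 1..c
        for doc_id in ranked_docs[:c]:
            r[doc_id] += 1
    return r


# Toy rankings, invented for illustration only
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d2", "d1", "d3"],
    "q3": ["d2", "d3", "d1"],
}
scores = findability(rankings, c=2)
```

A document that never appears within the cut-off simply has a score of 0, which is how non-findable subsets would show up in such a count.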
Our Contribution
• Findability is measured with a single score across all queries.
• We consider the relevance of queries, analyzing:
• Findability across all queries
• Findability considering only queries that the document is relevant for
• Findability for queries that a document is NOT relevant for
• Characteristics of high- and low-findable documents
• To what extent we can increase the findability of documents
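One way to picture the split between relevant and irrelevant queries is to keep two top-c counts per document instead of one. The sketch below is an illustration under assumptions, not the authors' code: the function name `split_findability`, the representation of relevance as a set of (query, document) pairs, and the toy data are hypothetical.

```python
def split_findability(rankings, relevant, c):
    """Return two dicts of top-c counts per document.

    rankings: dict mapping a query to its ranked list of document ids (best first).
    relevant: set of (query, doc_id) pairs for which the query is relevant
              to the document (assumed representation).
    c:        cut-off rank.
    """
    rel_hits, irr_hits = {}, {}
    for query, ranked_docs in rankings.items():
        for doc_id in ranked_docs[:c]:
            # Credit the hit to the relevant or the irrelevant counter
            target = rel_hits if (query, doc_id) in relevant else irr_hits
            target[doc_id] = target.get(doc_id, 0) + 1
    return rel_hits, irr_hits


# Toy data, invented for illustration only
toy_rankings = {
    "q1": ["d1", "d2"],
    "q2": ["d2", "d1"],
    "q3": ["d2", "d3"],
}
toy_relevant = {("q1", "d1"), ("q2", "d2")}
rel_hits, irr_hits = split_findability(toy_rankings, toy_relevant, c=1)
```

Summing the two counts for a document gives back its findability across all queries.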
Experiment Setup
• Retrieval models used
• TFIDF, BM25, BM25F, Exact Match
• Patents from the US Patent and Trademark Office website http://www.uspto.gov
• USPC class 433 - Dentistry domain
• For query generation, we used only the Claim section
• For indexing and searching, we used all sections
• Title, Abstract, Claim, Background Summary, Description, Captions
• We used a cut-off rank of c = 35.
Query Generation
• Queries are based on a patent invalidity search scenario
• Extract all single terms with term frequency > 2 in the claim section of individual patents
• Single terms are expanded into two- and three-term combinations
• A query is considered relevant for a patent if all its terms appear at least 3 times in the document
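The query generation steps above can be sketched roughly as follows. This is an illustration under assumptions rather than the procedure actually used: the tokenizer is a naive lowercase word split, no stop-word removal or stemming is applied (the slides do not say whether either was used), and the names `tokenize`, `generate_queries`, and `is_relevant` are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Naive tokenizer (assumption): lowercase alphabetic words only
    return re.findall(r"[a-z]+", text.lower())


def generate_queries(claim_text, min_tf=3, max_terms=3):
    """Build one-, two- and three-term queries from a claim section.

    Keeps single terms whose frequency in the claim text is > 2
    (i.e. at least min_tf = 3), then expands them into combinations.
    """
    counts = Counter(tokenize(claim_text))
    terms = sorted(t for t, n in counts.items() if n >= min_tf)
    queries = []
    for size in range(1, max_terms + 1):
        queries.extend(combinations(terms, size))
    return queries


def is_relevant(query, document_text, min_occurrences=3):
    """A query is relevant if every term occurs at least 3 times in the document."""
    counts = Counter(tokenize(document_text))
    return all(counts[t] >= min_occurrences for t in query)


# Toy claim text, invented for illustration only
claim = "dental implant screw dental implant screw dental implant abutment"
queries = generate_queries(claim)
```

Note that the number of combinations grows quickly with the number of frequent terms, which is consistent with the slides' reference to a large query set Q.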
Conclusion
• We analyze patent retrieval with a findability measure.
• We differentiate findability using relevant and irrelevant queries.
• Our results indicate that
• With well-known retrieval models, some patents cannot be found in the top-c results.
• Large retrieval patents are more findable from irrelevant queries than from relevant queries.
• There is a lot of noise in the top-c results of queries.
• Future Work
• To handle word mismatch, we need an efficient query expansion technique.
• Individual patents have different findability scores in different retrieval models.
• Example: patents that are low findable in Model A are high findable in Model B.
• We need an efficient fusion technique.