
Using Semantic Relations to Improve Information Retrieval



  1. Using Semantic Relations to Improve Information Retrieval Tom Morton

  2. Introduction • NLP techniques have been largely unsuccessful at information retrieval. Why? • Document retrieval has been the primary measure of information retrieval success. • Document retrieval reduces the need for NLP techniques. • Discourse factors can be ignored. • Query words perform word-sense disambiguation. • Lack of robustness: • NLP techniques are typically not as robust as word indexing.

  3. Introduction • Paragraph retrieval for natural-language questions. • Paragraphs can be influenced by discourse factors. • Correctness of answers to natural language questions can be accurately determined automatically. • Standard precursor to TREC question answering task. • What NLP technologies might help at this information retrieval task and are they robust enough?

  4. Introduction • Question Analysis: • Questions tend to specify the semantic type of their answer. This component tries to identify this type. • Named-Entity Detection: • Named-entity detection determines the semantic type of proper nouns and numeric amounts in text.

  5. Introduction • Question Analysis: • The category predicted is appended to the question. • Named-Entity Detection: • The NE categories found in text are included as new terms. • This approach requires additional question terms to be in the paragraph. What party is John Major in? (ORGANIZATION) It probably won't be clear for some time whether the Conservative Party has chosen in John Major a truly worthy successor to Margaret Thatcher, who has been a giant on the world stage. +ORGANIZATION +PERSON
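To make the term-expansion idea above concrete, here is a minimal sketch (not the thesis code) of adding one +CATEGORY term per named-entity category found in a paragraph and appending the predicted answer category to the question; the tokenizer and the (surface, category) mention format are assumptions of the sketch.

```python
import re

def tokenize(text):
    """Lowercased word tokens; a stand-in for the real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def expand_paragraph_terms(paragraph_text, entity_mentions):
    """Index terms = paragraph words plus one +CATEGORY term per NE category found.

    entity_mentions: list of (surface_string, category) pairs produced by an
    NE detector (hypothetical input format for this sketch)."""
    terms = set(tokenize(paragraph_text))
    terms |= {f"+{category}" for _, category in entity_mentions}
    return terms

def expand_question_terms(question_text, predicted_category):
    """Query terms = question words plus the predicted answer category."""
    return set(tokenize(question_text)) | {f"+{predicted_category}"}

# Example from the slide (NE categories supplied by hand here):
paragraph = ("It probably won't be clear for some time whether the Conservative "
             "Party has chosen in John Major a truly worthy successor to "
             "Margaret Thatcher, who has been a giant on the world stage.")
mentions = [("Conservative Party", "ORGANIZATION"),
            ("John Major", "PERSON"),
            ("Margaret Thatcher", "PERSON")]
p_terms = expand_paragraph_terms(paragraph, mentions)
q_terms = expand_question_terms("What party is John Major in?", "ORGANIZATION")
print(q_terms & p_terms)   # the +ORGANIZATION term now matches
```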

  6. Introduction • Coreference Relations: • Interpretation of a paragraph may depend on the context in which it occurs. • Syntactically-based Categorical Relation Extraction: • Appositive and predicate nominative constructions provide descriptive terms about entities.

  7. Introduction • Coreference: • Use coreference relationships to introduce new terms referred to but not present in the paragraph’s text. How long was Margaret Thatcher the prime minister? (DURATION) The truth, which has been added to over each of her 11 1/2 years in power, is that they don't make many like her anymore. +MARGARET +THATCHER +PRIME +MINISTER +DURATION
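A rough sketch of the coreference-based expansion described above, assuming coreference chains are available as lists of (paragraph id, mention text) pairs; this representation and the upper-cased +TERM convention are illustrative, not the thesis's data structures.

```python
from collections import defaultdict

def coreference_expansion(paragraph_words, chains):
    """Add words from coreferent mentions that occur outside each paragraph.

    paragraph_words: dict paragraph_id -> set of index terms
    chains: list of coreference chains, each a list of (paragraph_id, mention_text)
            pairs (hypothetical representation for this sketch).
    Returns a dict of extra terms to add per paragraph."""
    extra = defaultdict(set)
    for chain in chains:
        # All words used anywhere in the chain to mention this entity.
        chain_words = {w.upper() for _, mention in chain for w in mention.split()}
        for pid, _ in chain:
            # Terms referred to by the paragraph but not literally present in it.
            present = paragraph_words.get(pid, set())
            extra[pid] |= {f"+{w}" for w in chain_words if w.lower() not in present}
    return extra

# Example: the "her ... 11 1/2 years in power" paragraph gains +MARGARET +THATCHER ...
paras = {7: {"the", "truth", "her", "11", "years", "in", "power"}}
chains = [[(3, "Margaret Thatcher the prime minister"), (7, "her")]]
print(coreference_expansion(paras, chains)[7])
```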

  8. Introduction • Categorical Relation Extraction • Identifies DESCRIPTION category. • Allows descriptive terms to be used in term expansion. Who is Frank Lloyd Wright? (DESCRIPTION) What architect designed Robie House? (PERSON) Famed architect Frank Lloyd Wright… +DESCRIPTION Buildings he designed include the Guggenheim Museum in New York and Robie House in Chicago. +FRANK +LLOYD +WRIGHT +FAMED +ARCHITECT

  9. Introduction • Search engine architecture (diagram): • Indexing: Documents → Pre-processing → NE Detection → Coreference Resolution → Categorical Relation Extraction → expanded paragraphs (Paragraphs+) are indexed. • Retrieval: Question → Question Analysis → search of the paragraph index → Paragraphs.

  10. Introduction • Will these semantic relations improve paragraph retrieval? • Are the implementations robust enough to see a benefit across large document collections and question sets? • Are there enough questions where these relationships are required to find an answer? • Questions need only be answered once. • Short Answer: Yes!

  11. Overview • Introduction • Pre-processing • Named-Entity Detection • Coreference • Categorical Relation Extraction • Question Analysis • Paragraph Retrieval • Conclusion • Proposed Work

  12. Preprocessing • Paragraph Detection • Sentence Detection • Tokenization • POS Tagging • NP-Chunking

  13. Preprocessing • Paragraph finding: • Explicitly marked: • Newline, <p>, blank line, etc. • Implicitly marked: • What is the column width of this document? • Would this capitalized, likely sentence-initial word fit on the previous line? • Sentence Detection: • Is this [.?!] the end of a sentence? • Use software developed in Reynar & Ratnaparkhi 97.
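A small sketch of the "explicitly marked" paragraph case and a toy sentence splitter; the column-width heuristic and the actual Reynar & Ratnaparkhi models are not reproduced here.

```python
import re

def split_paragraphs(text):
    """Split on explicit markers: blank lines or HTML <p> tags.

    Only the 'explicitly marked' case from the slide is handled; the
    column-width heuristic for hard-wrapped text is omitted."""
    chunks = re.split(r"(?:\n\s*\n)|(?:</?p>)", text, flags=re.IGNORECASE)
    return [c.strip() for c in chunks if c and c.strip()]

def naive_sentence_split(paragraph):
    """Toy stand-in for a trained sentence detector: treat [.?!] followed by
    whitespace and a capital letter as a sentence boundary."""
    return re.split(r"(?<=[.?!])\s+(?=[A-Z])", paragraph)

doc = "First paragraph. It has two sentences.\n\nSecond paragraph here."
for p in split_paragraphs(doc):
    print(naive_sentence_split(p))
```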

  14. Preprocessing • Tokenization: • Are there additional tokens in this initial space-delimited set of tokens? • Use techniques described in Reynar 98. • POS Tagging: • Use software developed in Ratnaparkhi 96.

  15. Preprocessing • NP-Chunking • Developed a maxent tagging model where each token is assigned one of the tags: • Start-NP, Continue-NP, Other • Software is very similar to the POS tagger. • Performance was evaluated to be at or near state-of-the-art.
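The tagging scheme above can be decoded into NP chunks as in the following sketch; the maxent model that assigns the tags is assumed to have already run.

```python
def decode_np_chunks(tokens, tags):
    """Turn per-token Start-NP / Continue-NP / Other tags into NP spans.

    Only the decoding step from tag sequence to chunks is shown; the tagging
    model itself is not reproduced."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "Start-NP":
            if current:
                chunks.append(current)
            current = [token]
        elif tag == "Continue-NP" and current:
            current.append(token)
        else:                      # "Other", or a stray Continue-NP
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

tokens = ["Famed", "architect", "Frank", "Lloyd", "Wright", "designed", "Robie", "House"]
tags   = ["Start-NP", "Continue-NP", "Continue-NP", "Continue-NP", "Continue-NP",
          "Other", "Start-NP", "Continue-NP"]
print(decode_np_chunks(tokens, tags))   # ['Famed architect Frank Lloyd Wright', 'Robie House']
```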

  16. Preprocessing • Producing Robust Components • Sentence, tokenization, and POS-tagging components were all retrained: • Added small samples of texts from the paragraph retrieval domains to the WSJ-based training data. • Allowed components to deal with editorial conventions which differed from the Wall Street Journal.

  17. Overview • Introduction • Pre-processing • Named-Entity Detection • Coreference • Categorical Relation Extraction • Question Analysis • Paragraph Retrieval • Conclusion • Proposed Work

  18. Named-Entity Detection • Task • Approach 1 • Approach 2

  19. Named-Entity Detection • Task: • Identify the following categories: • Person, Location, Organization, Money, Percentage, Time Point. • Approach 1: • Use an existing NE-detector. • Performance on some genres of text was poor. • Couldn’t add new categories. • Couldn’t retrain the classifier.

  20. Named-Entity Detection • Approach 2: • Train a maxent classifier on the output of an existing NE-detector. • Used BBN’s MUC NE tagger (Bikel et al. 1997) to create a corpus. • Combined Time and Date tags to create “Time Point” category. • Added a small sample of tagged text from the paragraph retrieval domains. • Constructed rule-based models for additional categories. • Distance and Amount
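The rule-based Distance and Amount models might look roughly like the sketch below; the unit lists and patterns are illustrative stand-ins, not the rules actually used.

```python
import re

# Illustrative unit lists; the actual rule sets are not reproduced here.
DISTANCE_UNITS = r"(?:miles?|kilometers?|km|meters?|feet|foot|inches?|yards?)"
AMOUNT_UNITS   = r"(?:tons?|pounds?|gallons?|liters?|barrels?|acres?)"
NUMBER         = r"\d+(?:[.,]\d+)*(?:\s+(?:hundred|thousand|million|billion))?"

DISTANCE_RE = re.compile(rf"\b{NUMBER}\s+{DISTANCE_UNITS}\b", re.IGNORECASE)
AMOUNT_RE   = re.compile(rf"\b{NUMBER}\s+{AMOUNT_UNITS}\b", re.IGNORECASE)

def tag_numeric_entities(text):
    """Return (span_text, category) pairs for rule-matched Distance/Amount mentions."""
    found = [(m.group(), "DISTANCE") for m in DISTANCE_RE.finditer(text)]
    found += [(m.group(), "AMOUNT") for m in AMOUNT_RE.finditer(text)]
    return found

print(tag_numeric_entities(
    "Mount Shasta rises about 14,179 feet; the spill was 11 million gallons."))
```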

  21. Overview • Introduction • Pre-processing • Named-Entity Detection • Coreference • Categorical Relation Extraction • Question Analysis • Paragraph Retrieval • Conclusion • Proposed Work

  22. Coreference • Task • Approach • Results • Related Work

  23. Coreference • Task: • Determine space of entity extents: • Basal noun phrases: • Named entities consisting of multiple basal noun phrases are treated as a single entity. • Pre-nominal proper nouns. • Possessive pronouns. • Determine which extents refer to the same entity in the world.

  24. Coreference • Approach (Morton 2000) • Divide referring expressions into three classes • Singular third person pronouns. • Proper nouns. • Definite noun phrases. • Create separate resolution approach for each class. • Apply resolution approaches to text in an interleaved fashion.

  25. Coreference • Singular Third Person Pronouns • Compare the pronoun to each entity in the current sentence and the previous two sentences. • Compute argmax_i p(coref | pronoun, entity_i) using maxent model. • Compute p(nonref | pronoun) using maxent model. • If p(coref_i) > p(nonref) then resolve pronoun.
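The decision rule on this slide, written out as a sketch; p_coref and p_nonref stand in for the two maxent models, and filtering candidates to the current and previous two sentences is assumed to have happened already.

```python
def resolve_pronoun(pronoun, candidate_entities, p_coref, p_nonref):
    """Resolve only if the best coreference probability beats the
    non-referential probability; otherwise leave the pronoun unresolved."""
    if not candidate_entities:
        return None
    best = max(candidate_entities, key=lambda e: p_coref(pronoun, e))
    if p_coref(pronoun, best) > p_nonref(pronoun):
        return best            # resolve the pronoun to this entity
    return None                # treat the pronoun as non-referential

# Toy stand-ins for the trained models:
scores = {"Margaret Thatcher": 0.70, "John Major": 0.20, "The Conservative Party": 0.05}
antecedent = resolve_pronoun(
    "she",
    list(scores),
    p_coref=lambda pron, ent: scores[ent],
    p_nonref=lambda pron: 0.10)
print(antecedent)   # Margaret Thatcher
```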

  26. Coreference • Example (diagram): the pronoun “she” is compared against candidate entities such as {John Major, a truly worthy…}, {Margaret Thatcher, her, …}, {The Conservative Party}, {the undoubted exception}, and {Winston Churchill}, each assigned a model probability, with the {Margaret Thatcher, her, …} entity scoring highest (70%). • Pronoun is resolved to entity rather than most recent extent.

  27. Coreference • Classifier Features: • Distance: • in NPs, Sentences, Left-To-Right, Right-To-Left • Syntactic Context: • NP’s position in sentence. • NP’s surrounding context. • Pronoun’s syntactic context. • Salience: • Number of times the entity has been mentioned. • Gender: • Pairings of the pronoun’s gender and the lexical items in entity.

  28. Coreference • Proper Nouns: • Remove honorifics, corporate designators, determiners, and pre-nominal appositives. • Compare the proper noun to each entity preceding it. • Resolve it to the first preceding proper noun extent for which this proper noun is a substring (observing word boundaries). • Bob Smith <- Mr. Smith <- Bob <- Smith
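A sketch of the proper-noun resolution step: strip honorifics, designators, and determiners, then resolve to the first preceding extent of which the mention is a word-boundary substring. The stripped-word lists here are illustrative.

```python
# Illustrative lists; the actual honorific/designator lists are not reproduced here.
HONORIFICS  = {"mr.", "mrs.", "ms.", "dr.", "mr", "mrs", "ms", "dr"}
DESIGNATORS = {"inc.", "corp.", "co.", "ltd.", "inc", "corp", "co", "ltd"}
DETERMINERS = {"the", "a", "an"}

def normalize(proper_noun):
    words = proper_noun.lower().split()
    return [w for w in words if w not in HONORIFICS | DESIGNATORS | DETERMINERS]

def is_word_substring(needle, haystack):
    """True if needle occurs in haystack as a contiguous run of whole words."""
    n = len(needle)
    return n > 0 and any(haystack[i:i + n] == needle
                         for i in range(len(haystack) - n + 1))

def resolve_proper_noun(mention, preceding_proper_nouns):
    """Resolve to the first preceding proper-noun extent (assumed ordered
    nearest-first here) of which this mention is a word-boundary substring."""
    target = normalize(mention)
    for previous in preceding_proper_nouns:
        if is_word_substring(target, normalize(previous)):
            return previous
    return None

# "Mr. Smith" resolves back to "Bob Smith":
print(resolve_proper_noun("Mr. Smith", ["Robie House", "Bob Smith"]))
```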

  29. Coreference • Definite Noun Phrases • Remove determiners. • Resolve to first entity which shares the same head word and modifiers. • the big mean man <- the big man <- the man.
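A sketch of the definite-NP step: strip determiners, then resolve to the first prior entity with the same head word whose modifiers cover the mention's modifiers. Treating the last word as the head and using a subset test for modifier compatibility are simplifying assumptions of this sketch.

```python
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}

def head_and_modifiers(noun_phrase):
    """Strip determiners; treat the last word as the head, the rest as modifiers.
    (A real system would use the parsed NP; this is a simplification.)"""
    words = [w.lower() for w in noun_phrase.split() if w.lower() not in DETERMINERS]
    return words[-1], set(words[:-1])

def resolve_definite_np(mention, preceding_entities):
    """Resolve to the first prior entity with the same head word whose
    modifiers include all of the mention's modifiers."""
    head, mods = head_and_modifiers(mention)
    for candidate in preceding_entities:
        c_head, c_mods = head_and_modifiers(candidate)
        if head == c_head and mods <= c_mods:
            return candidate
    return None

# "the big man" resolves back to "the big mean man":
print(resolve_definite_np("the big man", ["the big mean man", "the tall woman"]))
```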

  30. Coreference • Results: • Trained pronominal model on 200 WSJ documents with only pronouns annotated. • Interleaved with other resolution approaches to compute mention statistics. • Evaluated using 10-fold cross validation. • P 94.4%, R 76.0%, F 84.2%.

  31. Coreference • Results: • Evaluated the proper noun and definite noun phrase approaches on 80 hand annotated WSJ files. • Proper Nouns P 92.1%, R 88.0%, F 90.0%. • Definite NPs P 82.5%, R 47.4%, F 60.2%. • Combined Evaluation: • MUC6 Coreference Task: • Annotation guidelines are not identical. • Ignored headline and dateline coreference. • Included appositives and predicate nominatives. • P 79.6%, R 44.5%, F 57.1%.

  32. Coreference • Related Work • Ge et al. 1998: • Presents similar statistical treatment. • Assumes non-referential pronouns are pre-marked. • Assumes mention statistics are pre-computed. • Soon et al. 2001: • Targets MUC Tasks. • P 65.5-67.3%, R 56.1-58.3%, F 60.4-62.6%. • Ng and Cardie 2002: • Targets MUC Tasks. • P 70.8-78.0%, R 55.7-64.2%, F 63.1-70.4%. • Our approach favors precision over recall: • Coreference relationships are used in passage retrieval.

  33. Overview • Introduction • Pre-processing • Named-Entity Detection • Coreference • Categorical Relation Extraction • Question Analysis • Paragraph Retrieval • Conclusion • Proposed Work

  34. Categorical Relation Extraction • Task • Approach • Results • Related Work

  35. Categorical Relation Extraction • Task • Identify whether a categorical relation exists between NPs in the following contexts: • Appositives: NP,NP. • Predicate Nominatives: NP copula NP. • Pre-nominal appositives: • (NP (SNP Japanese automaker) Mazda Motor Corp.)

  36. Categorical Relation Extraction • Approach: • Appositives and predicate nominatives: • Create a single binary maxent classifier to determine when NPs in the appropriate syntactic context express a categorical relationship. • Pre-nominal appositives: • Create a maxent classifier to determine where the split exists between the appositive and the rest of the noun phrase. • Use the lexical and POS-based features of noun phrases. • Use word/POS pair features. • Differentiate between head and modifier words. • The pre-nominal appositive classifier also uses a word's presence on a list of 69 titles as a feature.
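One way the pre-nominal appositive split classifier's features could be assembled is sketched below; the exact feature templates and the 69-title list are not reproduced, so the feature names and title list here are illustrative only.

```python
# Illustrative title list; the real system uses a list of 69 titles.
TITLES = {"president", "senator", "gen.", "dr.", "mr.", "mrs."}

def split_point_features(words, tags, split_index):
    """Features for one candidate split between a pre-nominal appositive
    (words[:split_index]) and the rest of the NP (words[split_index:]).
    Word/POS pair features, with the appositive head (last word before the
    split) distinguished from its modifiers, plus a title-list feature."""
    appositive, rest = words[:split_index], words[split_index:]
    feats = [f"apposHead={appositive[-1].lower()}/{tags[split_index - 1]}"]
    feats += [f"apposMod={w.lower()}/{t}"
              for w, t in zip(appositive[:-1], tags[:split_index - 1])]
    feats += [f"rest={w.lower()}/{t}" for w, t in zip(rest, tags[split_index:])]
    if any(w.lower() in TITLES for w in appositive):
        feats.append("containsTitle=true")
    return feats

# "(SNP Japanese automaker) Mazda Motor Corp." with the split after "automaker":
print(split_point_features(
    ["Japanese", "automaker", "Mazda", "Motor", "Corp."],
    ["JJ", "NN", "NNP", "NNP", "NNP"], split_index=2))
```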

  37. Categorical Relation Extraction • Results • Appositives and predicate nominatives: • Training - 1000/1200 examples • Test - 3-fold cross validation • Appositive - P 90.9% R 79.1% F 84.6%. • Predicate Nominatives – P 78.8% R 74.4% F 76.5%. • Pre-nominal appositives: • Training - 2000 examples • Used active learning to select new examples for annotation (884 positive). • Test - 1500 examples (81 positive) • P 98.6% R 85.2% F 91.4%.

  38. Categorical Relation Extraction • Related Work • Soon et al. (2001) defines a specific feature to identify appositive constructions. • Hovy et al. (2001) uses syntactic patterns to identify “DEFINITION” and “WHY FAMOUS” types. • Our work is unique in: • Its statistical treatment of extracting categorical relations. • Its use of categorical relations for term expansion in paragraph indexing.

  39. Overview • Introduction • Pre-processing • Named-Entity Detection • Coreference • Categorical Relation Extraction • Question Analysis • Paragraph Retrieval • Conclusion • Proposed Work

  40. Question Analysis • Task • Approach • Results • Related Work

  41. Question Analysis • Task • Map natural language questions onto one of the following categories: • Person, Location, Organization, Time Point, Duration, Money, Percentage, Distance, Amount, Description, Other • Where is West Point Military Academy? (Location) • When was ice cream invented? (Time Point) • How high is Mount Shasta? (Distance)

  42. Question Analysis • Approach • Identify Question Word: • Who, What, When, Where, Why, Which, Whom, How (JJ|RB)*, Name. • Identify Focus Noun • Noun phrase which specifies the type of the answer. • Use a series of syntactic patterns to identify it. • Train maxent classifier to predict which category the answer falls into.

  43. Question Analysis • Focus Noun Syntactic Patterns • Who copula (np) • What copula* (np) • Which copula (np) • Which of (np) • How (JJ|RB) (np) • Name of (np)
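The question-word and focus-noun patterns could be approximated over raw question strings as in the sketch below; the real system works over NP chunks, so these regexes are only an illustration.

```python
import re

QUESTION_WORD_RE = re.compile(
    r"^(who|what|when|where|why|which|whom|how\s+\w+|name)\b", re.IGNORECASE)

# Regex approximations of the focus-noun patterns on the slide.
FOCUS_PATTERNS = [
    re.compile(r"^(?:who|what|which)\s+(?:is|are|was|were)\s+(?:the\s+)?([\w\s]+?)[\s?]*$", re.I),
    re.compile(r"^which\s+(\w+)", re.I),                     # Which poet ...
    re.compile(r"^how\s+\w+\s+is\s+(?:the\s+)?([\w\s]+?)[\s?]*$", re.I),
    re.compile(r"^name\s+(?:of\s+)?(?:the\s+)?([\w\s]+?)[\s?]*$", re.I),
]

def analyze_question(question):
    """Return (question word, focus noun or None) for a question string."""
    text = question.strip()
    qw = QUESTION_WORD_RE.match(text)
    question_word = qw.group(1).lower() if qw else None
    for pattern in FOCUS_PATTERNS:
        m = pattern.match(text)
        if m:
            return question_word, m.group(1).strip()
    return question_word, None

print(analyze_question("Which poet was born in 1572?"))
print(analyze_question("How high is Mount Shasta?"))
```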

  44. Question Analysis • Classifier Features • Lexical Features • Question word, matrix verb, head noun of focus noun phrase, modifiers of the focus noun. • Word-class features • WordNet synsets and entry number of the focus noun. • Location of the focus noun. • Is it the last NP? • Who is (NP-Focus Colin Powell )?

  45. Question Analysis • Question: • Which poet was born in 1572 and appointed Dean of St. Paul's Cathedral in 1621? • Features: • def qw=which verb=which_was rw=was rw=born rw=in rw=1572 rw=and rw=appointed rw=Dean rw=of rw=St rw=. rw=Paul rw='s rw=Cathedral rw=in rw=1621 rw=? hw=poet ht=NN s0=poet1 s0=writer1 s0=communicator1 s0=person1 s0=life_form1 s0=causal_agent1 s0=entity1 fnIsLast=false
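A sketch that rebuilds a feature list in the style of the example above (qw=, verb=, rw=, hw=, ht=, s0=, fnIsLast=); the WordNet lookup is replaced by a caller-supplied sense list, and the exact templates are approximations.

```python
def question_features(question_word, tokens, focus_head, focus_head_tag,
                      hypernym_senses, focus_is_last_np, matrix_verb):
    """Build a feature list in the style of slide 45.

    hypernym_senses: pre-computed WordNet hypernym senses of the focus noun
    (e.g. ['poet1', 'writer1', ...]); the real WordNet lookup is omitted here."""
    features = [f"qw={question_word}", f"verb={question_word}_{matrix_verb}"]
    features += [f"rw={t}" for t in tokens]              # remaining question words
    features += [f"hw={focus_head}", f"ht={focus_head_tag}"]
    features += [f"s0={s}" for s in hypernym_senses]     # word-class features
    features.append(f"fnIsLast={'true' if focus_is_last_np else 'false'}")
    return features

feats = question_features(
    question_word="which",
    tokens="was born in 1572 and appointed Dean of St . Paul 's Cathedral in 1621 ?".split(),
    focus_head="poet", focus_head_tag="NN",
    hypernym_senses=["poet1", "writer1", "communicator1", "person1"],
    focus_is_last_np=False,
    matrix_verb="was")
print(" ".join(feats))
```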

  46. Question Analysis • Results: • Training: • 1888 hand-tagged examples from web-logs and web searches. • Test: • TREC8 Questions – 89.0%. • TREC9 Questions – 76.6%.

  47. Question Analysis • Related Work • Ittycheriah et al. 2001: • Similar: • Uses maximum entropy model. • Uses focus nouns and WordNet. • Differs: • Assumes first NP is the focus noun. • 3300 annotated questions. • Uses MUC NE categories plus PHRASE and REASON. • Uses feature selection with held-out data.

  48. Overview • Introduction • Pre-processing • Named-Entity Detection • Coreference • Categorical Relation Extraction • Question Analysis • Paragraph Retrieval • Conclusion • Proposed Work

  49. Paragraph Retrieval • Task • Approach • Results • Related Work

  50. Paragraph Retrieval • Task • Given a natural language question: • TREC-9 question collection. • A collection of documents: • ~1M documents: • AP, LA Times, WSJ, Financial Times, FBIS, and SJM. • Return a paragraph which answers the question. • Used TREC-9 answer patterns to evaluate.
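Evaluation with TREC answer patterns can be sketched as regex matching over the returned paragraphs; the pattern below is made up for illustration, and mean reciprocal rank is used here only as one convenient summary score.

```python
import re

def mean_reciprocal_rank(ranked_paragraphs_per_question, answer_patterns):
    """Score retrieval runs with TREC-style answer-pattern regexes.

    ranked_paragraphs_per_question: dict qid -> list of paragraphs, best first.
    answer_patterns: dict qid -> regex string; a paragraph counts as correct
    if the pattern matches it."""
    total = 0.0
    for qid, paragraphs in ranked_paragraphs_per_question.items():
        pattern = re.compile(answer_patterns[qid], re.IGNORECASE)
        for rank, paragraph in enumerate(paragraphs, start=1):
            if pattern.search(paragraph):
                total += 1.0 / rank
                break
    return total / len(ranked_paragraphs_per_question)

# Illustrative question/pattern pair (not an actual TREC-9 pattern):
ranked = {1: ["The party gathered in London.",
              "The Conservative Party has chosen John Major."]}
patterns = {1: r"Conservative\s+Party"}
print(mean_reciprocal_rank(ranked, patterns))   # 0.5
```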
