A Model for Learning Words by Crawling the Web
Jeff Thomson, Sygys.com
Rex Gantenbein, University of Wyoming
CAINE November 2009
Overview
• Goal: create an autonomous language learning system
• Use Web crawler technology
• Extract meaning from paragraphs and sentences to create language understanding
• Major issues
  • Irregularity of natural language constructions
  • Understanding paragraphs and sentences
  • Determining meaning of new words
Handling irregularities
• Most major parts of a language (English, anyway) can be generalized
• Exceptions require preprocessing to fit them into generalizable categories
• Example: inflectional endings on verbs
    Regular (bat)    Irregular (be)
    bat              is
    bats             am
    batting          are
    batted           was
Handling irregularities
• Idiomatic phrases require understanding of the entire phrase in a colloquial context
  “Go jump in the lake” vs. “Go cook yourself an egg”
• Pronoun resolution
  “Three boys each bought a pizza. They ate them in the park.”
Extracting understanding
• Paragraph understanding
  • Matching paragraph structure to common forms
  • Finding the nucleus of the paragraph’s meaning
• Sentence understanding
  • Matching sentence structure to common forms
  • Determining the meaning of the words in the sentence
Our approach
• Exception-first processing
• Preprocessing to handle irregularities
• Linguistic classifications based on tree structure
Our approach
• Parser (incorporated into Web crawler) to determine structure
  • Some structures are disregarded when keywords are already classified
• Word classification
  • Type, gender, number
  • Unknown words are analyzed according to rules using placement in sentence and surrounding classified words
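The rule-based analysis of unknown words described on this slide could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the lexicon, the placement rules, and the function name `classify_sentence` are hypothetical and not taken from the authors' system, and the sketch guesses only the word type (not gender or number).

```python
# Hypothetical lexicon of already classified words (illustrative only).
LEXICON = {
    "the": "determiner", "a": "determiner",
    "dog": "noun", "ball": "noun",
    "chased": "verb", "is": "verb",
    "red": "adjective",
}

# Hypothetical placement rules: (type before, type after) -> guessed type.
# "boundary" stands for the start or end of the sentence.
PLACEMENT_RULES = {
    ("determiner", "verb"): "noun",
    ("determiner", "noun"): "adjective",
    ("determiner", "boundary"): "noun",
    ("adjective", "verb"): "noun",
    ("adjective", "boundary"): "noun",
    ("noun", "determiner"): "verb",
}


def classify_sentence(words):
    """Return (word, type) pairs, guessing unknown words from their neighbors."""
    known = [LEXICON.get(w.lower()) for w in words]
    result = []
    for i, (word, tag) in enumerate(zip(words, known)):
        if tag is None:
            # Look only at neighbors that are already classified in the lexicon.
            before = known[i - 1] if i > 0 else "boundary"
            after = known[i + 1] if i + 1 < len(known) else "boundary"
            tag = PLACEMENT_RULES.get((before, after), "unknown")
        result.append((word, tag))
    return result


if __name__ == "__main__":
    print(classify_sentence("The wug chased the ball".split()))
```

Here the made-up word "wug" sits between a determiner and a verb, so the assumed rule table guesses that it is a noun. A word whose neighbors are themselves unclassified stays "unknown" until more context is available.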
Our approach
• Keyword recognition
  • Use “word chains” (sequences of words) with application of linguistic knowledge
• Word-level understanding
  • Reduce words to root form to process them as keywords
  • Reduce irregular forms using an exception database created at preprocessing
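One way to picture “word chains” is as contiguous word sequences collected from crawled sentences and counted, so that recurring chains become keyword candidates. The Python sketch below assumes that reading; the functions `word_chains` and `count_chains`, the whitespace tokenization, and the maximum chain length are all assumptions, and the linguistic knowledge the slide mentions is not modeled.

```python
from collections import Counter


def word_chains(tokens, max_len=3):
    """Yield every contiguous sequence of 1..max_len tokens as a tuple."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def count_chains(sentences, max_len=3):
    """Count how often each word chain occurs across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenization: lowercase and split on whitespace (an assumption).
        tokens = sentence.lower().split()
        counts.update(word_chains(tokens, max_len))
    return counts


if __name__ == "__main__":
    counts = count_chains(["Web crawler technology", "a Web crawler"])
    print(counts[("web", "crawler")])
```

Chains that recur across many sentences, such as ("web", "crawler") in this toy input, would be the natural candidates to treat as single keywords.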
System model
• Exception database
  • Separates generalizable and exception verbs
  • Processes word endings
  • Scans exception database for exceptions
  • Processes “normal” words according to rules
System model
• Categorization generator
  • Separates generalizable and exception words
  • Processes word endings
  • Scans exception database for exceptions and processes these first
  • Processes “normal” words according to rules
• Sentence parser with disregard capacity
• Paragraph understanding rules
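The exception-first order in the categorization generator (scan the exception database, then fall back to general rules for word endings) can be sketched in Python as below. The exception table, the suffix list, and the function `to_root` are hypothetical stand-ins, not the authors' data or rules, and the suffix handling is deliberately crude.

```python
# Hypothetical exception database: irregular form -> root form.
EXCEPTIONS = {
    "is": "be", "am": "be", "are": "be", "was": "be", "were": "be",
    "went": "go",
}

# Hypothetical general rules: inflectional endings tried in this order.
SUFFIXES = ("ing", "ed", "s")


def to_root(word):
    """Reduce a word to a root form, checking exceptions before general rules."""
    w = word.lower()
    # Step 1: exceptions are processed first.
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    # Step 2: "normal" words are processed according to the ending rules.
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            stem = w[:-len(suffix)]
            # Undo consonant doubling, e.g. "batt" -> "bat" (crude heuristic).
            if (suffix != "s" and stem[-1] == stem[-2]
                    and stem[-1] not in "aeiouls"):
                stem = stem[:-1]
            return stem
    return w


if __name__ == "__main__":
    for form in ["bat", "bats", "batting", "batted", "is", "am", "are", "was"]:
        print(form, "->", to_root(form))
```

Checking exceptions first keeps irregular forms such as "was" from ever reaching the suffix rules, which would otherwise strip the final "s" and produce a meaningless stem.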
System model
• Web crawler searches for source material
  • Processes the material and enhances its own rules and exceptions
  • Eventually will learn enough to understand most material in a given language
• Future work
  • Implement a pilot version of this system
  • Determine how to control for a “given” language
Questions?