The Knowledge Acquisition Bottleneck Revisited: How can we build large KBs?

Illustrations of different approaches. Peter Clark and John Thompson, Boeing Research, 2004

Premise • Intelligent machines need lots of knowledge, for • question-answering • intelligent search • information integration • natural language understanding • decision support • modeling • etc. • Much of this knowledge can be drawn from some general repository of reusable knowledge • e.g., WordNet • How does one build such a repository? “No-one considers hand-building a large KB to be a realistic proposition these days” [paraphrase of Daphne Koller, 2004]

1. Build it by Hand • “Let’s roll up our sleeves and get on with it!” • But: It’s a daunting task • Our own work • Cyc + Lots in it; (relatively) well-designed ontology - 650 person-years of effort so far - Still patchy coverage (why?) • Difficult to use outside Cycorp

1. Build it by Hand (cont) • WordNet + Easy to use + Comprehensive • Little inference-supporting knowledge in it • Ad hoc ontology

1. Build it by Hand (cont) • The Component Library Claim: can bound the required knowledge by working at a coarse-grained level + Large, more doable • Hard to use, still very incomplete

2. Extract from Dictionaries - MindNet + Automatically built • Unusable? • Extended WordNet + Won TREC competition - Still somewhat incoherent • A lot of manual labor

3. Corpus-based Text/Web Mining - Schubert’s system + Automatic + Lots of knowledge • Noisy • No word senses • Only grabs certain kinds of knowledge 30M entries…

3. Corpus-based Text/Web Mining (cont) - KnowIt (Etzioni) + Automatic • Only factoids
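To make the trade-offs of corpus mining concrete, here is a minimal sketch of pattern-based extraction, assuming a single hand-written "X such as Y" pattern. It is not the method of any system named in these slides; the pattern, the function name `extract_factoids`, and the `isa` relation label are all hypothetical choices for illustration. It shows why the approach is automatic yet yields only factoids, with no word senses attached.

```python
import re
from collections import Counter

# Hypothetical extraction pattern (an assumption, not taken from the slides):
#   "<class> such as <Instance>, <Instance> and <Instance>"
# Instances are assumed to be single capitalized words.
PATTERN = re.compile(
    r"\b(?P<cls>[A-Za-z]+) such as "
    r"(?P<items>[A-Z]\w+(?:(?:, (?:and |or )?| and | or )[A-Z]\w+)*)"
)

# Separators between instances inside a matched list.
SEPARATOR = re.compile(r", (?:and |or )?| and | or ")


def extract_factoids(text):
    """Return a Counter of (instance, "isa", class) triples found in text.

    The class is kept as a bare lowercased string: no lemmatization and
    no word-sense disambiguation, so "banks" stays ambiguous.
    """
    facts = Counter()
    for match in PATTERN.finditer(text):
        cls = match.group("cls").lower()
        for instance in SEPARATOR.split(match.group("items")):
            facts[(instance, "isa", cls)] += 1
    return facts


sample = (
    "He visited cities such as Paris, London and Rome. "
    "Banks such as Barclays lend money."
)
facts = extract_factoids(sample)
```

Each extracted triple is an isolated factoid. Nothing in the output says whether "banks" means financial institutions or river banks, and counting repeated triples across a large corpus is the only defence against noise in this sketch.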

4. Community-Based Acquisition • Knowledge entry by the masses • OpenMind + Large • Full of junk, unusable (?) • Would this work with better acquisition tools? (see next slide for illustration)

5. Use Existing Resources • e.g., • databases • CIA World Fact Book • Web data/services • e.g., SRI/ISI’s ARDA QA system + Syntactically simple + Available • Largely limited to factoids • Information integration is a major challenge • different ontologies, contradictory data
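The integration challenge can be sketched as follows, assuming two sources that describe the same entities under different field names. The source names, the field mappings in `FIELD_MAP`, the function `integrate`, and the sample records are all invented for illustration and do not come from any real database or from the systems mentioned above.

```python
# Hypothetical mappings from each source's own field names ("ontology")
# to one shared vocabulary. These names are assumptions for illustration.
FIELD_MAP = {
    "factbook": {"pop": "population", "cap": "capital"},
    "webservice": {"inhabitants": "population", "capital_city": "capital"},
}


def integrate(records):
    """Merge (source, entity, field, value) records into a shared vocabulary.

    Returns (merged, conflicts):
      merged    maps (entity, shared_field) -> {value: [sources asserting it]}
      conflicts is the subset of merged where the sources disagree.
    """
    merged = {}
    for source, entity, field, value in records:
        shared_field = FIELD_MAP[source].get(field)
        if shared_field is None:
            continue  # no mapping known for this field, so it is dropped
        by_value = merged.setdefault((entity, shared_field), {})
        by_value.setdefault(value, []).append(source)
    conflicts = {key: vals for key, vals in merged.items() if len(vals) > 1}
    return merged, conflicts


# Made-up records about a fictional entity.
records = [
    ("factbook", "Atlantis", "pop", 1000),
    ("webservice", "Atlantis", "inhabitants", 1200),
    ("factbook", "Atlantis", "cap", "Poseidonia"),
    ("webservice", "Atlantis", "capital_city", "Poseidonia"),
    ("webservice", "Atlantis", "motto", "unmapped field"),
]
merged, conflicts = integrate(records)
```

Even in this toy case, the mapping table has to be written by hand, unmapped fields are silently lost, and the sketch can only flag the contradictory population values, not decide which source is right.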

Where to? • Can we bound the knowledge needed • for a particular application • for a useful, sharable, general resource? • Which of these approaches seems most realistic? • build by hand • extract from dictionaries • mine text corpora • community knowledge entry • use existing resources
