Automatic Classification of Text Databases Through Query Probing

Panagiotis G. Ipeirotis, Luis Gravano (Columbia University) • Mehran Sahami (E.piphany Inc.)

Search-only Text Databases • Sources of valuable information • Hidden behind search interfaces • Non-crawlable Example: Microsoft Support KB

Interacting With Searchable Text Databases • Searching: Metasearchers • Browsing: Use Yahoo-like directories • Browse & search: “Category-enabled” metasearchers

Searching Text Databases: Metasearchers • Select the good databases for a query • Evaluate the query at these databases • Combine the query results from the databases Examples: MetaCrawler, SavvySearch, Profusion

Browsing Through Text Databases • Yahoo-like web directories: • InvisibleWeb.com • SearchEngineGuide.com • TheBigHub.com Example from InvisibleWeb.com Computers > Publications > ACM DL • Category-enabled metasearchers • User-defined category (e.g. Recipes)

Problem With Current Classification Approach • Classification of databases is done manually • This requires a lot of human effort!

How to Classify Text Databases Automatically: Outline • Definition of classification • Strategies for classifying searchable databases through query probing • Initial experiments

Database Classification: Two Definitions • Coverage-based classification: • The database contains many documents about the category (e.g. Basketball) • Coverage: #docs about this category • Specificity-based classification: • The database contains mainly documents about this category • Specificity: #docs/|DB|
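A minimal sketch of the two measures, assuming we already knew the category of every document in a database (which, as the later slides point out, a search-only database does not reveal). The function names and the document counts are hypothetical and serve only to illustrate the definitions.

```python
from collections import Counter

def coverage(doc_categories, category):
    """Coverage: number of documents in the database about `category`."""
    return Counter(doc_categories)[category]

def specificity(doc_categories, category):
    """Specificity: fraction of the database's documents about `category`."""
    if not doc_categories:
        return 0.0
    return coverage(doc_categories, category) / len(doc_categories)

# Hypothetical databases: one category label per document (made-up counts).
general_sports_db = ["Basketball"] * 300 + ["Football"] * 700
basketball_db = ["Basketball"] * 90 + ["Other"] * 10

print(coverage(general_sports_db, "Basketball"))     # 300
print(specificity(general_sports_db, "Basketball"))  # 0.3
print(coverage(basketball_db, "Basketball"))         # 90
print(specificity(basketball_db, "Basketball"))      # 0.9
```

With these invented numbers the general sports database has the higher coverage while the basketball-only database has the higher specificity, so the two definitions can disagree about which databases belong in a category.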

Database Classification: An Example • Category: Basketball • Coverage-based classification • ESPN.com, NBA.com • Specificity-based classification • NBA.com, but not ESPN.com

Categorizing a Text Database: Two Problems • Find the category of a given document • Find the category of all the documents inside the database

Categorizing Documents • Several text classifiers available • RIPPER (AT&T Research, William Cohen 1995) • Input: A set of pre-classified, labeled documents • Output: A set of classification rules

Categorizing Documents: RIPPER • Training set: Preclassified documents • “Linux as a web server”: Computers • “Linux vs. Windows: …”: Computers • “Jordan was the leader of Chicago Bulls”: Sports • “Smoking causes lung cancer”: Health • Output: Rule-based classifier • IF linux THEN Computers • IF jordan AND bulls THEN Sports • IF lung AND cancer THEN Health
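A small sketch of how rules of this shape could be applied to a document once they have been learned. This is not RIPPER itself (rule learning is not shown); the `RULES` encoding, the tokenizer, and the `classify` function are illustrative assumptions.

```python
import re

# The three rules from the slide, encoded as (required words, category).
RULES = [
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"lung", "cancer"}, "Health"),
]

def classify(text, rules=RULES):
    """Return the category of every rule whose words all occur in `text`."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [category for required, category in rules if required <= words]

print(classify("Linux as a web server"))                   # ['Computers']
print(classify("Jordan was the leader of Chicago Bulls"))  # ['Sports']
print(classify("Smoking causes lung cancer"))              # ['Health']
```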

Precision and Recall of Document Classifier During the training phase: • 100 documents about computers • “Computer” rules matched 50 docs • From these 50 docs 40 were about computers • Precision = 40/50 = 0.8 • Recall = 40/100 = 0.4
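The arithmetic on this slide can be reproduced in a few lines. The function and variable names are illustrative; the numbers are the ones given above.

```python
def precision(correct_matches, matched):
    """Fraction of matched documents that really belong to the category."""
    return correct_matches / matched

def recall(correct_matches, relevant):
    """Fraction of the category's documents that the rules matched."""
    return correct_matches / relevant

relevant_docs = 100    # documents about computers in the training set
matched_docs = 50      # documents matched by the "Computer" rules
correct_matches = 40   # matched documents that are about computers

print(precision(correct_matches, matched_docs))  # 0.8
print(recall(correct_matches, relevant_docs))    # 0.4
```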

From Document to Database Classification • If we know the categories of all the documents, we are done! • But databases do not export such data! How can we extract this information?

Our Approach: Query Probing • Design a small set of queries to probe the databases • Categorize the database based on the probing results

Designing and Implementing Query Probes The probes should extract information about the categories of the documents in the database • Start with a document classifier (RIPPER) • Transform each rule into a query: IF lung AND cancer THEN health → +lung +cancer; IF linux THEN computers → +linux • Get number of matches for each query
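The rule-to-query step could be sketched as follows, assuming rules are written as plain strings of the form "IF a AND b THEN category" and that the target search interface accepts "+term" for required terms. The `rule_to_query` helper is hypothetical, and sending the query to a real database to read back the match count is not shown.

```python
def rule_to_query(rule):
    """Turn 'IF a AND b THEN category' into ('+a +b', 'category')."""
    if not rule.startswith("IF ") or " THEN " not in rule:
        raise ValueError(f"unrecognized rule format: {rule!r}")
    condition, category = rule[len("IF "):].split(" THEN ", 1)
    terms = [term.strip() for term in condition.split(" AND ")]
    # Every term of the rule becomes a required term of the probe.
    query = " ".join("+" + term for term in terms)
    return query, category.strip()

print(rule_to_query("IF lung AND cancer THEN health"))  # ('+lung +cancer', 'health')
print(rule_to_query("IF linux THEN computers"))         # ('+linux', 'computers')
```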

Three Categories and Three Databases [Figure: the probes “linux” (Computers), “jordan AND bulls” (Sports), and “lung AND cancer” (Health) are sent to three databases: ACM DL, NBA.com, and PubMed]

Using the Results for Classification We use the results to estimate coverage and specificity values
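The slide does not spell out how match counts become estimates, so the following is only one possible reading. It assumes that a category's estimated coverage is the sum of the matches for its probes and that, because |DB| is not exposed by the search interface, specificity is approximated by each category's share of the total estimated coverage. Both choices, the function names, and the match counts are assumptions made for illustration.

```python
def estimate_coverage(matches_by_category):
    """Assumed estimate: sum the match counts of each category's probes."""
    return {cat: sum(counts) for cat, counts in matches_by_category.items()}

def estimate_specificity(coverage):
    """Assumed estimate: each category's share of the total coverage."""
    total = sum(coverage.values())
    if total == 0:
        return {cat: 0.0 for cat in coverage}
    return {cat: value / total for cat, value in coverage.items()}

# Hypothetical match counts, one number per probe sent to a database.
probe_matches = {"Computers": [120, 60], "Sports": [15], "Health": [5]}

cov = estimate_coverage(probe_matches)
spec = estimate_specificity(cov)
for cat in cov:
    print(f"{cat}: coverage={cov[cat]}, specificity={spec[cat]:.3f}")
```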

Adjusting Query Results • Classifiers are not perfect! • Queries do not “retrieve” all the documents that belong to a category • Queries for one category “match” documents that do not belong to this category • From the training phase of classifier we use precision and recall

Precision & Recall Adjustment • Computer-category: • Rule: “linux”, Precision = 0.7 • Rule: “cpu”, Precision = 0.9 • Recall (for all the rules) = 0.4 • Probing with queries for “Computers”: • Query: +linux → X1 matches → 0.7 X1 correct matches • Query: +cpu → X2 matches → 0.9 X2 correct matches • From X1 + X2 documents found: • Expect 0.7 X1 + 0.9 X2 to be correct • Expect (0.7 X1 + 0.9 X2)/0.4 total computer docs
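The adjustment on this slide, worked through in code. The precision and recall values come from the slide; the match counts standing in for X1 and X2 and the function name are hypothetical.

```python
def estimate_category_docs(rule_matches, rule_precision, category_recall):
    """Scale each probe's matches by its rule precision, then divide the
    expected number of correct matches by the category's recall."""
    expected_correct = sum(
        matches * rule_precision[rule] for rule, matches in rule_matches.items()
    )
    return expected_correct / category_recall

precisions = {"linux": 0.7, "cpu": 0.9}   # per-rule precision (from the slide)
recall_computers = 0.4                     # recall over all "Computers" rules
matches = {"linux": 100, "cpu": 50}        # hypothetical X1 and X2

# Expected correct matches: 0.7 * 100 + 0.9 * 50 = 115
# Estimated computer documents: 115 / 0.4 = 287.5
estimate = estimate_category_docs(matches, precisions, recall_computers)
print(f"{estimate:.1f}")  # 287.5
```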

Initial Experiments • Used a collection of 20,000 newsgroup articles • Formed 5 categories: • Computers (comp.*) • Science (sci.*) • Hobbies (rec.*) • Society (soc.* + alt.atheism) • Misc (misc.sale) • RIPPER trained with 10,000 newsgroup articles • Classifier: 29 rules, 32 words used • IF windows AND pc THEN Computers (precision~0.75) • IF satellite AND space THEN Science (precision~0.9)

Web-databases Probed • Using the newsgroup classifier we probed four web databases: • Cora (www.cora.jprc.com) CS Papers archive (Computers) • American Scientist (www.amsci.org) Science and technology magazine (Science) • All Outdoors (www.alloutdoors.com) Articles about outdoor activities (Hobbies) • Religion Today (www.religiontoday.com) News and discussion about religions (Society)

Results • Only 29 queries per web site • No need for document retrieval!

Conclusions • Easy classification using only a small number of queries • No need for document retrieval • Only need a result like: “X matches found” • Not limited to search-only databases • Every searchable database can be classified this way • Not limited to topical classification

Current Issues • Comprehensive classification scheme • Representative training data

Future Work • Use a hierarchical classification scheme • Test different search interfaces • Boolean model • Vector-space model • Different capabilities • Compare with document sampling (Callan et al.’s work – SIGMOD99, adapted for the classification task) • Study classification efficiency when documents are accessible

Related Work • Gauch (JUCS 1996) • Etzioni et al. (JIIS 1997) • Hawking & Thistlewaite (TOIS 1999) • Callan et al. (SIGMOD 1999) • Meng et al. (CoopIS 1999)
