A Two-Stage Approach to Domain Adaptation for Statistical Classifiers Jing Jiang & ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign
What is domain adaptation? (CIKM 2007)
Example: named entity recognition (persons, locations, organizations, etc.)
Standard supervised learning: train on labeled New York Times text, test on unlabeled New York Times text → NER classifier, 85.5%.
Non-standard (realistic) setting: labeled data for Reuters is not available, so the classifier is trained on labeled New York Times text and tested on unlabeled Reuters text → NER classifier, 64.1%.
Domain difference → performance drop:
• ideal setting: train on NYT, test on NYT → 85.5%
• realistic setting: train on NYT, test on Reuters → 64.1%
Another NER example: gene name recognition.
• ideal setting: train on mouse, test on mouse → 54.1%
• realistic setting: train on fly, test on mouse → 28.1%
Other examples
• Spam filtering: public email collection → personal inboxes
• Sentiment analysis of product reviews: digital cameras → cell phones; movies → books
• Can we do better than standard supervised learning?
• Domain adaptation: design learning methods that are aware of the difference between the training and test domains.
Observation 1: domain-specific features. Fly gene names such as wingless, daughterless, eyeless, apexless, … describe a phenotype, following fly gene nomenclature, so the feature "-less" is weighted high. Is this feature still useful for other organisms, with gene names like CD38 or PABPC5? No!
Observation 2: generalizable features.
"…decapentaplegic and wingless are expressed in analogous patterns in each…"
"…that CD38 is expressed by both neurons and glial cells… that PABPC5 is expressed in fetal brain and in a range of adult tissues."
The feature "X be expressed" fires across organisms.
General idea: a two-stage approach [diagram: source and target domains in feature space; generalizable features lie in their overlap, domain-specific features outside it]
Goal: a classifier that works well on the target domain [diagram]
Regular classification [diagram]
Stage 1. Generalization: emphasize generalizable features in the trained model [diagram]
Stage 2. Adaptation: pick up domain-specific features for the target domain [diagram]
Regular semi-supervised learning [diagram]
Comparison with related work
• We explicitly model generalizable features; previous work models them implicitly [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
• We do not need labeled target data, but we do need multiple source (training) domains; some prior work requires labeled target data [Daumé III 2007].
• We add a second stage of adaptation that uses semi-supervised learning; previous work does not incorporate semi-supervised learning [Blitzer et al. 2006, Ben-David et al. 2007, Daumé III 2007].
Implementation of the two-stage approach with logistic regression classifiers
Logistic regression classifiers: p(y | x) ∝ exp(wyᵀx), where x is a vector of p binary features (e.g. the suffix "-less", or the pattern "X be expressed" as in "…and wingless are expressed in…") and wy is the weight vector for class y.
Learning a logistic regression classifier: maximize the log likelihood of the training data minus a regularization term that penalizes large weights and controls model complexity, e.g.
maxw Σᵢ log p(yᵢ | xᵢ; w) − λ‖w‖²
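As a concrete sketch of this training objective (a minimal binary-case illustration in plain numpy, not the authors' implementation; the paper's multi-class setup keeps one weight vector wy per class):

```python
import numpy as np

def train_logreg(X, y, lam=0.1, lr=0.5, iters=2000):
    """Maximize (1/n) * sum_i log p(y_i | x_i; w) - lam * ||w||^2
    by gradient ascent, with p(y=1 | x) = sigmoid(w^T x)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))           # P(y=1 | x) per example
        grad = (X.T @ (y - p)) / n - 2.0 * lam * w   # gradient of the objective
        w += lr * grad
    return w

# toy data: feature 0 determines the label, feature 1 is noise
X = np.array([[1.0, 0.2], [0.9, 0.8], [-1.0, 0.1], [-0.8, 0.9]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = train_logreg(X, y)
preds = (X @ w > 0).astype(float)
```

Raising `lam` shrinks all weights toward zero; the two-stage framework below tunes exactly this lever, per feature group, to favor or suppress domain-specific weights.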
Generalizable features in weight vectors: train one classifier on each of the K source domains D1, …, DK, obtaining weight vectors w1, …, wK. Generalizable features are weighted consistently high across all domains (e.g. weights 5, 4.5, 4.2 for one feature, 3.0, 3.5, 3.2 for another), while domain-specific features are weighted high only in some domains.
We want to decompose w into a shared component with h non-zero entries, one for each of the h generalizable features, plus a domain-specific component: e.g. w = (0.2, 4.5, 5, −0.3, 3.0, …)ᵀ = (0, 0, 4.6, 0, 3.2, …)ᵀ + (0.2, 4.5, 0.4, −0.3, −0.2, …)ᵀ.
Feature selection matrix A: an h × p matrix whose rows are one-hot indicators of the h generalizable features, so that z = Ax keeps only those features.
Decomposition of w: the score splits as wᵀx = vᵀz + uᵀx, where v carries the weights of the generalizable features (in the reduced h-dimensional space) and u the weights of the domain-specific features. Substituting z = Ax:
wᵀx = vᵀz + uᵀx = vᵀAx + uᵀx = (Aᵀv)ᵀx + uᵀx, hence w = Aᵀv + u,
where Aᵀv is shared by all domains and u is domain-specific.
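A tiny numpy illustration of this decomposition, with hypothetical sizes and weights (p = 5 features, h = 2 generalizable ones):

```python
import numpy as np

p, h = 5, 2
generalizable = [1, 3]        # hypothetical indices of the h generalizable features

# selection matrix A (h x p): row i is one-hot at the i-th generalizable feature
A = np.zeros((h, p))
for i, f in enumerate(generalizable):
    A[i, f] = 1.0

v = np.array([4.6, 3.2])                   # shared weights (reduced space)
u = np.array([0.2, 0.0, 0.4, 0.0, -0.9])   # domain-specific weights

w = A.T @ v + u                            # full p-dimensional weight vector
x = np.array([0, 1, 0, 1, 1], dtype=float) # a binary feature vector
z = A @ x                                  # z = Ax selects generalizable features
```

The two scoring forms agree by construction: wᵀx equals vᵀz + uᵀx.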
Framework for generalization (Stage 1). Fix A and optimize, over the weight vectors wk, the log likelihood of the labeled data from the K source domains plus a regularization term, with λs >> 1 to penalize domain-specific features.
Framework for adaptation (Stage 2). Fix A and optimize the likelihood of pseudo-labeled target-domain examples, with λt = 1 << λs so that the model can pick up domain-specific features in the target domain.
How to find A? (1)
• Joint optimization
• Alternating optimization
How to find A? (2)
• Domain cross validation
• Idea: train on (K − 1) source domains and test on the held-out source domain
• Approximation:
  • w_f^k: weight for feature f learned from domain k
  • w̄_f^k: weight for feature f learned from the other domains
  • rank features by the agreement between w_f^k and w̄_f^k
• See the paper for details
Intuition for domain cross validation: train on domains D1 … Dk−1 and hold out Dk (fly). A generalizable feature such as "expressed" keeps a high weight either way (e.g. 2.0 with fly included, 1.8 with fly held out), while a fly-specific feature such as "-less" is weighted high only when fly is in the training data (e.g. 1.5 dropping to 0.05).
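The exact ranking formula is elided on the slide (it is in the paper); as a stand-in, here is one plausible consistency score, a hypothetical criterion of my own: reward a feature when the weight learned from one domain agrees with the weight learned from the remaining domains.

```python
import numpy as np

# hypothetical per-domain weight vectors echoing the slide's numbers;
# rows = domains D1..D3, columns = features ("expressed", "-less", noise)
W = np.array([
    [2.0, 1.5, 0.1],    # D1 = fly: "-less" looks useful here
    [1.8, 0.05, -0.2],  # D2: "-less" is useless outside fly
    [2.1, 0.1, 0.3],    # D3
])
K, p = W.shape

scores = np.zeros(p)
for k in range(K):
    w_k = W[k]                              # w_f^k: learned from domain k
    w_rest = W[np.arange(K) != k].mean(0)   # w-bar_f^k: learned from the others
    scores += w_k * w_rest                  # high when both agree in sign and size
scores /= K

ranking = np.argsort(-scores)               # most generalizable feature first
```

Under this score, "expressed" ranks first and the fly-only "-less" falls behind, matching the intuition above.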
Experiments
• Data set
  • BioCreative Challenge Task 1B
  • Gene/protein name recognition
  • 3 organisms/domains: fly, mouse and yeast
• Experiment setup
  • 2 organisms for training, 1 for testing
  • F1 as performance measure
Experiments: Generalization [results charts; F: fly, M: mouse, Y: yeast]. Using generalizable features is effective, and domain cross validation is more effective than joint optimization.
Experiments: Adaptation [results charts; F: fly, M: mouse, Y: yeast]. Domain-adaptive bootstrapping is more effective than regular bootstrapping.
Experiments: Adaptation. Domain-adaptive SSL is more effective, especially with a small number of pseudo labels.
Conclusions and future work
• Two-stage domain adaptation
  • Generalization: outperformed standard supervised learning
  • Adaptation: outperformed standard bootstrapping
• Two ways to find generalizable features
  • Domain cross validation is more effective
• Future work
  • Single source domain?
  • Setting the parameters h and m
References
• S. Ben-David, J. Blitzer, K. Crammer & F. Pereira. Analysis of representations for domain adaptation. NIPS 2007.
• J. Blitzer, R. McDonald & F. Pereira. Domain adaptation with structural correspondence learning. EMNLP 2006.
• H. Daumé III. Frustratingly easy domain adaptation. ACL 2007.
Thank you!