
Comparing learning approaches to coreference resolution. There is more to it than ‘bias’.


Presentation Transcript


  1. Comparing learning approaches to coreference resolution. There is more to it than ‘bias’. Véronique Hoste and Walter Daelemans University of Antwerp

  2. The “bias” of the machine learner • = the search heuristics a machine learning method uses and the way it represents the learned knowledge • e.g. decision tree learners favor compact decision trees • ⇒ compare machine learning algorithms experimentally on specific tasks • e.g. how do they treat exceptions or low-frequency cases?

  3. Comparative ML studies • Two or more algorithms are compared for a fixed sample selection, feature selection and representation over a number of trials • Sometimes learning curves and limited parameter optimization are included • E.g. Mooney (96), Daelemans et al. (99), Lee and Ng (02), ...

  4. What influences the outcome of a (comparative) ML experiment? • Information sources: feature selection, feature representation (data transforms) • Algorithm parameters • Training data: sample selection, sample size (Banko & Brill) • Interactions: algorithm parameters and sample selection, algorithm parameters and feature representation, feature representation and sample selection, sample size and feature selection, feature selection and algorithm parameters, ... (a sketch of such a joint experiment follows below)
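To make the point concrete, below is a minimal sketch of a comparative experiment in which the algorithm, its parameters and the feature subset all vary jointly, rather than the algorithm alone. It uses scikit-learn stand-ins (KNeighborsClassifier for a lazy learner, DecisionTreeClassifier for an eager one) and synthetic skewed data; the paper's own experiments ran TiMBL and Ripper on coreference data.

    from itertools import product

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for a skewed coreference data set (~10% positives).
    X, y = make_classification(n_samples=500, n_features=12, weights=[0.9],
                               random_state=0)

    # Two learners (lazy vs. eager), each with a small parameter grid.
    learners = {
        "lazy":  lambda p: KNeighborsClassifier(**p),
        "eager": lambda p: DecisionTreeClassifier(random_state=0, **p),
    }
    param_grids = {
        "lazy":  [{"n_neighbors": k} for k in (1, 3, 5)],
        "eager": [{"min_samples_leaf": m} for m in (1, 5, 20)],
    }
    feature_subsets = {"all": slice(None), "first-half": slice(0, 6)}

    # Vary algorithm, parameters and feature subset jointly.
    for name, subset in product(learners, feature_subsets):
        for params in param_grids[name]:
            f1 = cross_val_score(learners[name](params),
                                 X[:, feature_subsets[subset]], y,
                                 cv=10, scoring="f1").mean()
            print(f"{name:5s} features={subset:10s} {params} F1={f1:.3f}")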

  5. Experiment • Investigate the effect of • algorithm parameter optimization • feature selection (e.g. backward selection) • sample selection • interleaved feature selection and parameter optimization • … on the comparison of two inductive algorithms (lazy and eager) • … on the task of coreference resolution

  6. Algorithms compared • Ripper • Cohen, 95 • Rule Induction • Algorithm parameters: different class ordering principles; negative conditions or not; loss ratio values; cover parameter values • TiMBL • Daelemans/Zavrel/van der Sloot/van den Bosch, 98 • Memory-Based Learning • Algorithm parameters: ib1, igtree; overlap, mvdm; 5 feature weighting methods; 4 distance weighting methods; 10 values of k
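As a back-of-the-envelope check on how large these parameter spaces are, the sketch below multiplies out the TiMBL options listed on the slide. The product is only an upper bound: the slide does not say which combinations are valid (igtree, for instance, would presumably not use all of them), which may be why slide 23 reports 404 rather than 800 CV runs.

    # Upper bound on the TiMBL parameter space enumerated above.
    timbl_space = {
        "algorithm": 2,            # ib1, igtree
        "metric": 2,               # overlap, mvdm
        "feature_weighting": 5,    # 5 feature weighting methods
        "distance_weighting": 4,   # 4 distance weighting methods
        "k": 10,                   # 10 values of k
    }
    n_combinations = 1
    for size in timbl_space.values():
        n_combinations *= size
    print(n_combinations)          # 800 raw combinations before pruning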

  7. Motivation: lazy versus eager • Natural language data sets: highly disjunctive • a lazy learning approach allows extrapolation from low-frequency or exceptional cases • an eager approach tends to treat these as discardable noise • Coreference resolution data sets: highly skewed class distribution • Worst case: a classifier that describes the data as one single class, or a classifier that overfits the data

  8. Defining coreference • Anaphora is the device of making in discourse an abbreviated reference [ANAPHOR] to some entity in the expectation that the perceiver will be able to disabbreviate the reference and thereby determine the identity of the entity.

  9. Defining coreference • Anaphora is the device of making in discourse an abbreviated reference [ANAPHOR] to some entity [ANTECEDENT or REFERENT] in the expectation that the perceiver will be able to disabbreviate the reference and thereby determine the identity of the entity.

  10. Defining coreference • Anaphora is the device of making in discourse an abbreviated reference [ANAPHOR] to some entity [ANTECEDENT or REFERENT] in the expectation that the perceiver will be able to disabbreviate the reference [RESOLUTION] and thereby determine the identity of the entity.

  11.-13. In 1983 Alfred Heineken and his driver were kidnapped. The kidnappers asked a ransom of 43 million guilders. A modest sum, they thought. • (Three successive builds of this slide highlight the coreference chains: {Alfred Heineken, his}, {the kidnappers, they}, {a ransom of 43 million guilders, a modest sum}.)

  14. The monkey ate the banana because it was hungry. Der Affe aß die Banane weil er Hunger hatte. • The monkey ate the banana because it was ripe. Der Affe aß die Banane weil sie reif war. • The monkey ate the banana because it was lunch-time. Der Affe aß die Banane weil es Zeit zum Essen war. • (German grammatical gender resolves the ambiguous English it: er = the monkey, sie = the banana, es = expletive.)

  15. The task of coreference resolution • Data sets: MUC-6, MUC-7 (English), KNACK-2002 (Dutch) • Linguistic preprocessing (tokenization, POS tagging, NP chunking, NE recognition) • Combination of positional, morphological, lexical, syntactic, string-matching and semantic features • 10-fold cross-validation • Precision, recall, F1

  16. Free text → tokenization → POS tagging → NP chunking → NER → relation finding → instance construction • Free text: Google is een beursgenoteerd bedrijf. (Dutch: “Google is a publicly listed company.”) • POS: N(eigen) V(pv,tgw,ev) LID(onbep) ADJ N(soort) . • NP chunking: [NP Google] [VP is] [NP een beursgenoteerd bedrijf] . • NER: Google = I-ORG • Relation finding: [SBJ Google] is [PREDC een beursgenoteerd bedrijf] . (A sketch of instance construction follows below.)
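The final pipeline step, instance construction, is the standard mention-pair setup: each NP is paired with the preceding NPs, and a pair is labeled positive if both belong to the same gold coreference chain. A minimal sketch with illustrative features only (the paper's actual feature set combines positional, morphological, lexical, syntactic, string-matching and semantic information):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class NP:
        text: str
        chain_id: Optional[int]  # gold coreference chain; None for singletons

    def build_instances(nps):
        """Pair each NP with every preceding NP; positive iff coreferent."""
        instances = []
        for j in range(1, len(nps)):
            for i in range(j):
                ana, ante = nps[j], nps[i]
                features = {                  # illustrative features only
                    "np_distance": j - i,
                    "string_match": int(ana.text.lower() == ante.text.lower()),
                }
                label = int(ana.chain_id is not None
                            and ana.chain_id == ante.chain_id)
                instances.append((features, label))
        return instances

    nps = [NP("Alfred Heineken", 1), NP("his", 1), NP("the kidnappers", 2),
           NP("a ransom of 43 million guilders", 3), NP("a modest sum", 3),
           NP("they", 2)]
    for feats, lab in build_instances(nps):
        print(feats, lab)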

  17. The task of coreference resolution • Data sets: MUC-6, MUC-7 (English), KNACK-2002 (Dutch) • Linguistic preprocessing (tokenization, POS tagging, NP chunking, NE recognition) • Combination of positional, morphological, lexical, syntactic, string-matching and semantic features • 10-fold cross-validation • Precision, recall, F1 (see the evaluation sketch below)
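The evaluation protocol maps directly onto cross-validated precision, recall and F1. A sketch with a scikit-learn stand-in classifier and synthetic data, since TiMBL and Ripper are not scikit-learn estimators:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_validate
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
    scores = cross_validate(KNeighborsClassifier(), X, y, cv=10,
                            scoring=("precision", "recall", "f1"))
    for metric in ("precision", "recall", "f1"):
        mean = scores["test_" + metric].mean()
        print(f"{metric}: {mean:.3f}")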

  18. (Diagram: the factors feeding into a comparative experiment, for each of Algorithm A, Algorithm B, ...: algorithm bias, feature selection, feature representation, algorithm parameters, sample selection, sample size.)

  19. Default classifier results (tables for MUC-6, MUC-7 and KNACK)

  20. Default classifier results • Ripper: high precision scores (a set of rules which seeks to capture the specificity of the minority class); lower recall scores (sensitivity to skewedness?) • TiMBL: a large number of false positives (sensitivity to low-informative features?); higher recall scores

  21. Searching the feature space • Hypothesis: • especially TiMBL will benefit from feature selection • algorithm-comparing differences can be overruled by algorithm-internal performance differences • Selection procedures: • backward elimination: start with all features and remove the features which do not contribute to prediction (a sketch follows below) • bidirectional hillclimbing (Caruana and Freitag 94): start with the features with the highest gain ratio and perform both backward and forward selection
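A minimal sketch of the backward-elimination loop, assuming a scikit-learn KNeighborsClassifier as the lazy learner and cross-validated F1 as the selection criterion; the paper ran the equivalent search with TiMBL and Ripper:

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def backward_elimination(X, y, cv=10):
        """Greedily drop any feature whose removal does not hurt CV F1."""
        clf = KNeighborsClassifier(n_neighbors=3)
        kept = list(range(X.shape[1]))
        best = cross_val_score(clf, X[:, kept], y, cv=cv, scoring="f1").mean()
        improved = True
        while improved and len(kept) > 1:
            improved = False
            for f in list(kept):
                trial = [g for g in kept if g != f]
                score = cross_val_score(clf, X[:, trial], y, cv=cv,
                                        scoring="f1").mean()
                if score >= best:  # removing f does not hurt: drop it
                    kept, best, improved = trial, score, True
                    break
        return kept, best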

  22. Searching the feature space (results for TiMBL and Ripper on MUC-6 and MUC-7) • Feature selection lifts the scores for TiMBL by up to 35%. • With respect to the selection procedures and the selected features, no general conclusions can be drawn.

  23. The effect of parameter optimization (results: TiMBL, 404 CV runs; Ripper, 649 CV runs)

  24. The effect of sample selection • Large class imbalances (e.g. KNACK: 6.3% positive data) • Experiments: random downsampling; downsampling of true negatives; changing the ratio of false negatives to false positives (a downsampling sketch follows below)
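A minimal sketch of random downsampling of the negative class to a chosen positive-to-negative ratio; the names are illustrative, and the slide's more targeted variant (downsampling of true negatives only) would filter the candidate negatives first:

    import numpy as np

    def downsample_negatives(X, y, ratio=1.0, seed=0):
        """Keep all positives plus ratio * n_positives random negatives."""
        rng = np.random.default_rng(seed)
        pos = np.flatnonzero(y == 1)
        neg = np.flatnonzero(y == 0)
        n_keep = min(len(neg), int(ratio * len(pos)))
        keep_neg = rng.choice(neg, size=n_keep, replace=False)
        idx = np.sort(np.concatenate([pos, keep_neg]))
        return X[idx], y[idx]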

  25. The effect of sample selection (results figure)

  26. Changing the loss ratio in Ripper • Loss ratio parameter: allows one to specify the relative cost of false positives and false negatives • Focus on recall: loss ratio < 1 • Focus on precision: loss ratio > 1
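Ripper's loss ratio has no direct analogue in most modern toolkits; asymmetric class weights are the closest rough equivalent for shifting the precision/recall trade-off. A sketch with scikit-learn's class_weight, which is a stand-in rather than the paper's mechanism:

    from sklearn.tree import DecisionTreeClassifier

    # Costlier errors on the positive class push the model toward recall,
    # comparable in spirit to a recall-leaning loss ratio in Ripper.
    recall_leaning = DecisionTreeClassifier(class_weight={0: 1.0, 1: 5.0},
                                            random_state=0)
    # Costlier errors on the negative class push it toward precision.
    precision_leaning = DecisionTreeClassifier(class_weight={0: 5.0, 1: 1.0},
                                               random_state=0)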

  27. GAs for joint optimization • (Diagram: initial population → population of candidate solutions → evaluation based on fitness → selection → generate new population using crossover and mutation → best individual) • generational GA • evaluations distributed over a cluster of computers using the SGE queuing system • parameters: generations: 30, population size: 10, uniform crossover (rate: 0.9), tournament selection (selection size: 2), discrete mutation (rate: 0.2) and Gaussian mutation (for the k and loss ratio parameters) • (a toy GA sketch follows below)
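A toy generational GA with the slide's settings (population 10, 30 generations, uniform crossover at rate 0.9, tournament selection of size 2, per-gene mutation at rate 0.2). The bit-string genome and the one-max fitness are placeholders; the paper's fitness was a full cross-validated classifier run, distributed over a cluster:

    import random

    random.seed(0)
    GENES, POP, GENS = 8, 10, 30
    CX_RATE, MUT_RATE = 0.9, 0.2

    def fitness(ind):                   # placeholder: count the 1-genes
        return sum(ind)

    def tournament(pop):                # tournament selection, size 2
        a, b = random.sample(pop, 2)
        return max(a, b, key=fitness)

    def crossover(p1, p2):              # uniform crossover
        if random.random() < CX_RATE:
            return [random.choice(pair) for pair in zip(p1, p2)]
        return p1[:]

    def mutate(ind):                    # discrete per-gene mutation
        return [1 - g if random.random() < MUT_RATE else g for g in ind]

    pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
    for _ in range(GENS):
        pop = [mutate(crossover(tournament(pop), tournament(pop)))
               for _ in range(POP)]
    print(max(pop, key=fitness))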

  28. GA individuals • Each individual encodes the algorithm parameters plus one gene per feature • Parameter genes: feature weighting (values 0-4), neighbour weighting (values 0-3), a real-valued k (e.g. 2.0288721872) • Feature genes: values 0, 1, 2 • Example individual: 0 1 0 1 2 0 2 1 0 2 0 0 2 1 0 2 2 0 3 2
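A hedged reading of such an individual as a TiMBL configuration plus feature mask; the gene order and the interpretation of feature value 2 are assumptions, since the slide only gives the value ranges:

    def decode(individual):
        fw, nw, k = individual[:3]          # assumed gene order
        feature_genes = individual[3:]
        return {
            "feature_weighting": int(fw),   # one of 5 weighting schemes (0-4)
            "neighbour_weighting": int(nw), # one of 4 distance weightings (0-3)
            "k": max(1, round(k)),          # k is evolved as a float
            "kept_features": [i for i, g in enumerate(feature_genes) if g != 0],
        }

    example = [3, 2, 2.0288721872, 0, 1, 0, 1, 2, 0, 2, 1, 0, 2]
    print(decode(example))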

  29. GA results (KNACK): default versus GA optimization (results table)

  30. Conclusion • Many factors can affect the success of a classifier: bias, the choice of algorithm parameters and information sources, the sample selection, the interaction between these factors, etc. • Optimization can lead to radically different results, causing much larger classifier-internal variations than classifier-comparing variations • Possible explanation for contradictory findings in ML of natural language literature • Optimization can lead to more reliable results and comparisons • GAs may be one solution

  31. Conclusion (ctd.) • Similar observations for the tasks of word sense disambiguation, the prediction of diminutive suffixes and POS tagging (Hoste et al. 2002, Daelemans et al. 2003, Decadt et al. 2004) • Not restricted to NLP tasks: similar results on UCI data sets (Daelemans et al. 2003)
