Morpho Challenge in Pascal Challenges Workshop, Venice, 12 April 2006. Morfessor in the Morpho Challenge. Mathias Creutz and Krista Lagus, Helsinki University of Technology (HUT), Adaptive Informatics Research Centre
Source: Creutz & Lagus, 2005 tech. rep. Challenge for NLP: too many words • E.g., Finnish words often consist of lengthy sequences of morphemes: stems, suffixes and prefixes: • kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also) • nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our) • tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed) • Huge number of word forms, few examples of each • By splitting we get fewer basic units, each with more examples • Important to know the inner structure of words
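As a toy illustration of this trade-off, the sketch below uses hypothetical hand-made segmentations (hard-coded for illustration, not produced by any model):

```python
from collections import Counter

# Hypothetical hand-segmented Finnish word forms (for illustration only).
segmentations = {
    "kahvinjuojallekin":   ["kahvi", "n", "juo", "ja", "lle", "kin"],
    "kahvin":              ["kahvi", "n"],
    "juojalle":            ["juo", "ja", "lle"],
    "nykyratkaisuistamme": ["nyky", "ratkaisu", "i", "sta", "mme"],
    "ratkaisun":           ["ratkaisu", "n"],
}

morph_counts = Counter(m for morphs in segmentations.values() for m in morphs)

# Every word form above occurs only once, but after splitting the
# recurring units accumulate evidence: 'kahvi' appears twice and the
# genitive 'n' three times across the five word forms.
```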
Solution approaches • Hand-made morphological analyzers (e.g., based on Koskenniemi’s TWOL = two-level morphology) • accurate • labour-intensive construction, commercial, coverage, updating when languages change, addition of new languages • Data-driven methods, preferably minimally supervised (e.g., John Goldsmith’s Linguistica) • adaptive, language-independent • lower accuracy • many existing algorithms assume few morphemes per word, unsuitable for compounds and multiple affixes
Goal: segmentation Morfessor • Learnrepresentations of • the smallest individually meaningful units of language (morphemes) • and their interaction • in an unsupervised and data-driven manner from raw text • making as general and as language-independent assumptions as possible. • Evaluate • against a gold-standard morphological analysis of word forms • integrated in NLP applications (e.g. speech recognition) Hutmegs
Further challenges in morphology learning (Figure: a paradigm of the stems believ, hop, liv, mov, us sharing the suffixes e, ed, es, ing) • Beyond segmentation: allomorphy (“foot – feet, goose – geese”) • Detection of semantic similarity (“sing – sings – singe – singed”) • Learning of paradigms (e.g., John Goldsmith’s Linguistica)
Linguistic evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard) • Hutmegs contains gold standard segmentations obtained by processing the morphological analyses of FinTWOL and CELEX • 1.4 million Finnish word forms (FinTWOL, from Lingsoft Inc.) • Input: ahvenfileerullia (perch filet rolls) • FinTWOL: ahven#filee#rulla N PTV PL • Hutmegs: ahven + filee + rull + i + a • 120 000 English word forms (CELEX, from LDC) • Input: housewives • CELEX: house wife, NNx, P • Hutmegs: house + wive + s • Publicly available, see M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English.
Morfessor models in the Challenge • Morfessor Baseline (M1, 2002) • Program code available since 2002 • Provided as a baseline model for the Morpho Challenge • Improves speech recognition; experiments since 2003 • No model of morphotactics • Morfessor Categories-ML (M2, 2004) • Category-based modeling (HMM) of morphotactics • No speech recognition experiments before this challenge • No public software yet • Morfessor Categories-MAP (M3, 2005) • More elegant mathematically
Avoiding overlearning by controlling model complexity • When using powerful machine learning methods, overlearning is always a problem • Occam’s razor: given two equally accurate theories, choose the one that is less complex • We have used: • Heuristic control affecting the size of the lexicon (M2) • Deriving a cost function that incorporates a measure of model size, using • MDL, Minimum Description Length (M1) • MAP learning, Maximum A Posteriori (M3)
Morfessor Baseline (M1) • Originally called the ”Recursive MDL method” • Optimizes roughly: P(M | corpus) ∝ P(M) · P(corpus | M), where M = (lexicon, grammar); this reduces to P(lexicon) · P(corpus | lexicon) = [∏ over the letters of the lexicon P(letter)] · [∏ over the morph tokens of the corpus P(morph)] • + MDL-based cost function optimizes the size of the model • − Morph contextual information not utilized: • undersegmentation of frequent strings (“forthepurposeof”) • oversegmentation of rare strings (“in + s + an + e”) • syntactic / morphotactic violations (“s + can”)
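As an illustration, here is a minimal two-part cost in the spirit of this objective. It is a simplified sketch: the real Baseline model also codes morph counts and lexicon size, which are omitted here.

```python
import math
from collections import Counter

def baseline_cost(segmented_corpus, letter_probs):
    """Two-part code length in bits: -log2 P(lexicon) - log2 P(corpus | lexicon).

    Simplified sketch of the Morfessor Baseline objective; the actual
    model codes additional quantities (e.g. morph counts).
    """
    tokens = [m for word in segmented_corpus for m in word]
    counts = Counter(tokens)
    total = len(tokens)

    # Corpus cost: each morph token coded with its ML probability.
    corpus_cost = -sum(c * math.log2(c / total) for c in counts.values())

    # Lexicon cost: spell out each distinct morph letter by letter.
    lexicon_cost = -sum(math.log2(letter_probs[ch])
                        for morph in counts for ch in morph)
    return corpus_cost + lexicon_cost
```

With uniform letter probabilities, splitting words that share a substring (e.g. “opening”, “reopen” → open + ing, re + open) yields a lower total cost than storing both words whole, which is exactly the pressure that drives segmentation.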
Search for the optimal model (M1) • Flowchart: randomly shuffle the words; apply recursive binary splitting to each word (e.g., opening, openminded, reopened, conferences → morphs such as open, mind, re, ed); check convergence of the model probability: if not converged, repeat; if yes, done
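The recursive binary splitting step can be sketched as a greedy search. The `morph_cost` function below is a stand-in for the change in total code length, which in the real algorithm depends on the current model state:

```python
def split_recursively(word, morph_cost):
    """Greedy recursive binary splitting (simplified sketch).

    Try every split point of the word; if the cheapest two-part split
    beats keeping the word whole, recurse on both halves.
    """
    best_i, best_cost = 0, morph_cost(word)   # i == 0 means: keep whole
    for i in range(1, len(word)):
        cost = morph_cost(word[:i]) + morph_cost(word[i:])
        if cost < best_cost:
            best_i, best_cost = i, cost
    if best_i == 0:
        return [word]
    return (split_recursively(word[:best_i], morph_cost)
            + split_recursively(word[best_i:], morph_cost))
```

For example, with a toy cost that favors a small set of already-known morphs (an assumption for illustration), “reopening” comes apart step by step: reopening → reopen + ing → re + open + ing.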
Challenge results: comparison to gold standard splitting (F-measures). (Chart: the challenge winners vs. Morfessor Baseline, M1.)
Morfessor Categories-ML & -MAP (M2, M3) • Lexicon / Grammar dualism • Word structure captured by a regular expression: word = ( prefix* stem suffix* )+ • Morph sequences (words) are generated by a Hidden Markov model (HMM), with transition probabilities between categories, e.g. P(STM | PRE), P(SUF | SUF), and emission probabilities of morphs given categories, e.g. P(’over’ | PRE), P(’s’ | SUF) (example: # over simpl ific ation s #) • Lexicon: morpheme properties and contextual properties • Morph segmentation is initialized using M1
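The HMM above can be sketched as follows. The categories match the slide, but the morphs chosen and all probability values are made-up illustrations, not trained model parameters:

```python
import math

def word_log_prob(tagged_morphs, trans, emit):
    """Log-probability of a word as a category-tagged morph sequence
    under an HMM: a product of transition probabilities between
    categories and emission probabilities of morphs given categories,
    with '#' marking the word boundary."""
    logp, prev = 0.0, "#"
    for morph, cat in tagged_morphs:
        logp += math.log(trans[(prev, cat)]) + math.log(emit[(morph, cat)])
        prev = cat
    return logp + math.log(trans[(prev, "#")])   # back to the boundary

# Toy, hand-picked probabilities (illustrative assumptions):
trans = {("#", "PRE"): 0.5, ("PRE", "STM"): 1.0,
         ("STM", "SUF"): 0.5, ("SUF", "#"): 0.5}
emit = {("over", "PRE"): 0.2, ("simpl", "STM"): 0.1, ("s", "SUF"): 0.4}

lp = word_log_prob([("over", "PRE"), ("simpl", "STM"), ("s", "SUF")],
                   trans, emit)
```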
Morph lexicon (M2, M3) • Each morph is stored with its form and its distributional features: string, length, frequency, left perplexity, right perplexity (table in the original slide, with example entries for the morphs ’over’, ’s’, ’simpl’)
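Left and right perplexity can be computed from a segmented corpus roughly as follows (a simplified sketch; the exact counting and normalization in the Categories models may differ):

```python
import math
from collections import Counter, defaultdict

def neighbor_perplexities(segmented_words):
    """Left/right perplexity of each morph: the perplexity of the
    distribution of neighbors (morphs or '#' word boundaries) seen
    immediately to its left/right in the segmented corpus."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for morphs in segmented_words:
        padded = ["#"] + morphs + ["#"]
        for prev, cur, nxt in zip(padded, padded[1:], padded[2:]):
            left[cur][prev] += 1
            right[cur][nxt] += 1

    def ppl(counter):
        total = sum(counter.values())
        entropy = -sum(c / total * math.log2(c / total)
                       for c in counter.values())
        return 2 ** entropy

    return {m: (ppl(left[m]), ppl(right[m])) for m in left}
```

A morph like ’s’ that follows many different stems gets a high left perplexity (suffix-like behaviour), while a morph that precedes many different stems gets a high right perplexity (prefix-like behaviour).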
How morph distributional features affect morph categories • Prior probability distributions for category membership of a morph, e.g., P(PRE | ’over’) • Assume asymmetries between the categories: prefix-like morphs have varied right contexts (high right perplexity), suffix-like morphs have varied left contexts (high left perplexity), and stem-like morphs tend to be longer
How distributional features affect categories (2) • Distribute the remaining probability mass proportionally among the proper categories • There is an additional non-morpheme category for cases where none of the proper classes is likely
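A sketch of turning the distributional features into a category prior. The sigmoid form and all thresholds are assumptions for illustration, not the published parameter values of Categories-ML:

```python
import math

def category_priors(length, left_ppl, right_ppl,
                    ppl_thresh=10.0, len_thresh=3.0):
    """Illustrative P(category | morph) over PRE, STM, SUF and a
    non-morpheme class NON. Graded sigmoid measures and thresholds
    are hypothetical; the real model also uses morph frequency."""
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))

    prefix_like = sig(right_ppl - ppl_thresh)   # varied right contexts
    suffix_like = sig(left_ppl - ppl_thresh)    # varied left contexts
    stem_like   = sig(length - len_thresh)      # stems tend to be longer

    # Non-morpheme: likely when none of the proper classes fits.
    non = (1 - prefix_like) * (1 - suffix_like) * (1 - stem_like)

    # Distribute the remaining mass proportionally among PRE/STM/SUF.
    rest = prefix_like + suffix_like + stem_like
    scale = (1 - non) / rest if rest > 0 else 0.0
    return {"PRE": prefix_like * scale,
            "STM": stem_like * scale,
            "SUF": suffix_like * scale,
            "NON": non}
```

For a short, frequent morph with a varied left context (such as ’s’), this prior puts most of its mass on SUF, while a long morph with low context perplexities comes out stem-like.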
MAP vs. ML optimization • Morfessor Categories-ML (M2): arg max over the Lexicon of P(Corpus | Lexicon); lexicon size controlled heuristically • Morfessor Categories-MAP (M3): arg max over the Lexicon of P(Lexicon) · P(Corpus | Lexicon) • (Figure: the morph lexicon with distributional features for ’over’, ’s’, ’simpl’, and the HMM transition and emission probabilities, e.g. P(STM | PRE), P(’over’ | PRE), for # over simpl ific ation s #)
Hierarchical structures in lexicon (M3) • Maintain the hierarchy of splittings for each word, e.g. straightforwardness = straightforward + ness (suffix); straightforward = straight (stem) + forward; forward = for + ward (non-morphemes) • Ability to code efficiently also common substrings which are not morphemes (e.g. syllables in foreign names) • Bracketed output
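A sketch of how such a hierarchical lexicon entry can be expanded into a reported segmentation. The node encoding and the rule of collapsing subtrees whose parts are non-morphemes are simplifications for illustration, not the exact Categories-MAP procedure:

```python
def segmentation(node):
    """Expand a hierarchical entry into a flat morph sequence.

    A node is either ('leaf', string, category) or ('split', left, right).
    Subtrees whose parts are non-morphemes ('NON') are kept unexpanded,
    so e.g. 'forward' is reported whole instead of 'for' + 'ward'.
    """
    if node[0] == "leaf":
        return [(node[1], node[2])]
    parts = segmentation(node[1]) + segmentation(node[2])
    if any(cat == "NON" for _, cat in parts):
        # Collapse: report this node as a single stem instead.
        return [("".join(s for s, _ in parts), "STM")]
    return parts

# The 'straightforwardness' hierarchy from the slide:
word = ("split",
        ("split", ("leaf", "straight", "STM"),
                  ("split", ("leaf", "for", "NON"), ("leaf", "ward", "NON"))),
        ("leaf", "ness", "SUF"))
```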
Challenge results: comparison to gold standard splitting (F-measures). (Chart: the winner and the committees vs. the Morfessor Categories models, M2 and M3, and the Morfessor Baseline, M1.)
Speech recognition results: Finnish. (Chart: the Morfessor models M1, M2, M3 vs. committees and competitors.)
Speech recognition results: Turkish. (Chart: the Morfessor models M1, M2, M3 vs. committees.)
A reason for differences? (Source: Creutz & Lagus, 2005 tech. rep.)
Discussion • This was the first time our Category methods were evaluated in speech recognition, with nice results! • The comparison between the Morfessor models and the other challenge participants is not entirely fair, however • Possibilities to extend the M3 model • add word contextual features for “meaning” • more fine-grained categories • beyond concatenative phenomena (e.g., goose – geese) • allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)
Questions for the Morpho Challenge • How language-general in fact are the methods? • Norwegian, French, German, Arabic, ... • Did we, or can we succeed in inducing ”basic units of meaning”? • Evaluation in other NLP problems: MT, IR, QA, TE, ... • Application of morphs to non-NLP problems? Machine vision, image analysis, video analysis ... • Will there be another Morpho Challenge?
See you in another challenge! • best wishes, Krista (and Sade)
My notes • briefly describe our own methods • reflect on how the differences between our methods relate to their properties • be humble; point out why the comparison is unfair (prior experience; our own data; and speech recognition is a familiar application to us, so that group’s earlier research may have indirectly influenced our method development) + discussion of the differences between our methods? + example segmentations from all of our methods? + discussion material from Mikko’s paper and from our paper + the first results figure is currently confusing + its colors differ from the other figures: change the colors and duplicate the slide, highlight the winner more clearly
Discussion • Possibility to extend the model • rudimentary features used for “meaning” • more fine-grained categories • beyond concatenative phenomena (e.g., goose – geese) • allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful) • Already now useful in applications • automatic speech recognition (Finnish, Turkish)
Overview of methods • Machine learning methodology • Features used: information on • morph contexts (Bernard, Morfessor) • word contexts (Bordag)
Morpho project page (Krista Lagus): http://www.cis.hut.fi/projects/morpho/