Explore multilingualism in TEL collections through bilingual dictionary adaptation & experiments. Discover approaches for merging, translation, feedback, and dictionary adaptation. Gain insights on multilingual retrieval and adaptation impacts.
XRCE Participation in TEL
Jean-Michel Renders, Stephane Clinchant
Xerox Research Center Europe, 6 chemin de Maupertuis, 38240 Meylan, France
Outline
• Multilingualism
• Bilingual Dictionary Adaptation
• Some Experiments
Multilingualism and TEL Collections
[Figure: a multilingual collection holding text in German, English, and French, for example a document in French and a document in German AND English]
Multilingualism
• Multilingual documents:
  • Languages differ between documents and within a document
  • Relevant documents could be in any language!
• The monolingual task is not monolingual:
  • Monolingual means: query language = main language of the collection
  • Bilingual means: query language != main language of the collection
  • Queries need to be translated even in the « monolingual » case
• A possible approach to multilingualism:
  • Index each language separately
  • Late fusion of results
Our approach to Multilingualism for the TEL corpus
• Merge all the languages into a single meta-language:
  • Words = (French words, English words, German words)
  • But « Gauguin » is not the same word in French as in German (different inverted lists)
• Build a single index per collection (1 for BNF, 1 for BL, 1 for ONB)
• Needs a global multilingual translation of queries (!= several cross-lingual translations)
• Requires merging bilingual dictionaries:
  • Late fusion of results → early fusion of dictionaries
  • Prior weights for merging resources
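A minimal sketch of the "early fusion of dictionaries" idea, in Python. The slides do not say how the prior weights are applied, so this assumes a simple weighted sum of the resources followed by renormalisation; the function name, the resource names, and the toy probabilities are all hypothetical.

```python
from collections import defaultdict

def merge_dictionaries(dictionaries, prior_weights):
    """Merge probabilistic dictionaries {source: {target: P(target|source)}}.

    Assumption (not from the slides): each resource contributes its
    probabilities scaled by a prior weight, then each source word's
    distribution is renormalised to sum to 1.
    """
    merged = defaultdict(lambda: defaultdict(float))
    for name, dictionary in dictionaries.items():
        weight = prior_weights[name]
        for source, translations in dictionary.items():
            for target, prob in translations.items():
                merged[source][target] += weight * prob
    result = {}
    for source, translations in merged.items():
        total = sum(translations.values())
        result[source] = {t: p / total for t, p in translations.items()}
    return result

# Toy resources: "art" is both a French and an English word, so two
# resources propose translations for it.
fr_en = {"art": {"art": 0.8, "skill": 0.2}}
de_en = {"kunst": {"art": 1.0}}
thesaurus = {"art": {"art": 0.6, "painting": 0.4}}

merged = merge_dictionaries(
    {"fr_en": fr_en, "de_en": de_en, "thesaurus": thesaurus},
    {"fr_en": 1.0, "de_en": 1.0, "thesaurus": 0.5},
)
```

The prior weights only matter for source words covered by more than one resource, which is where choosing them becomes the hard part.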
Our strategy
[Diagram: three resources (thesaurus English to English, dictionary French to English, dictionary German to English) give P(wt|ws). Query → first translation of query → retrieve from the collection index → adapted dictionary P'(wt|ws,q) → new translation of query → retrieve and PRF]
Dictionary-based CLIR
• Translate the query:
  • using a probabilistic bilingual dictionary P(wt | ws)
  • β controls the amount of translation
• And a monolingual language model (cross-entropy ranking)
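A rough Python sketch of dictionary-based query translation followed by cross-entropy ranking. The slide's formulas did not survive, so two things here are assumptions rather than the authors' definitions: β is treated as an interpolation weight between keeping the original term and spreading mass over its translations, and documents are smoothed with Jelinek-Mercer smoothing (parameter `lam`). All names and numbers are illustrative.

```python
import math

def translate_query(query_terms, dictionary, beta):
    """Build a target-side query model P(w|q).

    Assumption: (1 - beta) of each term's mass stays on the original
    term and beta is distributed over its dictionary translations.
    """
    model = {}
    p_term = 1.0 / len(query_terms)  # uniform weight per query term
    for ws in query_terms:
        model[ws] = model.get(ws, 0.0) + (1.0 - beta) * p_term
        # A term missing from the dictionary translates to itself.
        translations = dictionary.get(ws, {ws: 1.0})
        for wt, p in translations.items():
            model[wt] = model.get(wt, 0.0) + beta * p_term * p
    return model

def cross_entropy_score(query_model, doc_counts, corpus_model, lam=0.5):
    """Score = sum_w P(w|q) * log P(w|d), with a smoothed document model."""
    doc_len = sum(doc_counts.values())
    score = 0.0
    for w, p_q in query_model.items():
        p_doc = doc_counts.get(w, 0) / doc_len if doc_len else 0.0
        p_smoothed = (1.0 - lam) * p_doc + lam * corpus_model.get(w, 1e-9)
        score += p_q * math.log(p_smoothed)
    return score

dictionary = {"peinture": {"painting": 0.7, "paint": 0.3}}
corpus_model = {"painting": 0.01, "paint": 0.01, "peinture": 0.001, "history": 0.02}

query_model = translate_query(["peinture"], dictionary, beta=0.8)
score_a = cross_entropy_score(query_model, {"painting": 3, "history": 1}, corpus_model)
score_b = cross_entropy_score(query_model, {"history": 4}, corpus_model)
```

Under this reading, a monolingual run would want a smaller β than a bilingual one, since the original query terms already match the collection.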
Dictionary Adaptation
• A similar idea is found in Hiemstra (CLEF 2000)
• We introduced our version for the Domain-Specific Track 07
• Main idea:
  • Retrieval is a disambiguating process
  • Relevant documents contain the context of the query terms' translations: they are implicitly coherent
  • SO DO PSEUDO-RELEVANT DOCUMENTS!
Monolingual Pseudo Feedback with LM
• MIXTURE MODEL (C. Zhai 2001)
• For all documents in the pseudo feedback set F:
  • For i = 1 to document length:
    • Choose
      • the distribution of the relevant topic model (θ)
      • or the distribution of the corpus language model P(w|C)
    • Sample a word from that distribution
• Selected words = the most probable words in θ
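The generative story above can be fitted with EM. The following Python sketch assumes a fixed mixing weight `lam` for the corpus model and a tiny floor probability for words missing from the corpus model; both choices, like the toy data, are illustrative and not taken from the slides.

```python
def mixture_feedback(feedback_docs, corpus_model, lam=0.5, iterations=30):
    """Estimate the topic model theta from a pseudo feedback set with EM.

    Each word occurrence is assumed to come from theta with probability
    (1 - lam) or from the corpus model P(w|C) with probability lam.
    """
    counts = {}
    for doc in feedback_docs:
        for w in doc:
            counts[w] = counts.get(w, 0) + 1
    total = sum(counts.values())
    # Start from the maximum-likelihood estimate on the feedback set.
    theta = {w: c / total for w, c in counts.items()}
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from theta.
        resp = {}
        for w in counts:
            topic = (1.0 - lam) * theta[w]
            background = lam * corpus_model.get(w, 1e-9)
            resp[w] = topic / (topic + background)
        # M-step: re-estimate theta from the expected topic counts.
        weighted = {w: counts[w] * resp[w] for w in counts}
        norm = sum(weighted.values())
        theta = {w: v / norm for w, v in weighted.items()}
    return theta

docs = [
    ["gauguin", "painting", "the", "the"],
    ["gauguin", "the", "tahiti", "the"],
]
corpus_model = {"the": 0.5, "gauguin": 0.001, "painting": 0.005, "tahiti": 0.001}
theta = mixture_feedback(docs, corpus_model)
expansion_terms = sorted(theta, key=theta.get, reverse=True)[:2]
```

Words that are frequent everywhere, such as "the", are explained by the corpus model, so θ concentrates on the terms specific to the feedback documents.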
Bilingual Dictionary Adaptation
• For all documents in the pseudo feedback set F:
  • For i = 1 to document length:
    • Choose
      • a query term qs and its associated distribution from a dictionary, P(wt | qs) = θst
      • or the distribution of the corpus language model P(w|C)
    • Sample a word from that distribution
How to estimate θst
• θst is estimated by EM, initialized with the initial dictionary
• New translation, with a filtering effect:
  • words not in the top |F| documents are filtered out (|F| = 50, 100)
  • weights are re-estimated
• Same for the monolingual case and the thesaurus
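A minimal Python sketch of this EM estimation, under assumptions the slides leave open: query terms are chosen uniformly, the corpus mixing weight `lam` is fixed, and a term with no translation in the feedback set keeps its previous distribution. Function and variable names are hypothetical.

```python
def adapt_dictionary(query_terms, dictionary, feedback_docs, corpus_model,
                     lam=0.5, iterations=20):
    """Re-estimate P(wt|qs) for each query term on a pseudo feedback set."""
    counts = {}
    for doc in feedback_docs:
        for w in doc:
            counts[w] = counts.get(w, 0) + 1
    p_term = 1.0 / len(query_terms)  # assumption: uniform choice of query term
    # Initialise with the original dictionary entries.
    theta = {s: dict(dictionary.get(s, {})) for s in query_terms}
    for _ in range(iterations):
        expected = {s: {} for s in query_terms}
        for w, c in counts.items():
            background = lam * corpus_model.get(w, 1e-9)
            # E-step: share of w's occurrences attributed to each query term.
            mass = {s: (1.0 - lam) * p_term * theta[s].get(w, 0.0)
                    for s in query_terms}
            denom = background + sum(mass.values())
            for s, m in mass.items():
                if m > 0.0:
                    expected[s][w] = c * m / denom
        # M-step: renormalise; translations absent from F drop out here.
        for s in query_terms:
            norm = sum(expected[s].values())
            if norm > 0.0:
                theta[s] = {w: v / norm for w, v in expected[s].items()}
    return theta

dictionary = {"bank": {"banque": 0.5, "rive": 0.3, "berge": 0.2}}
feedback_docs = [["banque", "crédit", "banque"], ["banque", "rive", "taux"]]
corpus_model = {"banque": 0.01, "rive": 0.01, "crédit": 0.01, "taux": 0.01}

adapted = adapt_dictionary(["bank"], dictionary, feedback_docs, corpus_model)
```

In this toy run "berge" never occurs in the feedback set, so it is filtered out, while "banque" gains weight because the feedback documents are about finance.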
Our official runs and our mistakes …
• Lost relevant documents at indexing:
  • Kept only English, French, German: 240 (BL), 108 (BNF), 69 (ONB)
• Dictionaries were biased toward the source language, not toward the target collection
• Bad translation of queries:
  • β parameter (amount of translation) identical for bilingual and monolingual runs
Some Pure Bilingual Experiments
• Resource: JRC corpus, used to extract dictionaries
• 3% average improvement
Conclusion
• Multilingualism: theory vs. practice
  • In theory, it seems a good idea
  • In practice, most of the best runs are “pure” monolingual or “pure” bilingual
• Dictionary adaptation:
  • Partial solution to the problem of setting prior weights
  • Got some improvements on sparse data this year
• Partly Financed by