XRCE Participation to TEL

Jean-Michel Renders, Stephane Clinchant
Xerox Research Center Europe
6 chemin de Maupertuis, 38240 Meylan, France

Outline
• Multilingualism
• Bilingual Dictionary Adaptation
• Some Experiments

Multilingualism and TEL Collections
[Figure: a multilingual collection mixing text in German, English and French. Some documents are in one language (e.g. a document in French), others combine several (e.g. a document in German and English).]

Multilingualism
• Multilingual documents:
 - Languages differ between documents and within a single document
 - Relevant documents can be in any language!
• The monolingual task is not really monolingual:
 - Monolingual means: query language = main language of the collection
 - Bilingual means: query language != main language of the collection
 - Queries need to be translated even in the « monolingual » case
• A possible approach to multilingualism:
 - Index each language separately
 - Late fusion of results

Our approach to multilingualism for the TEL corpus
• Merge all the languages into a single meta-language:
 - Words = (French words, English words, German words)
 - But « Gauguin » in French is not the same word as « Gauguin » in German (different inverted lists)
• Build a single index per collection (1 for BNF, 1 for BL, 1 for ONB)
• Needs a global multilingual translation of queries (!= several cross-lingual translations)
• Requires merging bilingual dictionaries:
 - Late fusion of results → early fusion of dictionaries
 - Prior weights for merging resources
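One way to picture the early fusion of dictionaries is as a weighted mixture of translation tables over a language-tagged vocabulary. The sketch below is only an illustration under assumptions that are not in the slides: the function name merge_dictionaries, the "fr:"/"de:"/"en:" tagging scheme, the toy entries and the prior weights are all hypothetical.

```python
from collections import defaultdict

def merge_dictionaries(resources, prior_weights):
    """Mix several translation tables into one table over the meta-language.

    resources:     {resource_name: {source_word: {target_word: prob}}}
    prior_weights: {resource_name: non-negative weight} (assumed, hand-set)
    """
    merged = defaultdict(lambda: defaultdict(float))
    for name, table in resources.items():
        for ws, translations in table.items():
            for wt, p in translations.items():
                merged[ws][wt] += prior_weights[name] * p
    # Renormalise per source word: not every resource covers every word.
    result = {}
    for ws, translations in merged.items():
        total = sum(translations.values())
        if total > 0:
            result[ws] = {wt: p / total for wt, p in translations.items()}
    return result

# Toy resources; target words carry a language tag so that the same spelling
# in two languages stays two distinct index terms.
resources = {
    "thesaurus_en_en": {"painter": {"en:painter": 0.7, "en:artist": 0.3}},
    "dict_en_fr": {"painter": {"fr:peintre": 1.0}},
    "dict_en_de": {"painter": {"de:maler": 1.0}},
}
prior_weights = {"thesaurus_en_en": 0.5, "dict_en_fr": 0.25, "dict_en_de": 0.25}
merged = merge_dictionaries(resources, prior_weights)
```

The prior weights decide how much probability mass each language receives in the translated query, which is why setting them matters for this approach.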

Our strategy
[Figure: pipeline diagram. Labels: Query; Thesaurus English to English, Dictionary French to English, Dictionary German to English, giving P(wt|ws); First Translation of Query; Retrieve (Collection Index); Adapted Dictionary P’(wt|ws,q); New Translation of Query; Retrieve and PRF.]

Dictionary-based CLIR
• Translate the query using a probabilistic bilingual dictionary P(wt | ws)
• β controls the amount of translation
• Then retrieve with a monolingual language model (cross entropy)
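The two steps above can be sketched as follows. This is a minimal illustration, not the authors' formulas, which are not preserved in the transcript: the function names, the linear (Jelinek-Mercer) smoothing with weight lam, and the toy data are assumptions, and the β parameter is deliberately left out because its exact role is not given here.

```python
import math
from collections import Counter

def translate_query(query_terms, dictionary):
    """P(wt | q) = sum over ws of P(wt | ws) * P(ws | q), with P(ws | q) from counts."""
    counts = Counter(query_terms)
    n = sum(counts.values())
    translated = Counter()
    for ws, c in counts.items():
        for wt, p in dictionary.get(ws, {}).items():
            translated[wt] += (c / n) * p
    return dict(translated)

def cross_entropy_score(query_model, doc_terms, corpus_model, lam=0.5):
    """Negative cross entropy between the query model and a smoothed document model.

    Higher is better. lam is an assumed smoothing weight for the corpus model.
    """
    counts = Counter(doc_terms)
    length = len(doc_terms)
    score = 0.0
    for w, pq in query_model.items():
        p_doc = counts[w] / length if length else 0.0
        p = (1 - lam) * p_doc + lam * corpus_model.get(w, 1e-9)
        score += pq * math.log(p)
    return score

# Toy English-to-French dictionary and two tiny target-language documents.
dictionary = {"bank": {"banque": 0.6, "rive": 0.4}, "loan": {"emprunt": 1.0}}
corpus_model = {"banque": 0.01, "rive": 0.01, "emprunt": 0.01}
query_model = translate_query(["bank", "loan"], dictionary)

score_finance = cross_entropy_score(query_model, ["banque", "emprunt", "taux"], corpus_model)
score_river = cross_entropy_score(query_model, ["rive", "fleuve", "eau"], corpus_model)
```

With this toy data the finance document scores higher than the river document, because it matches translations of both query terms.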

Dictionary Adaptation
• A similar idea is found in Hiemstra (CLEF 2000)
• We introduced our version for the Domain Specific Track 07
• Main idea:
 - Retrieval is a disambiguating process
 - Relevant documents contain the context of the query term translations: they are implicitly coherent
 - So do pseudo-relevant documents!

Monolingual Pseudo Feedback with LM
• Mixture model (C. Zhai 2001)
• For all documents in the pseudo feedback set F:
 - For i = 1 to document length:
   - Choose either the distribution of the relevant topic model (θ) or the distribution of the corpus language model P(w|C)
   - Sample a word from that distribution
• Selected words = the most probable words in θ
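The generative story above is usually inverted with EM: given the feedback documents and a fixed corpus model, estimate θ. The sketch below assumes a fixed mixing weight lam for the corpus model and a fixed number of iterations; both values, the function name and the toy data are illustrative choices, not taken from the slides.

```python
from collections import Counter

def estimate_topic_model(feedback_docs, corpus_model, lam=0.5, iterations=30):
    """EM estimate of the topic model theta from pseudo feedback documents.

    Each word occurrence is assumed to come from the corpus model with
    probability lam and from theta with probability 1 - lam.
    """
    counts = Counter(w for doc in feedback_docs for w in doc)
    vocab = list(counts)
    theta = {w: 1.0 / len(vocab) for w in vocab}  # uniform start
    for _ in range(iterations):
        # E-step: probability that an occurrence of w was drawn from theta.
        resp = {}
        for w in vocab:
            topic = (1 - lam) * theta[w]
            background = lam * corpus_model.get(w, 1e-9)
            resp[w] = topic / (topic + background)
        # M-step: renormalise the expected topic counts.
        total = sum(counts[w] * resp[w] for w in vocab)
        theta = {w: counts[w] * resp[w] / total for w in vocab}
    return theta

feedback_docs = [
    ["the", "gauguin", "painting", "the"],
    ["the", "gauguin", "tahiti"],
]
corpus_model = {"the": 0.5, "gauguin": 0.001, "painting": 0.005, "tahiti": 0.001}
theta = estimate_topic_model(feedback_docs, corpus_model)
top_words = sorted(theta, key=theta.get, reverse=True)
```

Words that are frequent in the whole corpus, such as "the" here, are mostly explained by the corpus model, so topical words rise to the top of θ even when they are less frequent in F.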

Bilingual Dictionary Adaptation
• For all documents in the pseudo feedback set F:
 - For i = 1 to document length:
   - Choose either a query term qs, with its associated dictionary distribution θst = P(wt | qs), or the distribution of the corpus language model P(w|C)
   - Sample a word from that distribution

How to estimate st • The estimation of st is done by EM initializing it by the intial dictionary • New Translation Filtering Effect: - words not in the top F are filtered • |F| =50,100 - weights are reestimated • Idem for monolingual and thesaurus

Our official runs and our mistakes …
• Lost relevant documents at indexing:
 - Kept only English, French and German: 240 (BL), 108 (BNF), 69 (ONB)
• Dictionaries were biased toward the source language, not toward the target collection
• Bad translation of queries:
 - β parameter (amount of translation) identical for bilingual and monolingual runs

Some Pure Bilingual Experiments
• Resource: JRC corpus, used to extract dictionaries
• 3% average improvement

Post Analysis of Multilingualism

Conclusion
• Multilingualism: theory vs. practice
 - In theory, it seems a good idea
 - In practice, most of the best runs are “pure” monolingual or “pure” bilingual
• Dictionary adaptation:
 - Partial solution to the problem of setting prior weights
 - Gave some improvements on sparse data this year
• Partly financed by

Thank you for your attention!
