Using OWA Fuzzy Operator to Merge Retrieval System Results
Hadi Amiri, Abolfazl AleAhmad, Caro Lucas, Masoud Rahgozar
School of Electrical and Computer Engineering, University of Tehran
Farhad Oroumchian
University of Wollongong in Dubai
Outline • The Persian Language • Used Methods • Vector Space Model • Language Modeling • OWA Operator • The test collections • Experiment results • Conclusion University of Tehran - Database Research Group
The Persian Language • It is spoken in countries such as Iran, Tajikistan and Afghanistan • It has an Arabic-like script with 32 characters, written continuously from right to left • Its morphological analyzers need to deal with many word forms that are not actually Farsi • Example • The word “عادت” has two plural forms in Farsi: • Farsi form: “عادت ها” • Arabic form: “عادات”
Vector Space Model • List of weights that produced the best results • We used the Lnu.ltu and Lnc.btc weighting schemes
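As an illustration of the Lnu.ltu scheme named on this slide, here is a minimal sketch that assumes the usual SMART reading of the letters: documents use a log-averaged term frequency with pivoted unique normalization (Lnu), and queries use a log term frequency times idf with the same normalization (ltu). The function names, the `pivot` argument and the toy inputs are hypothetical and not taken from the slides. The slope default of 0.25 follows the value reported in the experiments.

```python
import math
from collections import Counter


def lnu_weights(doc_terms, pivot, slope=0.25):
    """Document weights under an assumed Lnu definition.

    L: (1 + log tf) / (1 + log(average tf in the document))
    n: no idf factor
    u: pivoted unique normalization, (1 - slope) * pivot + slope * (#unique terms)
    """
    tf = Counter(doc_terms)
    if not tf:
        return {}
    avg_tf = sum(tf.values()) / len(tf)
    norm = (1.0 - slope) * pivot + slope * len(tf)
    return {
        term: ((1 + math.log(freq)) / (1 + math.log(avg_tf))) / norm
        for term, freq in tf.items()
    }


def ltu_weights(query_terms, doc_freq, num_docs, pivot, slope=0.25):
    """Query weights under an assumed ltu definition: (1 + log tf) * log(N / df), pivoted."""
    tf = Counter(query_terms)
    norm = (1.0 - slope) * pivot + slope * len(tf)
    weights = {}
    for term, freq in tf.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # a term that occurs in no document cannot contribute
        weights[term] = (1 + math.log(freq)) * math.log(num_docs / df) / norm
    return weights


def vsm_score(doc_w, query_w):
    """Inner product of the query and document weight vectors."""
    return sum(w * doc_w.get(term, 0.0) for term, w in query_w.items())
```

In this reading, `pivot` would typically be set to the average number of unique terms per document in the collection, which is again an assumption rather than something the slides state.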
Language Modeling • Four ways to specify the rank of document d against query q • P(D=d) is treated as the prior probability of relevance of document d to query q • Lambda (λ) is a smoothing parameter and is the same for every query term • If no previous relevance information is available for a query, each query term is considered equally important
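The ingredients listed above (a document prior P(D=d) and a single smoothing parameter λ shared by all query terms) fit a linearly interpolated query-likelihood model. The sketch below shows one such form and is not a reconstruction of any of the four ranking variants used in the talk. The function name, argument names and default values are hypothetical.

```python
import math
from collections import Counter


def lm_log_score(query_terms, doc_terms, collection_tf, collection_len,
                 lam=0.5, prior=1.0):
    """Log of  P(d) * prod_t [ (1 - lam) * P(t|C) + lam * P(t|d) ].

    The same lam is applied to every query term, i.e. all terms are treated
    as equally important. prior must be positive.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_score = math.log(prior)
    for term in query_terms:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_col = collection_tf.get(term, 0) / collection_len
        p = (1.0 - lam) * p_col + lam * p_doc
        if p == 0.0:
            # The term is unseen in both the document and the collection.
            return float("-inf")
        log_score += math.log(p)
    return log_score
```

Working in log space avoids numeric underflow when queries contain many terms; documents are then ranked by descending log score.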
OWA Operator - Cont. • We used the OWA operator as the merge operator. • The OWA weight of each document d is computed from the scores x1, x2, …, xn, where each score xi is assigned by the ith search engine to document d. If d is not present in the ith list, then xi = 0.
OWA Operator - Weighting Methods • Quantifier-Based Weighting • Degree-of-Importance-Based Weighting
Quantifier-Based Weighting • The linguistic quantifiers All, Most, Few, and At-Least-One serve as the weighting schemes. • All: considers only documents appearing in all retrieval engines’ lists. This quantifier is suitable when the user is looking for a precise answer. • Most: a fuzzy majority operator that treats retrieval by most of the engines as sufficient for inclusion in the fused list. • Few: a weaker weighting scheme in which it is enough for a document to be retrieved by a few retrieval engines. • At-Least-One: the weakest weighting scheme, in which it is enough for a document to appear in only one retrieval engine’s list to be included in the fused list.
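The two extreme quantifiers correspond to well-known special cases of OWA: All behaves like the minimum of the scores and At-Least-One like the maximum. The sketch below assumes the scores are sorted in ascending order before weighting. The `at_least_k` vector is a hypothetical crisp stand-in for a Most/Few quantifier with parameter k, not the weight vectors used in the talk.

```python
def owa(scores, weights):
    """Weighted sum of the scores after sorting them in ascending order."""
    ordered = sorted(scores)  # ordered[0] is the smallest score
    if len(ordered) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * b for w, b in zip(weights, ordered))


def quantifier_weights(name, n, k=None):
    """Return an OWA weight vector of length n for a crisp quantifier."""
    w = [0.0] * n
    if name == "all":
        w[0] = 1.0        # all weight on the smallest score: OWA = min
    elif name == "at_least_one":
        w[-1] = 1.0       # all weight on the largest score: OWA = max
    elif name == "at_least_k":
        # Hypothetical: weight on the k-th largest score, which is non-zero
        # only if at least k engines returned the document.
        if k is None or not 1 <= k <= n:
            raise ValueError("k must be between 1 and n")
        w[n - k] = 1.0
    else:
        raise ValueError("unknown quantifier: " + name)
    return w
```

With three engines and scores `[0.0, 0.4, 0.9]` (the document is missing from one list), All yields 0.0, so the document is effectively excluded, while At-Least-One yields 0.9.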
Degree-of-Importance-Based Weighting • As the second weighting scheme, we use the position of the documents in the retrieved lists. • The weight of each document d in the list Li,q is defined in terms of Ni, the number of elements in the ith list Li,q, and POSi, the position of document d in Li,q.
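To make the roles of Ni and POSi concrete, the sketch below assumes a simple linear rank-based weight, (Ni - POSi + 1) / Ni, so that the top document of a list gets weight 1 and the last gets 1/Ni. This particular formula is an assumption chosen for illustration; the talk's own definition may differ. The function names are hypothetical.

```python
def position_weight(pos, n):
    """Assumed linear weight for a document at 1-based position pos in a list of n."""
    if not 1 <= pos <= n:
        raise ValueError("pos must be between 1 and n")
    return (n - pos + 1) / n


def list_position_weights(ranked_doc_ids):
    """Map every document in one ranked list to its assumed position weight."""
    n = len(ranked_doc_ids)
    return {
        doc_id: position_weight(pos, n)
        for pos, doc_id in enumerate(ranked_doc_ids, start=1)
    }
```

Any monotonically decreasing function of the position would serve the same purpose of rewarding documents that an engine ranks near the top.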
Test Collections • Qvanin Collection • Documents: Iranian law collection • 177,089 passages • 41 queries and relevance judgments • Hamshahri Collection • Documents: 600+ MB of news from the Hamshahri newspaper • 160,000+ news articles • 60 queries and relevance judgments • BijanKhan Tagged Collection • Documents: 100+ MB from different sources • A tag set of 41 tags • 2,590,000+ tagged words
Hamshahri Collection • We used HAMSHAHRI, a test collection for Persian text prepared and distributed by the DBRG (IR team) of the University of Tehran • The 3rd version: • contains more than 160,000 distinct textual news articles in Farsi • 60 queries and relevance judgments for the top 20 relevant documents for each query
Experimental results • The precision of the six retrieval engines at different document cut-offs. • The LM4 method and the Lnu.ltu method with slope 0.25 perform better than the other systems.
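Precision at a document cut-off k is the fraction of the top k retrieved documents that are relevant. A minimal sketch of that measure follows; the function and argument names are hypothetical.

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k):
    """Fraction of the first k retrieved documents that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalizes lists shorter than the cut-off.
    return hits / k
```

Averaging this value over all queries at each cut-off (for example 5, 10, 15, 20) gives one precision curve per retrieval engine.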
Quantifier-Based OWA Weighting • The parameter n (in the Most and Few quantifiers) indicates the minimum number of retrieval lists sufficient for inclusion in the merge process
Experimental results • The precision of the fusion methods at different document cut-offs. • The best are Most3 and Most4.
Experimental results • Comparing the LM4 and Lnu.ltu methods with the best OWA results
Statistical significance tests • Wilcoxon Signed Rank • T-Test
Statistical significance tests • Based on the T-Test, both the Most3 and Most4 methods are significantly better than the LM4 method, which confirms the result of the Wilcoxon Signed Rank test. • However, with the T-Test we cannot confirm the significance of the Most3 and Most4 methods over the Lnu.ltu method with slope 0.25.
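A paired t-test of this kind compares two systems on the same set of queries. The sketch below computes only the t statistic and assumes that the paired values are per-query effectiveness scores; the slides do not say which measure was used. The function name is hypothetical.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean(d) / (stdev(d) / sqrt(n))."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("t is undefined when all differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

The statistic has n - 1 degrees of freedom; comparing it against the t distribution for the chosen significance level decides whether the difference between the two systems is significant.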
Conclusion • We used two weighting methods, namely quantifier-based and degree-of-importance-based weighting. • The experimental results show that the best OWA operators, Most3 and Most4 (quantifier-based OWA operators), only marginally improve over the best retrieval method on Persian text, the LM4 method. • However, they seem to produce better rankings, since they push the relevant documents to higher ranks. • The significance tests we conducted seem to confirm that Most3 and Most4 are significantly better than all other methods except Lnu.ltu with slope 0.25. • The superiority over Lnu.ltu with slope 0.25 was not confirmed by the T-Test.
Thanks, Questions? http://ece.ut.ac.ir/dbrg
OWA Operator - Cont. • The OWA weight of each document is computed as $\mathrm{OWA}(x_1, x_2, \dots, x_n) = W^T B = \sum_{j=1}^{n} w_j b_j$, where $W^T$ is the transpose of the weight vector $W$ that defines the semantics associated with the OWA operator, and $B = [b_1, b_2, \dots, b_n]$ is the vector $X = [x_1, x_2, \dots, x_n]$ reordered so that $b_j$ is the $j$th smallest element of $x_1, x_2, \dots, x_n$. • We used a simple function to bring the scores ($\{x_i,\ i = 1, \dots, n\}$) onto the same scale.
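The merge step described on this slide can be sketched as follows. The scaling function is not specified in the talk, so min-max scaling is used here purely as an assumption. The names `min_max_scale`, `owa` and `fuse` are hypothetical. A document missing from an engine's list receives a score of 0 for that engine.

```python
def min_max_scale(scores):
    """Assumed scaling: map one engine's scores linearly onto [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def owa(scores, weights):
    """W^T B, where B holds the scores sorted in ascending order."""
    ordered = sorted(scores)  # ordered[j - 1] is the j-th smallest score
    if len(ordered) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * b for w, b in zip(weights, ordered))


def fuse(engine_scores, weights):
    """Merge per-engine results into one ranked list.

    engine_scores: list of dicts (one per engine) mapping doc id -> scaled score.
    Returns (doc_id, fused_score) pairs, best first.
    """
    all_docs = set().union(*engine_scores)
    fused = {
        doc: owa([scores.get(doc, 0.0) for scores in engine_scores], weights)
        for doc in all_docs
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

For example, with two engines and weights `[0.5, 0.5]` the operator reduces to the arithmetic mean of the two scores, so a document returned by only one engine is penalized but not excluded.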