Optimizing Search Engines using Clickthrough Data by Thorsten Joachims Presentation by M. Şükrü Kuran
Outline • Search Engines • Clickthrough Data • Learning of Retrieval Functions • Support Vector Machine (SVM) for Learning of Ranking Functions • Experiment Setup • Offline Experiment • Online Experiment • Analysis of The Online Experiment • Conclusion and Future Work • References • Questions
Search Engines • Search engines use ranking systems to list results by their relevance to the query • Current ranking systems are not optimized for relevance • As an alternative, clickthrough data can be used to produce rankings that are better optimized for relevance
Clickthrough Data What is Clickthrough Data? • Clickthrough data is the set of links that the user selects from the list of links retrieved by the search engine for a user-given query. Why is Clickthrough Data Important? • The clicked links tend to be the most relevant links among the query results • Easier to acquire than explicit user feedback (since the data is already in the logs of the search engines)
Clickthrough Data (2) • Users are less likely to click on a link that has a low ranking (independent of its actual relevance) • Users typically scan only the first 10 links in the result set [24] • Thus, clickthrough data does not give an absolute relevance value for the query, but it does give a good relative relevance value
Clickthrough Data (3) Example: Results for a search for SVM: 1. Kernel Machines 2. Support Vector Machine 3. SVM-Light Support Vector Machine 4. Intr. To Support Vector Machines 5. Support Vector Machine and Kernel Methods Ref. 6. Archives of Support Vector Machines 7. SVM demo Applet 8. Royal Holloway Support Vector Machine 9. Support Vector Machine The Software 10. Lagrangian Support Vector Machine Home Page • Among the 10 results, only links 1, 3 and 7 are chosen (clickthrough data)
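Clickthrough Data (4) link3 $<_{r^*}$ link2, link7 $<_{r^*}$ link2, link7 $<_{r^*}$ link4, link7 $<_{r^*}$ link5, link7 $<_{r^*}$ link6 • $<_{r^*}$ : ranking preferred by the user (binary relation); link3 $<_{r^*}$ link2 means that the user prefers link 3 over link 2 • We can generalize this preference information: link$_i$ $<_{r^*}$ link$_j$ for all pairs $1 \le j < i$, with $i \in C$ and $j \notin C$, where $C$ is the set of clicked links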
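The generalization rule above can be sketched in a few lines of Python. This is only an illustration: the function name `preferences_from_clicks` and the representation of links as integers are assumptions made for the example, not part of the presentation.

```python
def preferences_from_clicks(ranking, clicked):
    """Return (i, j) pairs meaning "link i is preferred over link j".

    ranking: list of link ids in the order they were shown (best first).
    clicked: collection of link ids the user clicked.
    A clicked link is preferred over every unclicked link shown above it.
    """
    clicked = set(clicked)
    pairs = []
    for pos, link_i in enumerate(ranking):
        if link_i not in clicked:
            continue
        # Every unclicked link ranked above a clicked link loses to it.
        for link_j in ranking[:pos]:
            if link_j not in clicked:
                pairs.append((link_i, link_j))
    return pairs


# The SVM example: 10 results shown, links 1, 3 and 7 clicked.
ranking = list(range(1, 11))
pairs = preferences_from_clicks(ranking, {1, 3, 7})
print(pairs)
```

For the SVM query this reproduces the five preferences listed on the slide. Note that link 1 yields no preference, because nothing was ranked above it.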
Learning of Retrieval Functions • Goal: We have to find a retrieval function $f$ whose ranking $r_{f(q)}$ for a query $q$ is close to the user-preferred ranking $r^*$ • In order to measure the similarity between any given $r_{f(q)}$ and $r^*$, we have to use a performance metric • Average Precision (binary relevance only) • Kendall's $\tau$ (very simple; a good performance metric)
Learning of Retrieval Functions (2) • Kendall's $\tau$ • Between any two rankings $r_a$ and $r_b$ the distance is $\tau(r_a, r_b) = \frac{P - Q}{P + Q} = 1 - \frac{2Q}{\binom{m}{2}}$ • D : Set of documents in a query result • P : # of concordant pairs in D x D • Q : # of discordant pairs in D x D • m : # of documents/links in D
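As a rough illustration of the metric, the sketch below counts concordant and discordant pairs directly. It assumes two strict rankings (no ties) over the same documents, given as hypothetical dictionaries that map a document to its position.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two strict rankings of the same documents.

    rank_a, rank_b: dicts mapping document -> position (1 = best).
    Returns (P - Q) / (P + Q), where P counts concordant pairs and
    Q counts discordant pairs.
    """
    docs = list(rank_a)
    if len(docs) < 2:
        raise ValueError("need at least two documents")
    p = q = 0
    for d1, d2 in combinations(docs, 2):
        # Same sign of the position difference means both rankings agree.
        if (rank_a[d1] - rank_a[d2]) * (rank_b[d1] - rank_b[d2]) > 0:
            p += 1
        else:
            q += 1
    return (p - q) / (p + q)


# Five documents; the second ranking swaps d1 and d3.
rank_a = {"d1": 1, "d2": 2, "d3": 3, "d4": 4, "d5": 5}
rank_b = {"d3": 1, "d2": 2, "d1": 3, "d4": 4, "d5": 5}
tau = kendall_tau(rank_a, rank_b)
print(tau)
```

Here three pairs are discordant and seven are concordant, so $\tau = (7 - 3)/10 = 0.4$, which agrees with $1 - 2Q/\binom{m}{2} = 1 - 6/10$.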
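Learning of Retrieval Functions (3) • Problem Definition of Learning an Appropriate Retrieval Function • For a fixed (but unknown) distribution of queries and target (user preferred) rankings, the goal is to find the retrieval function $f$ that maximizes the expected Kendall's $\tau$, $\tau_P(f) = \int \tau(r_{f(q)}, r^*) \, d\Pr(q, r^*)$, where $\Pr(q, r^*)$ is the distribution of queries and target rankings and $r_{f(q)}$ is the ranking that $f$ produces for query $q$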
Support Vector Machine (SVM) for Learning of Ranking Functions • Machine learning in information retrieval is usually based on binary classification (a document is either relevant to the query or not) • Since the information gathered from clickthrough data is not absolute relevance information, we cannot use binary classification
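Support Vector Machine (SVM) for Learning of Ranking Functions (2) • Using a set of queries and user rankings $(q_1, r_1^*), \dots, (q_n, r_n^*)$ (training data), we will select a ranking function $f$ from a family (F) of ranking functions • Selection will be based on maximizing the empirical $\tau_S(f) = \frac{1}{n} \sum_{i=1}^{n} \tau(r_{f(q_i)}, r_i^*)$, where $r_{f(q_i)}$ is the ranking that $f$ produces for query $q_i$ • n : # of queries in the training set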
Support Vector Machine (SVM) for Learning of Ranking Functions (3) • Then, we need to find a sound family of ranking functions. • How do we find an F that includes an efficient ranking function (f)?
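Support Vector Machine (SVM) for Learning of Ranking Functions (4) • A set of functions, $(d_i, d_j) \in f_{\vec{w}}(q) \iff \vec{w} \cdot \Phi(q, d_i) > \vec{w} \cdot \Phi(q, d_j)$ • Where the $\Phi(q, d)$'s are description based retrieval functions [10,11], which map a query $q$ and a document $d$ to a feature vector • The $\vec{w}$'s are weight vectors (2D) adjusted by learning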
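A minimal sketch of such a linear ranking function: documents are sorted by the score $\vec{w} \cdot \Phi(q, d)$. The weight vector and the two-dimensional feature values below are made up for illustration; they do not come from the presentation.

```python
def rank_documents(w, features):
    """Order documents by descending score w . Phi(q, d).

    w: weight vector (list of floats).
    features: dict mapping document -> feature vector Phi(q, d).
    """
    def score(doc):
        return sum(wi * xi for wi, xi in zip(w, features[doc]))

    return sorted(features, key=score, reverse=True)


# Hypothetical 2D features for three documents and one query.
w = [1.0, 0.5]
features = {
    "d1": [0.2, 0.9],
    "d2": [0.8, 0.1],
    "d3": [0.5, 0.5],
}
order = rank_documents(w, features)
print(order)
```

With these numbers the scores are 0.65 for d1, 0.85 for d2 and 0.75 for d3, so d2 is ranked first. Changing $\vec{w}$ changes the ordering, which is exactly what learning adjusts.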
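Support Vector Machine (SVM) for Learning of Ranking Functions (5) • Instead of maximizing our goal function directly, we can minimize Q, the number of discordant pairs in our performance measure • Using the same approach as classification SVMs [7]: minimize $V(\vec{w}, \vec{\xi}) = \frac{1}{2} \vec{w} \cdot \vec{w} + C \sum \xi_{i,j,k}$ subject to $\vec{w} \cdot \Phi(q_k, d_i) \ge \vec{w} \cdot \Phi(q_k, d_j) + 1 - \xi_{i,j,k}$ for all $(d_i, d_j) \in r_k^*$, $k = 1, \dots, n$, and $\xi_{i,j,k} \ge 0$ for all $i, j, k$ • Here $\vec{w}$ is the weight vector, $\Phi(q, d)$ the feature mapping, $\xi_{i,j,k}$ the slack variables, and $C$ the constant that trades margin size against training error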
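The optimization problem can be illustrated by rewriting it without slack variables as a pairwise hinge loss, $\frac{1}{2} \vec{w} \cdot \vec{w} + C \sum \max(0, 1 - \vec{w} \cdot (\Phi(q_k, d_i) - \Phi(q_k, d_j)))$, and minimizing it with plain subgradient descent. This is a toy sketch under that assumption, not the solver used in the original work; the function name, step size, epoch count and feature values are all hypothetical.

```python
def train_ranking_svm(pairs, c=1.0, epochs=200, lr=0.01):
    """Learn w from preference pairs by subgradient descent.

    pairs: list of (phi_preferred, phi_other) feature-vector pairs,
           where the first document should be ranked above the second.
    Objective: 0.5 * w.w + c * sum(max(0, 1 - w.(phi_pref - phi_other))).
    """
    dim = len(pairs[0][0])
    # Each preference becomes one difference vector.
    diffs = [[a - b for a, b in zip(pref, other)] for pref, other in pairs]
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)  # gradient of the regularizer 0.5 * w.w
        for x in diffs:
            margin = sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1.0:  # constraint violated: hinge term is active
                for k in range(dim):
                    grad[k] -= c * x[k]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Hypothetical preferences: the preferred document has the larger
# first feature in every pair.
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.9, 0.2], [0.1, 0.3]),
    ([0.7, 0.5], [0.2, 0.6]),
]
w = train_ranking_svm(pairs)
margins = [
    sum(wi * (a - b) for wi, a, b in zip(w, pref, other))
    for pref, other in pairs
]
print(w, margins)
```

Every margin comes out positive on this toy data, meaning the learned $\vec{w}$ orders each training pair the way the user preferred. Working on difference vectors is what lets a classification-style SVM solve a ranking problem.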
Experiment Setup • A baseline meta-search engine called Striver is used for testing purposes • Striver forwards a query to MSNSearch, Google, Excite, Altavista and Hotbot • It acquires the top 100 results from each search engine • Based on the learned retrieval function, it selects the top 50 of the 500 candidates (there may be fewer candidates if more than one engine has found the same document)
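The merge step described above might look roughly like the following sketch. The engine names, URLs and the `score` callable (standing in for the learned retrieval function) are placeholders invented for the example, not Striver's actual interface.

```python
def merge_candidates(results_by_engine, score, top_k=50):
    """Union the engines' result lists, drop duplicates, keep the top_k.

    results_by_engine: dict mapping engine name -> list of URLs.
    score: callable URL -> float, a stand-in for the learned function.
    """
    seen = set()
    candidates = []
    for urls in results_by_engine.values():
        for url in urls:
            if url not in seen:  # the same document may come from several engines
                seen.add(url)
                candidates.append(url)
    return sorted(candidates, key=score, reverse=True)[:top_k]


# Two placeholder engines that both return "u2".
results = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u4"],
}
scores = {"u1": 0.1, "u2": 0.9, "u3": 0.5, "u4": 0.7}
top = merge_candidates(results, scores.get, top_k=3)
print(top)
```

Because "u2" appears in both lists, the candidate set has four entries rather than five, which mirrors why the 500 candidates can shrink.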
Offline Experiment • Using Striver, 112 queries are recorded • A large set of features is used to calculate the description based retrieval functions • The testing is done with different numbers of training queries • Results from Google and MSNSearch are used for benchmarking purposes
Online Experiment • Striver is used by a group of 20 people • From these people's queries, a training set of 260 queries is composed for Striver • The results are compared with results from Google, MSNSearch and Toprank (a simple meta-search engine)
Online Experiment (2) • "More clicks" means (in the comparison with Google) that users clicked more links from the learned engine than from Google; this happened for 29 queries out of 88 • "Less clicks" means (in the comparison with Google) that users clicked fewer links from the learned engine than from Google; this happened for 13 queries out of 88
Analysis of the Online Experiment • Since all of the users used the engine for academic searches, the learned function works well for searches on academic research topics • But it may not give results that are as good for different groups of people • We can say that the learned engine is a customizable engine, unlike traditional engines
Future Work and Conclusions • What is the optimal group size for user customization? • Features can be tuned for better performance • Can clustering algorithms group WWW users into subgroups based on their clickthrough data? • Can malicious users corrupt the learning process by clicking irrelevant links, and how can this be avoided?
References
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley-Longman, Harlow, UK, May 1999.
[2] B. Bartell, G. Cottrell, and R. Belew. Automatic combination of multiple ranked retrieval systems. In Annual ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR), 1994.
[3] D. Beeferman and A. Berger. Agglomerative clustering of a search engine query log. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2000.
[4] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, 1992.
[5] J. Boyan, D. Freitag, and T. Joachims. A machine learning architecture for optimizing web search engines. In AAAI Workshop on Internet Based Information Systems, August 1996.
[6] W. Cohen, R. Schapire, and Y. Singer. Learning to order things. Journal of Artificial Intelligence Research, 10, 1999.
[7] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning Journal, 20:273–297, 1995.
[8] K. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Processing Systems (NIPS), 2001.
[9] Y. Freund, R. Iyer, R. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In International Conference on Machine Learning (ICML), 1998.
[10] N. Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems, 7(3):183–204, 1989.
[11] N. Fuhr, S. Hartmann, G. Lustig, M. Schwantner, K. Tzeras, and G. Knorz. Air/x - a rule-based multistage indexing system for large subject fields. In RIAO, pages 606–623, 1991.
[12] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115–132. MIT Press, Cambridge, MA, 2000.
[13] K. Höffgen, H. Simon, and K. van Horn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50:114–125, 1995.
[14] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11. MIT Press, Cambridge, MA, 1999.
[15] T. Joachims. Learning to Classify Text Using Support Vector Machines – Methods, Theory, and Algorithms. Kluwer, 2002.
[16] T. Joachims. Unbiased evaluation of retrieval quality using clickthrough data. Technical report, Cornell University, Department of Computer Science, 2002. http://www.joachims.org.
[17] T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: a tour guide for the world wide web. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), volume 1, pages 770–777. Morgan Kaufmann, 1997.
[18] J. Kemeny and L. Snell. Mathematical Models in the Social Sciences. Ginn & Co, 1962.
[19] M. Kendall. Rank Correlation Methods. Hafner, 1955.
[20] H. Lieberman. Letizia: An agent that assists Web browsing. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI '95), Montreal, Canada, 1995. Morgan Kaufmann.
[21] A. Mood, F. Graybill, and D. Boes. Introduction to the Theory of Statistics. McGraw-Hill, 3rd edition, 1974.
[22] L. Page and S. Brin. Pagerank, an eigenvector based ranking approach for hypertext. In 21st Annual ACM/SIGIR International Conference on Research and Development in Information Retrieval, 1998.
[23] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.
[24] C. Silverstein, M. Henzinger, H. Marais, and M. Moricz. Analysis of a very large altavista query log. Technical Report SRC 1998-014, Digital Systems Research Center, 1998.
[25] V. Vapnik. Statistical Learning Theory. Wiley, Chichester, GB, 1998.
[26] Y. Yao. Measuring retrieval effectiveness based on user preference of documents. Journal of the American Society for Information Science, 46(2):133–145, 1995.
Questions? • Experiment Results? • Clickthrough Data? • Machine Learning for Retrieval Functions? • Retrieval Functions?