Name Disambiguation in Digital Libraries Tan Yee Fan 2005 October 19 WING Group Meeting
Digital libraries • DBLP, Citeseer, etc. • Information is stored as metadata records to facilitate searching • Author names • Titles • Publication titles • Inconsistency in metadata records hinders searching • Abbreviation of names and publication titles • Typographical errors
Are they the same author? • Danny Poo • Danny C. C. Poo, Teck-Kang Toh, Christopher S. G. Khoo, Glenn Hong. Development of an Intelligent Web Interface to Online Library Catalog Databases. APSEC 1999: 64-7 • Danny Chiang Choon Poo, Isaac K. C. Tan. Design of an Automatic Annotation Framework for Corporate Web Content. APSEC 2004: 384-391 • Hui Yang • Maan A. Kousa, Ahmed K. Elhakeem, Hui Yang. Performance of ATM networks under hybrid ARQ/FEC error control scheme. IEEE/ACM Trans. Netw. 7(6): 917-925 (1999) • Hui Yang, Tat-Seng Chua. QUALIFIER: Question Answering by Lexical Fabric and External Resources. EACL 2003: 363-370
Who am I, I am who? • Author name disambiguation • Given a large number of citations, how do we determine which name refers to which author? • Closely related problem: citation matching • Given a large number of citations, how do we determine which citations refer to the same paper? • Solutions must be scalable • DBLP has more than 660,000 citations • Citeseer has more than 730,000 documents
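To make the scalability concern concrete, here is a rough back-of-the-envelope count of how many pairwise comparisons a naive all-pairs approach would need. It uses only the DBLP figure quoted on the slide as a lower bound; the variable names are illustrative.

```python
# Lower bound from the slide: DBLP has more than 660,000 citations.
n_citations = 660_000

# Number of unordered pairs among n items: n * (n - 1) / 2.
n_pairs = n_citations * (n_citations - 1) // 2

print(f"{n_pairs:,} pairs")  # roughly 2.18e11 comparisons
```

Even at a million comparisons per second, that many pairs would take about two and a half days, which is why any solution has to avoid comparing everything with everything.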
Ideas • Idea 1: determine the research field • Unfortunately, paper titles contain few words and some conferences are broad in scope • Idea 2: use coauthor information • An author is likely to collaborate with a select group of people • This group is likely to publish a number of papers together • So, measure the similarity of coauthor lists
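The slides do not say which similarity measure is used for coauthor lists. As one possible illustration, the sketch below uses Jaccard similarity (shared coauthors divided by all distinct coauthors). The function name and the sample names are hypothetical, and the lowercase normalization is a simplifying assumption.

```python
def coauthor_jaccard(coauthors_a, coauthors_b):
    """Jaccard similarity between two coauthor lists.

    An assumed stand-in measure; the slides do not name a specific one.
    """
    # Normalize lightly so trivial case/whitespace differences do not matter.
    a = {name.strip().lower() for name in coauthors_a}
    b = {name.strip().lower() for name in coauthors_b}
    if not a and not b:
        return 0.0  # two empty lists carry no evidence either way
    return len(a & b) / len(a | b)


# Hypothetical coauthor lists from two citations.
paper_1 = ["A. Lee", "B. Tan", "C. Ng"]
paper_2 = ["B. Tan", "C. Ng", "D. Lim"]
similarity = coauthor_jaccard(paper_1, paper_2)  # 2 shared / 4 distinct = 0.5
```

A measure like this treats names as exact strings, so "B. Tan" and "Bee Tan" would not match; it only helps once coauthor names themselves are reasonably consistent.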
Forward direction: M. Kan = M.-Y. Kan = Min-Yen Kan • Problem • Pairwise comparison of all the coauthor lists is very expensive (it does not finish even after a few days) • Solution • Soft clustering of the coauthor lists using a cheap distance measure • Then perform pairwise comparison within the clusters • What is a good soft clustering algorithm?
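The slide leaves the choice of soft clustering algorithm open. The sketch below shows one simple possibility, not the author's method: overlapping buckets keyed by a crude "surname plus first initial" key, so that a record can fall into several buckets and only records sharing a bucket are compared. The helper names `cheap_key` and `candidate_pairs`, the key scheme, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cheap_key(name):
    """Crude key: lowercase surname plus first initial.

    Assumes a non-empty 'Given Surname' style name.
    """
    parts = name.replace(".", " ").split()
    return f"{parts[-1].lower()}_{parts[0][0].lower()}"


def candidate_pairs(records):
    """Return index pairs of records that share at least one bucket.

    Each record is the author list of one citation. A record lands in one
    bucket per author key, so the clusters overlap (soft clustering).
    """
    buckets = defaultdict(set)
    for idx, authors in enumerate(records):
        for name in authors:
            buckets[cheap_key(name)].add(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical author lists; only records 0 and 1 share a key ("kan_m").
records = [
    ["M. Kan", "A. Lee"],
    ["Min-Yen Kan", "B. Tan"],
    ["C. Ng", "D. Lim"],
]
pairs = candidate_pairs(records)
```

The expensive comparison would then run only on the returned pairs. The trade-off is that a key this coarse produces large buckets for common surnames, while a stricter key risks separating records that belong together.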
Backward direction: This Hang Cui is not that Hang Cui • Difficult to determine from the metadata alone, without external resources • Many authors have several distinct research areas • Each research area has different collaborators • Currently investigating what kind of external resource to use • Google for URLs?
The end • But the research has just begun…