Using Concept Maps in NDLTD as a Cross-Language Summarization Tool for Computing–Related ETDs


Presentation Transcript


  1. Using Concept Maps in NDLTD as a Cross-Language Summarization Tool for Computing–Related ETDs Ryan Richardson, Edward A. Fox Digital Libraries Research Laboratory Virginia Tech {ryanr, fox} @vt.edu ETD2007, Uppsala, Sweden June 14, 2007

  2. The Concept Map: From learning tool to cross-language knowledge discovery tool Outline of talk • Motivation • Techniques for concept map creation • Monolingual Experiment • Using ETDs as a source of noun phrase translations • Cross-language Experiment • Summary & Future Work

  3. The Concept Map: From learning tool to cross-language knowledge discovery tool Problem: The web has made many large documents available, but summarizing them is difficult. • One example is ETDs. • NDLTD has over 300,000 ETDs in at least 12 languages. • Many more exist in the Spanish-speaking world but are not in NDLTD, e.g., UNAM in Mexico City has 50,000+ ETDs. • ETDs exist in many languages, but summarizing across languages is even more difficult.

  4. The Concept Map: From learning tool to cross-language knowledge discovery tool Drawbacks of common summarization techniques: • Abstracts are often available, but • Abstracts may mention topics that are only briefly discussed, and may omit topics discussed at length. • Many abstracts are hard to read. • It may be difficult to adjust to the variations among abstracts, e.g., in length and form, across subject areas and languages. • Translating abstracts well is difficult given the state of machine translation (MT). • Keywords (not always available with ETDs) cannot show relationships among the keywords themselves. • Keywords cannot show where a topic is mentioned in a document (lit. review, evaluation, conclusion, etc.).

  5. The Concept Map: From learning tool to cross-language knowledge discovery tool • Our suggested solution: • Automatically generated Concept Maps • Maps automatically translated into Spanish • Observation: • Since we have ETDs in multiple languages, why not use them as a resource for translating phrases?

  6. 2. Techniques for concept map creation Adopted 4 strategies for concept map generation • Use an advanced NLP tool (Relex) to find entities and relationships. • Restrict to one genre (ETDs) and one discipline (CS). • Search for known terms from an ontology in that specific discipline (i.e., CS). • Include in the maps the context of the relationship in the original document, using the hovertext feature present in IHMC’s CMapTools.

  7. 2. Techniques for concept map creation Relex: • Based on ideas from Minipar by Dekang Lin (2001) • Relex labels each node with a concept (represented by a noun or noun phrase). • Each link is labeled with one or more linking words, which usually are verbs or prepositions. • Links can sometimes have nouns as labels, in the case of complex noun phrases. • Produces an XML file listing each relation, along with the sentence in the original document where it was found.
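
The slide only states that Relex emits an XML file pairing each relation with its source sentence; the exact schema is not shown. A minimal sketch of reading such a file into (concept, link, concept, sentence) tuples, with the element and attribute names assumed purely for illustration:

```python
# Sketch only: the <relation> element, its subject/link/object attributes, and the
# <sentence> child are assumed names; the slides say only that each relation is
# stored together with the sentence it came from.
import xml.etree.ElementTree as ET
from typing import NamedTuple

class Relation(NamedTuple):
    subject: str    # node concept (noun or noun phrase)
    link: str       # linking words (usually verbs or prepositions)
    obj: str        # node concept
    sentence: str   # sentence from the original ETD, later used as hovertext

def load_relations(xml_path: str) -> list[Relation]:
    root = ET.parse(xml_path).getroot()
    relations = []
    for rel in root.iter("relation"):
        relations.append(Relation(
            subject=rel.get("subject", ""),
            link=rel.get("link", ""),
            obj=rel.get("object", ""),
            sentence=(rel.findtext("sentence") or "").strip(),
        ))
    return relations
```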

  8. 2. Techniques for concept map creation • Relex uses the GATE (General Architecture for Text Engineering) natural language processing toolkit, developed at U. of Sheffield. • Default entity tagger is ANNIE (Another Nearly New Information Extractor), which comes with GATE. • Relex has 2 built-in entity taggers, one for newspaper text and one for biomedical texts.

  9. 2. Techniques for concept map creation • Used lists of CS terms from The Ontology Project (Villanova, 2005) • Divides all of CS & Information Technology into 21 top-level categories, with subcategories 1 to 6 levels deep and over 900 leaf nodes. • ANNIE uses 'gazetteers' to list possible words that can represent a class/concept in a document. • We encoded the ontology into ANNIE by creating a gazetteer for each leaf node containing terms that can represent that concept in the text. • Gazetteers were written for 244 of the leaf nodes, covering the most common topics in VT's computer science collection.
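
For readers unfamiliar with ANNIE, its gazetteers are plain-text .lst files (one term per line) indexed by a lists.def file. A rough sketch of generating such files from ontology leaf nodes follows; the leaf-node names and terms are made-up examples, not the actual Ontology Project entries.

```python
# Rough sketch: write one ANNIE-style .lst gazetteer per ontology leaf node,
# plus the lists.def index (format: <file>:<majorType>:<minorType>).
# The leaf nodes and terms below are illustrative, not the real ontology.
from pathlib import Path

leaf_terms = {
    "information_retrieval": ["information retrieval", "tf-idf", "inverted index"],
    "computer_networks": ["bandwidth", "packet switching", "TCP/IP"],
}

def write_gazetteers(out_dir: str, leaf_terms: dict[str, list[str]]) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    def_lines = []
    for leaf, terms in leaf_terms.items():
        (out / f"{leaf}.lst").write_text("\n".join(terms) + "\n", encoding="utf-8")
        def_lines.append(f"{leaf}.lst:cs_ontology:{leaf}")
    (out / "lists.def").write_text("\n".join(def_lines) + "\n", encoding="utf-8")

write_gazetteers("gazetteers", leaf_terms)
```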

  10. 2. Techniques for concept map creation • Since Relex produces an XML file showing the sentence where each entity-relation is found, we can add that info to the concept map in CMapTools. • When a user hovers over link text, the sentence from the original ETD is shown. • In a Table of Contents map, for relations connecting chapters to sections, hovering over link text shows the first 200 characters of that section.
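
The hovertext rule just described is simple enough to state as code; a small sketch (the helper names are ours, not from the actual system):

```python
# Hovertext rule from the slide: Relex links show the source sentence,
# ToC links show the first 200 characters of the section text.
def relex_hovertext(sentence: str) -> str:
    return sentence.strip()

def toc_hovertext(section_text: str, limit: int = 200) -> str:
    return " ".join(section_text.split())[:limit]   # collapse whitespace, then truncate
```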

  11. 2. Techniques for concept map creation Detail of Relex map showing relationship between ‘OAI’ and ‘bandwidth’

  12. 2. Techniques for concept map creation Detail of ToC map showing 1st 200 chars of section 2.8.

  13. 3. Monolingual Experiment Presented 3 concept map layouts to subjects • Maps drawn using only chapter & section headings extracted from ETDs (ToC maps) • Maps drawn using only the relations Relex found for each chapter in the ETD (Relex maps) • Maps combining the above 2 styles (ToC+Relex maps) • Each user saw 2 of 3 possible CM layouts (ToC, Relex, ToC + Relex maps). • Asked subjective questions about helpfulness of nodes, links, hovertext, and concept maps overall • Run as in-class experiment with 35 students

  14. 3. Monolingual Experiment Top-level "Overview" Map

  15. 3. Monolingual Experiment Table of Contents only map

  16. 3. Monolingual Experiment Relex-only map

  17. 3. Monolingual Experiment Table of Contents + Relex map

  18. 3. Monolingual Experiment • Based on Likert scores, users preferred both Relex-only and Table of Contents+Relex maps to the Table of Contents-only maps (significant difference in 3 of 5 categories in both cases). • Only significant difference between Relex-only and Table of Contents+Relex maps was in hovertext. Users preferred the hovertext in the Relex-only maps (maybe because it was shorter and directly related to the nodes?).

  19. 4. Using ETDs as a source of noun phrase translations • To translate the concept maps into Spanish, one must translate not only single words but also phrases. • Dictionaries lack these technical phrases. • Solution was to use English and Spanish ETDs as a comparable corpus from which to mine phrases. • Thus a Spanish corpus is needed.

  20. 4. Using ETDs as a source of noun phrase translations • Acquired Spanish ETDs from Scirus, UDLA (Puebla), and UNAM (Mexico City) • ETDs from Scirus did not list the discipline or department which produced them. • Found 417 ETDs from Scirus which contained the terms “computacion” or “informatica” • Filtered collection down to 78 ETDs from UNAM about CS • Used all 89 ETDs from UDLA • Extracted all noun phrases of length 2-5 from this collection, producing a list of 730,000+ noun phrases
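
A very rough sketch of the phrase-mining step: keep word n-grams of length 2-5 that neither start nor end with a stopword. The real extraction used proper NLP tooling; the tiny Spanish stopword list here is only for illustration.

```python
# Simplified stand-in for the noun phrase extraction: collect 2-5 word n-grams
# that do not start or end with a (deliberately tiny) Spanish stopword.
import re
from collections import Counter

STOP_ES = {"de", "la", "el", "en", "y", "a", "los", "las", "del", "que",
           "un", "una", "por", "con", "para"}

def candidate_phrases(text: str, min_len: int = 2, max_len: int = 5) -> Counter:
    words = re.findall(r"[a-záéíóúüñ]+", text.lower())
    phrases = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOP_ES or gram[-1] in STOP_ES:
                continue
            phrases[" ".join(gram)] += 1
    return phrases

print(candidate_phrases("la recuperación de información en bibliotecas digitales"))
```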

  21. 4. Using ETDs as a source of noun phrase translations • Implemented López-Ostenero's phrase translation algorithm, which requires a bilingual dictionary • Acquired a 46,000+ entry dictionary from U. of Maryland • Made changes and efficiency enhancements to López-Ostenero's algorithm: • Took advantage of the fact that the L.-O. algorithm ignores the word order of the original phrase to speed up the code
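
To make the word-order point concrete, here is a minimal sketch of that idea: index the mined Spanish phrases by their set of words, then look up each English phrase by the sets of its per-word dictionary translations. The toy dictionary and phrase list are illustrative, and the ranking of competing candidates in the real algorithm is omitted.

```python
# Minimal sketch of order-insensitive phrase matching (not the full
# López-Ostenero algorithm); toy dictionary and phrase list for illustration.
from itertools import product

dictionary = {                      # English word -> possible Spanish translations
    "digital": ["digital"],
    "library": ["biblioteca", "librería"],
}
spanish_phrases = ["biblioteca digital", "bibliotecas digitales", "red de computadoras"]

# Index mined Spanish phrases by their (order-free) word set for fast lookup
index: dict[frozenset, list[str]] = {}
for phrase in spanish_phrases:
    index.setdefault(frozenset(phrase.split()), []).append(phrase)

def translate_phrase(english_phrase: str) -> list[str]:
    options = [dictionary.get(w, []) for w in english_phrase.lower().split()]
    if not all(options):                    # a word with no translation: give up
        return []
    hits: list[str] = []
    for combo in product(*options):         # every combination of word translations
        hits += index.get(frozenset(combo), [])
    return hits

print(translate_phrase("digital library"))  # -> ['biblioteca digital']
```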

  22. 4. Using ETDs as a source of noun phrase translations Test of the López-Ostenero Algorithm: • Extracted a list of 1,971 English phrases (2-5 words) from the Ontology Project • Looked for phrase translations in the 730K phrase list mined from the Spanish ETDs • Found translations for 631 of them (32%) • Two Spanish-speaking experts reviewed the translations found and judged over 80% of them correct (averaging the two experts) • Speedup was 4x compared to the original code (from 7.2 hours to 1.75 hours)

  23. 5. Cross-language Experiment • Cross-language experiment with 22 students at UDLA, via the web • Essentially the fully automated version of Pilot Experiment 1 • Test collection selected from Virginia Tech computer science ETDs (30 ETDs) • Concept maps were automatically generated as HTML + JPEG images, formatted by AT&T's Graphviz, since the students had never used CMapTools before. • Half of the CMs & abstracts were translated by a paid professional translator; the other half were translated automatically using phrases mined from the ETD collections, plus Systran
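
For a sense of how such served maps could be produced, here is a small sketch using the Python graphviz package (a local Graphviz install is required). The node names echo the OAI/bandwidth example from slide 11, but the link label is made up.

```python
# Sketch of rendering a tiny concept map with Graphviz; the study emitted
# HTML + JPEG, here we render PNG, which works the same way.
from graphviz import Digraph

cmap = Digraph("concept_map", format="png")
cmap.attr(rankdir="LR")                          # left-to-right layout
cmap.node("OAI", shape="box")
cmap.node("bandwidth", shape="box")
cmap.edge("OAI", "bandwidth", label="reduces")   # label stands in for the linking words
cmap.render("concept_map", cleanup=True)         # writes concept_map.png
```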

  24. 5. Cross-language Experiment

  25. 5. Cross-language Experiment English version of ETD by Saraiya

  26. 5. Cross-language Experiment Spanish (automatic) translation of ETD by Saraiya

  27. 5. Cross-language Experiment • Beforehand, 1 expert came up with 6 questions and listed which of the 30 ETDs were relevant to the questions. • 2 other experts looked at questions and gave their own relevance judgments. • Subjects were given summaries of 5 ETDs (abstract, CM, or both) for each question. • Asked to determine which ETDs are relevant to these 6 questions • Sets of 5 ETDs were disjoint for all 6 questions, so a subject never saw same ETD more than once.

  28. 5. Cross-language Experiment • For all 6 questions, ETDs were selected that had perfect agreement among the 3 experts as to relevance. • For each question, we measured agreement between the experts and the subject over the 5 ETDs, so the scale ran from 0 (disagreed about the relevance of all 5 ETDs) to 5 (perfect agreement). • In the results, agreement scores of 1 to 5 actually occurred (no subject disagreed with the experts completely).
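
The scoring itself is just a per-question count of matches with the experts' unanimous judgments; a tiny sketch with placeholder data:

```python
# 0-5 agreement score for one question: how many of the 5 ETDs the subject
# judged the same way as the (unanimous) experts. Data below is made up.
def agreement_score(expert: dict[str, bool], subject: dict[str, bool]) -> int:
    return sum(expert[etd] == subject.get(etd, False) for etd in expert)

expert  = {"etd1": True, "etd2": False, "etd3": True, "etd4": False, "etd5": True}
subject = {"etd1": True, "etd2": True,  "etd3": True, "etd4": False, "etd5": True}
print(agreement_score(expert, subject))   # -> 4
```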

  29. 5. Cross-language Experiment Within-Subjects Design, 6 test conditions (1 question in each condition)

  30. 5. Cross-language Experiment Likert Questions • Subjects were also asked to rate the helpfulness of the concept map nodes, links, and concept maps overall, as well as the helpfulness of the abstracts, with respect to how much each helped them determine relevance to the questions • 5-pt Likert scale

  31. 5. Cross-language Experiment ANOVA Results Two factors • Presentation Type – 3 levels (abstract, CM, abstract+CM) • Translation Type – 2 levels (human, machine) The Presentation Type effect on expert/subject agreement was significant at the p=0.05 level (p=.008) The Translation Type effect was not significant (p=.171) The Presentation Type * Translation Type interaction was not significant (p=.939)
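
For reference, this kind of two-factor analysis can be reproduced with statsmodels roughly as follows; the data frame here is a made-up stand-in for the per-subject agreement scores.

```python
# Two-way ANOVA (Presentation Type x Translation Type) on agreement scores.
# The scores below are placeholders, not the experiment's data.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "agreement":    [4, 3, 5, 2, 4, 5, 3, 4, 5, 2, 3, 4],
    "presentation": ["abstract", "cm", "both"] * 4,
    "translation":  ["human"] * 6 + ["machine"] * 6,
})

model = ols("agreement ~ C(presentation) * C(translation)", data=df).fit()
print(anova_lm(model, typ=2))   # main effects and the interaction term
```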

  32. 5. Cross-language Experiment Means (human- and machine-translated combined) • Concept Map alone gave significantly higher agreement than Abstract alone (p=.001) • Abstract + Concept Map gave significantly higher agreement than Abstract alone (p=.027) • No significant difference between the Concept Map and Abstract + Concept Map conditions (p values from Wilcoxon tests)
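
Pairwise comparisons of this kind can be run with SciPy's Wilcoxon signed-rank test; the paired score lists below are placeholders, not the experiment's data.

```python
# Wilcoxon signed-rank test on paired agreement scores (placeholder data).
from scipy.stats import wilcoxon

abstract_only = [2, 3, 2, 4, 3, 2, 3, 2]
cm_only       = [4, 5, 3, 5, 4, 4, 5, 3]
stat, p = wilcoxon(abstract_only, cm_only)
print(f"Wilcoxon statistic={stat}, p={p:.3f}")
```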

  33. 5. Cross-language Experiment Likert Results Subjects found the concept maps (overall) more helpful for assessing relevance than the abstracts. Means: Abstract 3.05 / 5, CM 4.24 / 5. Significant at the p=0.05 level. Users did not find a significant difference in the quality of nodes vs. links (relations).

  34. 6. Summary Using • NLP tools • Domain-specific ontology We have been able to automatically produce concept maps for large documents (ETDs). For the cross-language case, using • Phrase translations mined from ETD collection • Off-the-shelf MT tools We have been able to automatically produce & translate concept maps that allowed users to determine relevance of ETDs better than using machine-translated abstracts alone.

  35. 6. Future Work • Extract images/figures from PDF files and insert them directly into concept maps as resources • Alternatively, use superimposed information (SI) tools to allow automatically-generated concept maps to point back to text, tables, and images in the original PDF (Murthy et al. 2006) • Work with universities to obtain larger number of Spanish ETDs in CS and other technical fields to ‘learn’ more phrase translations

  36. Acknowledgments Thanks to: • Funding from NSF through grants DUE-0121679 and IIS-0080748, a grant from IMLS, and the RDEC/ACE project • Ben Goertzel, Hugo Pinto, and Vicente Cordeiro for their help with modifying Relex • Lillian Cassell and her team at Villanova for their work on the Ontology Project • Craig Scott for making the Scirus data available • Alberto Castro Thompson and Silvia González Marín at UNAM for making their ETD collection available to us • Alfredo Sanchez and Gabriel Torres at UDLA • People at IHMC for their help interfacing with CMapTools • Other current and former members of VT’s DLRL who helped with experiments

  37. References • Bontcheva, K., Tablan, V., Maynard, D., and Cunningham, H. Evolving GATE to Meet New Challenges in Language Engineering. Natural Language Engineering 10(3), pp. 349-373, 2004. • Cassel, L. N. "Information organization for the union of computing related disciplines", 2005. http://what.csc.villanova.edu/twiki/pub/Main/ProjectArchive/2005-10-31-Villanova.ppt • Lin, D. and Pantel, P. Discovery of Inference Rules for Question Answering. Natural Language Engineering, 2001. • López-Ostenero, F., Gonzalo, J., and Verdejo, F. Noun Phrases as Building Blocks for Cross-Language Search Assistance. Information Processing and Management, 41, pp. 549-568, September 2004.

  38. References • Murthy, U., Richardson, R., Fox, E. A., and Delcambre, L. Enhancing Concept Mapping Tools Below and Above to Facilitate the Use of Superimposed Information. Presented at the 2nd International Conference on Concept Mapping, San José, Costa Rica, 2006. • Novak, J. D. and Gowin, D. B. Learning How To Learn. Cambridge, UK, Cambridge University Press, 1984. • Wierzbicka, A. Semantics: Primes and Universals. Oxford, UK, Oxford University Press, 1996.

  39. Questions?
