Metadata in Carrot II

• Current metadata
  • TF.IDF weights for both documents and collections
  • Full-text index
  • Metadata are transferred between nodes
• Potential problems
  • Storage cost: the metadata are very large
  • Computation cost: generating the metadata takes a long time
  • Communication cost: transferring the metadata takes a long time
  • Semantic meaning: term statistics capture little of the text's semantics
• Goal
  • A more efficient mechanism is needed to represent documents and collections
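To make the storage cost concrete, here is a minimal sketch of TF.IDF weighting. It assumes raw term frequency, idf = log(N / df), and whitespace tokenization; the slides do not say which variant Carrot II uses, and the function name `tf_idf` and the sample documents are purely illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumption: weight = raw term frequency * log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical three-document collection.
docs = ["carrot metadata index", "carrot text summary", "language model text"]
vectors = tf_idf(docs)
```

Each document's vector holds one entry per distinct term it contains, so the metadata grow with document length and vocabulary size.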

Proposed Approach
• Sources for metadata generation
  • Text summaries vs. full text
  • Multi-document summarization on collections
• Metadata organization
  • Topic hierarchy
• Automatic metadata generation
  • Statistical language model

Document (Text) Summarization
• Document summarization (DS)
  • “The process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks).” (Mani & Maybury, 1999)
  • Full text can be reduced to an abstract without losing too much useful information
• Multi-document summarization (MDS)
  • Works on related documents (same topic)
  • Can capture relations across documents

Language Model
• Language model
  • An approximation to real language
  • Tries to explain already observed phenomena or predict future behavior
  • A probability distribution over strings drawn from a finite alphabet
• Basic idea when used in IR
  • Infer a language model for each document
  • Estimate the probability of generating the query according to each model, rather than estimating the probability that each document is relevant to the query
  • Rank the documents according to these probabilities
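The three IR steps above can be sketched as follows. This is an illustration under stated assumptions, not the estimator from the cited thesis: it uses unigram models with linear-interpolation smoothing against the whole collection, and the function name `rank_by_query_likelihood`, the weight `lam`, and the sample documents are all hypothetical.

```python
import math
from collections import Counter

def rank_by_query_likelihood(query, docs, lam=0.5):
    """Rank document indices by log P(query | document model).

    Assumption: unigram models, smoothed by mixing the document estimate
    (weight lam) with the collection estimate (weight 1 - lam).
    """
    tokenized = [d.lower().split() for d in docs]
    collection = Counter(t for tokens in tokenized for t in tokens)
    total = sum(collection.values())
    q_terms = query.lower().split()
    scores = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        log_p = 0.0
        for t in q_terms:
            p_doc = tf[t] / len(tokens) if tokens else 0.0
            p_col = collection[t] / total if total else 0.0
            p = lam * p_doc + (1 - lam) * p_col
            if p == 0.0:
                # Term unseen in the whole collection: probability is zero.
                log_p = float("-inf")
                break
            log_p += math.log(p)
        scores.append((log_p, i))
    # Highest log-probability first; ties keep their original order.
    return [i for _, i in sorted(scores, key=lambda s: s[0], reverse=True)]

# Hypothetical collection and query.
docs = [
    "carrot metadata index",
    "text summary of documents",
    "language model for text retrieval",
]
ranking = rank_by_query_likelihood("language model", docs)
```

Smoothing keeps a document from scoring zero just because it lacks one query term; only the third document contains both query terms, so it ranks first.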

References
• Inderjeet Mani, Mark T. Maybury. Advances in Automatic Text Summarization. MIT Press, 1999.
• J. Ponte. A Language Modeling Approach to Information Retrieval. PhD thesis, Dept. of Computer Science, University of Massachusetts, Amherst, 1998.
• Michael P. Oakes. Statistics for Corpus Linguistics. Edinburgh University Press, 1998.
