A comparative study of TF*IDF, LSI and multi-words for text classification

Presenter: Jian-Ren Chen. Authors: Wen Zhang, Taketoshi Yoshida, Xijin Tang. 2011, ESWA

Outlines • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments

Motivation Although TF*IDF, LSI and multi-words were proposed long ago, there is no comparative study of these indexing methods, and no results have been reported on their classification performance.

Objectives • A comparative study of TF*IDF, LSI and multi-words for text classification: - information retrieval - text categorization • Indexing term quality: - semantic quality - statistical quality

Methodology - TF*IDF • $w_{i,j}$: the weight of term i in document j • $N$: the number of documents in the collection • $tf_{i,j}$: the term frequency of term i in document j • $df_i$: the document frequency of term i in the collection [Figure: term-document matrix, labeled "Terms (keywords) of the document collection" and "documents"]
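The slide's formula did not survive extraction, so the following is only a minimal sketch of the textbook weighting that matches the symbols defined above, $w_{i,j} = tf_{i,j} \times \log(N / df_i)$. The natural logarithm, the raw-count term frequency, and the function name `tfidf_weights` are assumptions made for illustration, not details confirmed by the slides.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumes w_ij = tf_ij * ln(N / df_i); the log base is an assumption.
    """
    n_docs = len(docs)  # N: number of documents in the collection
    # df_i: number of documents that contain term i at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: raw count of term i in document j
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["text", "mining", "text"],
    ["text", "classification"],
    ["latent", "semantic", "indexing"],
]
w = tfidf_weights(docs)
```

Under this weighting a term that appears in every document would get weight zero, since $\log(N/N) = 0$, which is why frequent but uninformative terms contribute little.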

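Methodology - LSI Given a term-document matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$ of rank $r$, LSI decomposes $X$ using SVD and keeps the $k$ largest singular values: 1. $X = U \Sigma V^{T}$ 2. $X_k = U_k \Sigma_k V_k^{T}$ [Figure: term-document matrix, labeled "Terms (keywords) of the document collection" and "documents"]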
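As a rough illustration of the rank-$k$ truncation, the sketch below uses NumPy's `numpy.linalg.svd`. The toy matrix, the function name `lsi_truncate`, and the choice to represent documents as the columns of $\Sigma_k V_k^{T}$ are assumptions for illustration; the slides do not say how the authors embed documents after truncation.

```python
import numpy as np

def lsi_truncate(X, k):
    """Return the rank-k approximation X_k and a k-dimensional document embedding."""
    # Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    X_k = U_k @ S_k @ Vt_k   # X_k = U_k Sigma_k V_k^T
    doc_coords = S_k @ Vt_k  # one column per document (assumed convention)
    return X_k, doc_coords

# Toy term-document matrix: 4 terms (rows) x 4 documents (columns)
X = np.array([
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)
X_k, doc_coords = lsi_truncate(X, k=2)
```

Choosing `k` equal to the full rank reproduces `X` up to floating-point error, while smaller `k` trades reconstruction accuracy for a lower-dimensional index.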

Methodology - Multi-word • A multi-word must occur at least twice in a document. • The length of a multi-word must be between 2 and 6.
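The two constraints above can be sketched as a simple filter over word n-grams. This is an illustrative stand-in, not the authors' extraction procedure: plain n-gram counting, measuring length in tokens, and the name `candidate_multiwords` are all assumptions.

```python
from collections import Counter

def candidate_multiwords(tokens, min_freq=2, min_len=2, max_len=6):
    """Return {n-gram tuple: count} for n-grams meeting both slide constraints."""
    counts = Counter()
    for n in range(min_len, max_len + 1):       # length between 2 and 6
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    # Keep only sequences that occur at least min_freq times in this document
    return {gram: c for gram, c in counts.items() if c >= min_freq}

tokens = ("latent semantic indexing beats plain indexing "
          "while latent semantic indexing scales").split()
cands = candidate_multiwords(tokens)
```

A real pipeline would likely add linguistic filtering on top of this, since raw repeated n-grams also capture fragments that are not meaningful phrases.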

Experiments - Datasets

Experiments - Evaluation

Experiments - Chinese

Experiments - English

Experiments - t-test

Comparison

Conclusions • LSI can produce better indexing in terms of discriminative power. • LSI and multi-words have better semantic quality than TF*IDF, and TF*IDF has better statistical quality than the other two methods. • The number of dimensions is still a decisive factor for indexing when different indexing methods are used for classification.

Comments • Advantages - Compares TF*IDF, LSI and multi-words • Disadvantage - Semantic quality and statistical quality are assessed merely by intuition rather than theory • Applications - text mining
