A comparative study of TF*IDF, LSI and multi-words for text classification

A comparative study of TF*IDF, LSI and multi-words for text classification. Presenter: Jian-Ren Chen. Authors: Wen Zhang, Taketoshi Yoshida, Xijin Tang. 2011, ESWA. Outline: Motivation, Objectives, Methodology, Experiments, Conclusions, Comments.

Presentation Transcript


  1. A comparative study of TF*IDF, LSI and multi-words for text classification. Presenter: Jian-Ren Chen. Authors: Wen Zhang, Taketoshi Yoshida, Xijin Tang. 2011, ESWA

  2. Outline • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments

  3. Motivation Although TF*IDF, LSI and multi-words were proposed long ago, no comparative study of these indexing methods exists, and no results on their classification performance have been reported.

  4. Objectives • A comparative study of TF*IDF, LSI and multi-words for text classification: - information retrieval - text categorization • Indexing terms are assessed on: - semantic quality - statistical quality

  5. Methodology - TF*IDF • $w_{i,j}$: the weight of term $i$ in document $j$ • $N$: the number of documents in the collection • $tf_{i,j}$: the term frequency of term $i$ in document $j$ • $df_i$: the document frequency of term $i$ in the collection (Figure: term-document matrix, with axes labeled "Terms (keywords) of the document collection" and "documents".)
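
The transcript keeps the symbol definitions from this slide but not the weighting formula itself. As a rough sketch, the code below assumes the widely used form $w_{i,j} = tf_{i,j} \times \log(N / df_i)$ with a natural logarithm and no normalization; the log base, the function name `tfidf_weights`, and the toy documents are all assumptions, not taken from the slides.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes w[i][j] = tf[i][j] * log(N / df[i]); the slide's exact
    formula (log base, normalization) is not shown in the transcript.
    """
    n_docs = len(docs)  # N: number of documents in the collection
    # df[i]: number of documents that contain term i at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf[i][j]: raw count of term i in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy collection (made-up tokens, for illustration only)
docs = [
    ["text", "classification", "text"],
    ["text", "indexing"],
    ["latent", "semantic", "indexing"],
]
weights = tfidf_weights(docs)
print(weights[0])
```

Under this form, a term that appears in every document gets weight zero, which is why TF*IDF favors terms concentrated in a few documents.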

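  6. Methodology - LSI Given a term-document matrix $X = [x_1, x_2, \ldots, x_n]$ with each $x_i \in \mathbb{R}^m$, and supposing the rank of $X$ is $r$, LSI decomposes $X$ using SVD as follows: 1. $X = U \Sigma V^T$ 2. $X_k = U_k \Sigma_k V_k^T$ (Figure: term-document matrix, with axes labeled "Terms (keywords) of the document collection" and "documents".)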
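
The rank-$k$ truncation on this slide can be sketched with NumPy's SVD. This is only an illustration: the helper name `lsi_truncate`, the toy matrix, and the choice to represent documents as the columns of $\Sigma_k V_k^T$ are assumptions, not details given in the slides.

```python
import numpy as np

def lsi_truncate(X, k):
    """Rank-k approximation of a term-document matrix via truncated SVD."""
    # Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    X_k = U_k @ np.diag(s_k) @ Vt_k
    # One common choice (an assumption here): each column is a document
    # expressed in the k-dimensional latent space
    doc_coords = np.diag(s_k) @ Vt_k
    return X_k, doc_coords

# Toy 4-term x 3-document count matrix (made-up values)
X = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])
X_k, doc_coords = lsi_truncate(X, k=2)
print(X_k.shape, doc_coords.shape)  # (4, 3) (2, 3)
```

Keeping only the $k$ largest singular values gives the best rank-$k$ approximation of $X$ in the Frobenius norm, so $k$ directly controls how much of the original matrix is retained.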

  7. Methodology - Multi-word • A multi-word must occur at least twice in a document. • The length of a multi-word should be between 2 and 6.
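
The two filters on this slide can be applied to contiguous word n-grams as in the sketch below. The transcript does not show how candidates are actually extracted, so the n-gram enumeration, the assumption that length is counted in tokens, and the hypothetical helper `candidate_multiwords` are illustrative only.

```python
from collections import Counter

def candidate_multiwords(tokens, min_len=2, max_len=6, min_freq=2):
    """Contiguous n-grams passing the slide's two filters (hypothetical helper).

    Assumes length is measured in tokens and that candidates are plain
    contiguous n-grams; the authors' extraction procedure may differ.
    """
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(min_len, max_len + 1)
        for i in range(len(tokens) - n + 1)
    )
    # Keep only n-grams that occur at least min_freq times in this document
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Toy document (made-up sentence, for illustration only)
tokens = ("support vector machine beats naive bayes "
          "but support vector machine is slower").split()
found = candidate_multiwords(tokens)
for gram, count in sorted(found.items()):
    print(" ".join(gram), count)
```

Note that shorter n-grams nested inside a longer repeated one are also returned; a real pipeline would probably need a rule for choosing among them.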

  8. Experiments - Datasets

  9. Experiments - Evaluation

  10. Experiments - Chinese

  11. Experiments - English

  12. Experiments - t-test

  13. Comparison

  14. Conclusions • LSI can produce indexing with better discriminative power. • LSI and multi-words have better semantic quality than TF*IDF, while TF*IDF has better statistical quality than the other two methods. • The number of dimensions is still a decisive factor for indexing, whichever indexing method is used for classification.

  15. Comments • Advantages - Compares TF*IDF, LSI and multi-words • Disadvantage - Semantic quality and statistical quality are judged by the authors' intuition rather than by theory • Applications - text mining
