Automatic Chinese Text Categorization: Feature Engineering and Comparison of Classification Approaches. Yi-An Lin and Yu-Te Lin
Motivation • Text categorization (TC) is extensively researched in English but much less so in Chinese. • How does feature engineering help in Chinese? • Should Chinese content be segmented into words first? • Which machine learning method works best for TC: Naïve Bayes, SVM, Decision Tree, k-Nearest Neighbor, MaxEnt, or Language Model methods?
Outline • Data Preparation • Feature Selection • Feature Vector Encoding • Comparison of Classifiers • Feature Engineering • Comparison after Feature Engineering • Conclusion
Data Preparation • Tool: Yahoo News Crawler • Categories • Entertainment • Politics • Business • Sports
Feature Selection • Terms are selected with a statistical scoring criterion (one possible scoring sketch is given below).
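The slide does not name the statistic it uses, so the following is an illustration only: a chi-square (χ²) term-category score, a common choice for TC feature selection. The function name and the cutoff in the usage note are hypothetical, not from the original work.

```python
# Illustrative sketch of statistical term selection (chi-square assumed, not
# taken from the original slides).
from collections import Counter, defaultdict

def chi_square_scores(docs, labels):
    """docs: list of token lists; labels: list of category labels.
    Returns {term: max chi-square score over all categories}."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    df_cat = defaultdict(Counter)        # per-category document frequency
    for tokens, cat in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            df_cat[cat][t] += 1
    cat_sizes = Counter(labels)

    scores = {}
    for t, t_total in df.items():
        best = 0.0
        for cat, size in cat_sizes.items():
            a = df_cat[cat][t]           # term present, in category
            b = t_total - a              # term present, other categories
            c = size - a                 # term absent, in category
            d = (n - size) - b           # term absent, other categories
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[t] = best
    return scores

# Example usage (hypothetical cutoff): keep the 2000 highest-scoring terms.
# vocab = sorted(chi_square_scores(train_docs, train_labels),
#                key=lambda t: -chi_square_scores(train_docs, train_labels)[t])[:2000]
```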
Feature Vector Encoding (sketched below) • Binary: whether the document contains a word. • Count: number of occurrences. • TF: term frequency, the occurrence ratio within the document. • TF-IDF: term frequency weighted by inverse document frequency.
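A minimal sketch of the four encodings, assuming documents are already tokenized (e.g., into segmented words or character n-grams) and a feature vocabulary `vocab` has been fixed; `encode` and its parameters are illustrative names, not from the original work.

```python
# Sketch of the four feature-vector encodings listed on the slide.
import math
from collections import Counter

def encode(doc_tokens, vocab, all_docs=None, mode="tfidf"):
    counts = Counter(doc_tokens)
    total = sum(counts.values()) or 1
    vec = []
    for term in vocab:
        c = counts[term]
        if mode == "binary":
            vec.append(1.0 if c > 0 else 0.0)          # presence/absence
        elif mode == "count":
            vec.append(float(c))                       # raw occurrence count
        elif mode == "tf":
            vec.append(c / total)                      # within-document ratio
        else:  # "tfidf": requires the full training collection all_docs
            df = sum(1 for d in all_docs if term in d) # document frequency
            idf = math.log(len(all_docs) / (1 + df))   # inverse document freq.
            vec.append((c / total) * idf)
    return vec
```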
Feature Engineering (one possible reading is sketched below) • Stop Terms: frequent, uninformative terms, analogous to stop words in English. • Group Terms: terms merged by common substrings. • Key Terms: terms that are highly distinctive of a single category.
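The slide does not give the exact rules for the three term classes, so the sketch below is one plausible reading; the document-frequency and purity thresholds and the prefix-based grouping rule are assumptions chosen only for illustration.

```python
# Illustrative sketch of the three feature-engineering steps (thresholds and
# grouping rule are assumptions, not the authors' settings).
from collections import Counter, defaultdict

def find_stop_terms(docs, df_ratio=0.8):
    """Terms appearing in almost every document carry little category signal."""
    df = Counter(t for d in docs for t in set(d))
    return {t for t, c in df.items() if c / len(docs) >= df_ratio}

def find_key_terms(docs, labels, purity=0.9, min_count=5):
    """Terms occurring almost exclusively in one category are 'key terms'."""
    by_cat, total = defaultdict(Counter), Counter()
    for d, y in zip(docs, labels):
        for t in set(d):
            by_cat[y][t] += 1
            total[t] += 1
    return {t: cat
            for cat, counts in by_cat.items()
            for t, c in counts.items()
            if total[t] >= min_count and c / total[t] >= purity}

def group_by_prefix(terms, prefix_len=2):
    """Toy 'group terms' rule: merge terms sharing a common leading substring."""
    groups = defaultdict(list)
    for t in terms:
        groups[t[:prefix_len]].append(t)
    return {p: ts for p, ts in groups.items() if len(ts) > 1}
```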
Comparison of Feature Engineering Methods (S: stop terms, G: group terms, K: key terms)
Conclusion • The n-gram language model outperforms the other methods (a minimal sketch follows): • By nature, language models consider all features rather than a small, error-prone selected subset. • They do not impose restrictive independence assumptions (e.g., Naïve Bayes). • They benefit from better smoothing. • Feature engineering also reduces sparsity, but may introduce ambiguity. • Semantic understanding is a promising direction for future research.
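To make the language-model argument concrete, here is a minimal per-category character-bigram classifier with add-one smoothing. It is a generic sketch under assumed settings (character bigrams, Laplace smoothing), not the authors' exact model, but it shows how an LM scores every character transition and smooths unseen ones instead of relying on a selected feature subset.

```python
# Minimal per-category character-bigram language-model classifier
# (add-one smoothing); a generic sketch, not the model from the talk.
import math
from collections import Counter, defaultdict

class CharBigramLM:
    def __init__(self):
        self.bigrams = defaultdict(Counter)   # category -> Counter[(prev, cur)]
        self.context = defaultdict(Counter)   # category -> Counter[prev]
        self.chars = set()                    # observed character vocabulary

    def fit(self, texts, labels):
        for text, cat in zip(texts, labels):
            padded = "^" + text               # "^" marks the start of a document
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[cat][(prev, cur)] += 1
                self.context[cat][prev] += 1
                self.chars.add(cur)

    def log_prob(self, text, cat):
        V = len(self.chars) or 1
        padded = "^" + text
        ll = 0.0
        for prev, cur in zip(padded, padded[1:]):
            # Add-one (Laplace) smoothing keeps unseen bigrams from zeroing out
            # the whole document score.
            num = self.bigrams[cat][(prev, cur)] + 1
            den = self.context[cat][prev] + V
            ll += math.log(num / den)
        return ll

    def predict(self, text):
        return max(self.bigrams, key=lambda cat: self.log_prob(text, cat))
```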