Topic Learning in Text & Conversational Speech

Constantinos Boulis

Introduction • Definition of Topic Learning • Supervised: Learn a mapping of data to topics • Unsupervised: Discover new topics • Applications of Topic Learning • Crucial step for information access • Google News, call-center automation • Challenges of Topic Learning • A learning problem of very high dimensionality

An Example
B.16: especially, you know, smaller areas,
A.17: Uh-huh.
B.18: smaller towns.
A.19: Uh-huh. Yeah. Probably the hardest thing in, in my family, uh, my grandmother, she had to be put in a nursing home and, um, she had used the walker for, for quite some time, probably about six to nine months. And, um, she had a fall and, uh, finally, uh, she had Parkinson's disease,
B.20: Oh.
A.21: and it got so much that she could not take care of her house.
B.22: Right.
A.23: Then she lived in an apartment and, uh, that was even harder --
B.24: Uh-huh.

Impact • Interdisciplinary research spanning Natural Language Processing, Data Mining and Speech Recognition • The core technology can be leveraged in fields such as Bioinformatics • All these technologies come together on the 311 line (TIME, Feb. 7th 2005)

Dimensions of Topic Learning • [Figure: "Past work" and "My work" placed along two axes, Supervision (less to more) and Structured Input (less to more)]

Dissertation Contributions • General Topic Learning Contributions (applicable to text, speech, gene expression etc) • Combining Multiple Clustering Partitions (*) • Feature Construction (*) • Topic Learning in Conversational Speech • Speech-to-text errors • Role of disfluencies • Separating content & style • Role of prominence (*)

Combining Multiple Clustering Partitions • Classifier combination has been studied extensively, but there is little work on combining clustering systems • Fundamental problem: There is no correspondence between the cluster labels of different systems, e.g. {1,2,2,1,3,1,2,3,3,3,2,1} vs. {3,1,1,3,2,3,1,2,1,2,3,2} • Contribution: New algorithms that estimate the correspondence of clusters and then combine them, using linear programming techniques and singular value decomposition
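
The correspondence problem can be illustrated with a minimal sketch on the two label sequences from this slide. It assumes both sequences label the same 12 items and simply tries every one-to-one relabeling. This brute-force search is a stand-in for illustration only, not the linear programming or SVD algorithms of the dissertation, and the names `align_labels`, `system_a` and `system_b` are hypothetical.

```python
from itertools import permutations

def align_labels(reference, other):
    """Find the one-to-one relabeling of `other` that best agrees with `reference`.

    Assumes `other` uses no more distinct labels than `reference`.
    """
    ref_labels = sorted(set(reference))
    other_labels = sorted(set(other))
    best_mapping, best_agreement = None, -1
    # Brute force over all one-to-one mappings; only feasible for a few clusters.
    for perm in permutations(ref_labels, len(other_labels)):
        mapping = dict(zip(other_labels, perm))
        agreement = sum(mapping[o] == r for r, o in zip(reference, other))
        if agreement > best_agreement:
            best_mapping, best_agreement = mapping, agreement
    return best_mapping, best_agreement

# The two partitions shown on the slide (assumed to cover the same 12 items).
system_a = [1, 2, 2, 1, 3, 1, 2, 3, 3, 3, 2, 1]
system_b = [3, 1, 1, 3, 2, 3, 1, 2, 1, 2, 3, 2]

mapping, agreement = align_labels(system_a, system_b)
relabeled_b = [mapping[label] for label in system_b]
print(mapping, agreement)  # {1: 2, 2: 3, 3: 1} 9
```

After relabeling, the two systems agree on 9 of the 12 items, although their raw labels match on none. The number of candidate mappings grows factorially with the number of clusters, which is one reason a more scalable formulation is needed in practice.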

Feature Construction • There is a lot of work on supervised topic learning methods but not much on constructing feature spaces • The bag-of-words representation is too coarse, but hard to improve • Contribution: Add only those word pairs that contribute sufficiently more information than their constituent words, i.e. the whole is much more than the sum of its parts • “second hand” >> “second” + “hand” • “big brother” >> “big” + “brother”
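
One way to picture "the whole is much more than the sum of its parts" is the sketch below. It assumes the information a feature carries is measured as mutual information between the feature's presence and the topic label, and it keeps a word pair only if it beats its better constituent word by a margin. The scoring rule, the `THRESHOLD` value and the four-document corpus are all made up for illustration and are not the criterion used in the dissertation.

```python
import math
from collections import Counter

def mutual_information(presence, labels):
    """Mutual information (in bits) between a binary feature and the topic labels."""
    n = len(labels)
    joint = Counter(zip(presence, labels))
    feat_counts = Counter(presence)
    label_counts = Counter(labels)
    return sum(
        (c / n) * math.log2(c * n / (feat_counts[f] * label_counts[y]))
        for (f, y), c in joint.items()
    )

def has_bigram(tokens, w1, w2):
    """True if w1 is immediately followed by w2 in the token list."""
    return any(a == w1 and b == w2 for a, b in zip(tokens, tokens[1:]))

def bigram_gain(docs, labels, w1, w2):
    """Information of the pair minus that of its more informative constituent word."""
    pair_mi = mutual_information([has_bigram(d, w1, w2) for d in docs], labels)
    word_mi = max(mutual_information([w in d for d in docs], labels) for w in (w1, w2))
    return pair_mi - word_mi

# Hypothetical toy corpus, invented for this example.
texts = ["second hand car for sale", "bought a second hand bike",
         "wait a second please", "give me a hand please"]
labels = ["shopping", "shopping", "chat", "chat"]
docs = [t.split() for t in texts]

THRESHOLD = 0.1  # assumed margin, in bits
keep_second_hand = bigram_gain(docs, labels, "second", "hand") > THRESHOLD
keep_for_sale = bigram_gain(docs, labels, "for", "sale") > THRESHOLD
print(keep_second_hand, keep_for_sale)  # True False
```

In this toy corpus "second hand" separates the two topics perfectly while "second" and "hand" alone do not, so the pair is kept. "for sale" occurs exactly where "for" does, so it adds nothing and is dropped.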

Role of Prominence • Speech is a richer medium than text; it is not only what we say but also how we say it • Prominence is the emphasis we put on words • Contribution: The first study to show that prominence can be combined with lexical saliency measures to yield improved feature subsets for topic learning

Summary • Topic learning is a key step for information access (retrieval, extraction) • Key contribution: Advancing language processing for spoken documents • Unique element of this work: Combining speech, language and data mining technology

Journal Publications Resulting from PhD • Deng, L., Wang, Y., Wang, K., Acero, A., Hon, H.-W., Droppo, J., Boulis, C., Mahajan, M., and Huang, X.D. February-March 2004. “Speech and Language Processing for Multimodal Human-Computer Interaction”, Journal of VLSI Signal Processing Systems, 36(2-3):161-187. • Boulis, C., Ostendorf, M., Riskin, E., Otterson, S. November 2002. “Graceful Degradation of Speech Recognition Performance Over Packet-Erasure Networks”, IEEE Transactions on Speech and Audio Processing, 10(8):580-590. • Deng, L., Wang, K., Acero, A., Hon, H.-W., Droppo, J., Boulis, C., Wang, Y.-Y., Jakoby, D., Mahajan, M., Chelba, C., and Huang, X.D. November 2002. “Distributed Speech Processing in MiPad's Multimodal User Interface”, IEEE Transactions on Speech and Audio Processing, 10(8):605-619.

Conference Publications Resulting from PhD • Boulis, C., Kahn, J., Ostendorf, M., July 2005. “The Role of Disfluencies in Topic Classification of Natural Human-Human Conversations”, Proc. of the Workshop on Spoken Language Understanding, in press. • Boulis, C., Ostendorf, M., June 2005. “A Quantitative Analysis of Lexical Differences Between Genders in Telephone Conversations”, Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), in press. • Boulis, C., Ostendorf, M. April 2005. “Text Classification by Augmenting the Bag-of-Words Representation with Redundancy-Compensated Bigrams”, Proc. of the International Workshop on Feature Selection in Data Mining, pp 9-16. • Boulis, C., Ostendorf, M. September 2004. “Combining Multiple Clustering Systems”. Proc. of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), LNAI 3202, pp. 63-74. • Boulis, C. May 2004. “Speaker Recognition with Mixtures of Gaussians with Sparse Regression Matrices”, Proc. of the Student Research Workshop of Human Language Technology/North American Chapter of the Association for Computational Linguistics (HLT/NAACL), companion volume, pp. 55-60. • Riskin, E., Boulis, C., Otterson, S., Ostendorf, M. September 2001. “Graceful Degradation of Speech Recognition Performance Over Lossy Packet Networks”. Proc. of the 7th European Conference on Speech Communication and Technology (Eurospeech 2001), pp. 2715-2719.

Future Publications & Awards Resulting from PhD
Manuscripts under review
• Boulis, C., Ostendorf, M., “Combining Multiple Clustering Partitions”, Journal of Machine Learning.
• Boulis, C., Ostendorf, M., “Using Symbolic Prominence to Help Design Feature Subsets for Topic Classification and Clustering of Natural Human-Human Conversations”, Interspeech-05.
Manuscripts under preparation
• Boulis, C., Ostendorf, M., “Unsupervised Estimation of Word Confusability and its Use in Topic Classification of Human-Human Conversations”
Awards
• Best Student Paper Award, PKDD 2004 (581 total submissions, 17% acceptance rate)

Backup Slides The following slides are not used in the main presentation

Speech-to-Text Errors • The output of speech-to-text (STT) systems contains errors (~20%) • Some words have higher error rates than others • Contribution: Design an algorithm that adaptively clusters confusable words, modifying the vocabulary provided for topic learning tasks • Provided gains in classification performance of 25% relative

Role of Disfluencies • Disfluencies are very common in conversational speech • “That’s all you need you only need one boxcar” (repetition) • “So it’ll take um so you want to do what” (repair) • Contribution: Demonstrate that removing disfluencies does not affect topic classification performance with the bag-of-words model, but does affect more complex representations
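
To make "removing disfluencies" concrete, here is a deliberately simple sketch. It assumes that dropping filled pauses and immediately repeated words is an acceptable approximation. It does not detect repairs or longer repetitions such as the slide's examples, and it is not the disfluency detection used in this work. `FILLED_PAUSES` and `remove_simple_disfluencies` are hypothetical names.

```python
import re

# Assumed list of filled pauses; a real system would use a richer inventory.
FILLED_PAUSES = {"uh", "um"}

def remove_simple_disfluencies(utterance):
    """Drop filled pauses and immediate word repetitions from one utterance."""
    tokens = re.findall(r"[a-z'-]+", utterance.lower())
    cleaned = []
    for token in tokens:
        if token in FILLED_PAUSES:
            continue  # filled pause
        if cleaned and cleaned[-1] == token:
            continue  # immediate repetition, e.g. "in, in"
        cleaned.append(token)
    return cleaned

print(remove_simple_disfluencies("Probably the hardest thing in, in my family, uh, my grandmother"))
# ['probably', 'the', 'hardest', 'thing', 'in', 'my', 'family', 'my', 'grandmother']
```

On the repair example, "So it'll take um so you want to do what", this sketch removes only "um"; the abandoned start "so it'll take" stays in, which shows why repairs need more than a word filter.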

Separating Content & Style • When two people talk, they bring their idiosyncrasies into the discussion. Are there idiosyncrasies at the gender level? • Can this affect topic classification? • Contribution: The first quantitative study to show that there are lexical differences between genders in telephone conversations • Almost 100% accuracy in detecting the gender of a speaker based on what he/she said • The gender of the speaker on one side can influence lexical patterns on the other side
