Subproject III - Spoken Language Systems • Members: Lin-shan Lee (PI), Lee-Feng Chien (Co-PI), Hsin-min Wang (Co-PI), Berlin Chen (Co-PI) • Other Participants: Sin-Horng Chen, Yih-Ru Wang, Yuan-Fu Liao, Jen-Tzung Chien
Outline • Members • Research Theme • Current Achievements with Demos • Future Directions
Research Theme • Information Extraction and Retrieval (IE & IR) • Spoken Document Understanding and Organization • Spoken Dialogues (Diagram labels: Users, Networks, Multimedia Network Content)
Research Roadmap • Current Achievements • Information Extraction and Retrieval (IE & IR): Term Extraction/Organization, Term Translation/Indexing, Retrieval Modeling • Spoken Document Understanding and Organization: Title/Summary Generation, Topic Analysis/Organization • Spoken Dialogues • ….. • Future Directions • Information Navigation across Multimedia/Spoken Documents • Cross-language Information Processing • Knowledge Discovery and Web Mining (Other diagram labels: Speech & Language Understanding, Spoken Language Applications, Distributed Speech Recognition)
Information Extraction & Retrieval (IE & IR) • Named Entity Extraction from Text/Spoken Documents • Taxonomy Generation • Term Translation • Retrieval Modeling for Text/Spoken Documents
Named Entity Extraction from Text/Spoken Documents • Global Information for the Entire Document Extracted from Forward/Backward PAT-Trees • Some named entities cannot be easily identified from a single sentence, but can be extracted when information from several sentences is considered jointly • Named Entity Matching using Retrieved Text Documents to Identify Some Out-of-Vocabulary (OOV) Words
Automatic Taxonomy Generation (1/2) • Problem • Find relationships and associations between terms, and organize them into a hierarchical structure (i.e. taxonomy) • Useful for identifying and analyzing concepts embedded in documents and queries • Method • An approach proposed for clustering terms into comprehensive hierarchical clusters • Web mining techniques -- automatically generating relationships between terms based on relationships between documents retrieved with the terms from the Web
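The clustering step above can be illustrated with a minimal sketch. It assumes each term is represented by a bag-of-words vector built from text retrieved for that term, and it uses average-link agglomerative clustering with cosine similarity as a stand-in for the proposed method, which the slide does not specify in detail. The snippets are made-up toy data, and the names `cosine` and `build_taxonomy` are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_taxonomy(term_vectors):
    """Average-link agglomerative clustering; returns a nested tuple tree."""
    # Each cluster is (subtree, list of member terms).
    clusters = [(term, [term]) for term in term_vectors]

    def similarity(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        total = sum(cosine(term_vectors[a], term_vectors[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        index_pairs = [(i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))]
        # Merge the two most similar clusters.
        i, j = max(index_pairs,
                   key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Toy "retrieved snippets" per term (assumed data, not real search results).
snippets = {
    "python": "programming language code software",
    "java": "programming language code virtual machine",
    "tiger": "animal cat wild stripes",
    "lion": "animal cat wild mane",
}
term_vectors = {term: Counter(text.split()) for term, text in snippets.items()}
tree = build_taxonomy(term_vectors)
print(tree)
```

In this toy setting the two programming terms and the two animal terms end up in separate subtrees, because their retrieved contexts share no words across groups.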
Automatic Taxonomy Generation (2/2) • A Typical Example for Term Taxonomy
Automatic Term Translation (1/2) • Problem • Cross-language information retrieval systems usually rely on bilingual dictionaries; however, search terms are very often missing because they are proper nouns and OOVs • Discovering translations of unknown query terms in different languages • Method • Finding translations of query terms via mining of huge quantities of data obtained from the Web • Correlation/Association patterns extracted from parallel bilingual pages retrieved from the Web, the anchor texts of the pages indicating out-links to multi-lingual pages, etc.
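One common way to score such correlation patterns is a chi-square test on co-occurrence counts. The sketch below assumes page counts are available for a source term and each candidate translation; the slide does not say which association measure is used, so chi-square is an illustrative choice here. The counts and candidate labels are invented, and `chi_square` is a hypothetical helper.

```python
def chi_square(both, source_only, candidate_only, neither):
    """Chi-square association score from a 2x2 contingency table of page counts."""
    n = both + source_only + candidate_only + neither
    denom = ((both + source_only) * (both + candidate_only)
             * (source_only + neither) * (candidate_only + neither))
    if denom == 0:
        return 0.0
    return n * (both * neither - source_only * candidate_only) ** 2 / denom

# Assumed counts: (pages with both terms, source only, candidate only, neither).
candidate_counts = {
    "candidate_A": (80, 20, 10, 890),
    "candidate_B": (10, 90, 200, 700),
}
scores = {name: chi_square(*counts) for name, counts in candidate_counts.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

Chi-square measures the strength of association but not its direction, so a practical system would also check that the two terms co-occur more often than chance rather than less.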
Automatic Term Translation (2/2) • The Live Query Term Translation System (LiveTrans) • Machine-Extracted Translations • http://wkd.iis.sinica.edu.tw/LiveTrans/lt.html
Retrieval Modeling for Text/Spoken Documents (1/2) • Problem • Conventional retrieval models cannot be trained or improved through use • Word usage mismatch between the query and the documents • Method • Literal term matching: HMM/N-gram model trained with ML or MCE criteria • Concept matching: Topical mixture model (TMM), extended from PLSA, trained in either a supervised or an unsupervised manner
Retrieval Modeling for Text/Spoken Documents (2/2) • HMM/N-gram retrieval model • A document is viewed as a probabilistic generative model for the query • Literal term matching • Topical Mixture Model (extended from PLSA) • A document is composed of a set of K latent topical distributions (unigrams) for predicting the query • Concept matching
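The two scoring schemes can be contrasted in a short sketch. It assumes a unigram document model linearly interpolated with a collection model for literal matching, and hand-set topic distributions for concept matching; the interpolation weight `lam`, the toy documents, and all probability values are assumptions, and the trained models from the slides (ML/MCE training, TMM estimation) are not reproduced here.

```python
import math
from collections import Counter

def literal_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc) with a unigram document model smoothed by the collection."""
    doc_tf, col_tf = Counter(doc), Counter(collection)
    score = 0.0
    for w in query:
        p = lam * doc_tf[w] / len(doc) + (1 - lam) * col_tf[w] / len(collection)
        if p == 0.0:
            return float("-inf")  # query word unseen everywhere
        score += math.log(p)
    return score

def topical_log_likelihood(query, p_word_given_topic, p_topic_given_doc):
    """log P(query | doc) where each word is generated by a mixture of K topics."""
    score = 0.0
    for w in query:
        p = sum(p_t * p_w.get(w, 0.0)
                for p_t, p_w in zip(p_topic_given_doc, p_word_given_topic))
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score

query = ["market", "prices"]

# Literal matching: the document must contain the query words to score well.
finance_doc = "stock market prices fall".split()
sports_doc = "football match final score".split()
collection = finance_doc + sports_doc
literal_finance = literal_log_likelihood(query, finance_doc, collection)
literal_sports = literal_log_likelihood(query, sports_doc, collection)

# Concept matching: hand-set topic unigrams (assumed values, K = 2).
topics = [
    {"stock": 0.3, "market": 0.3, "prices": 0.2, "economy": 0.2},  # finance topic
    {"football": 0.4, "match": 0.3, "score": 0.3},                 # sports topic
]
topical_economy = topical_log_likelihood(query, topics, [0.9, 0.1])
topical_sports = topical_log_likelihood(query, topics, [0.1, 0.9])
print(literal_finance, literal_sports, topical_economy, topical_sports)
```

The topical score depends only on a document's topic weights, so a finance-heavy document can rank well for "market prices" even if neither word appears in it, which is the word usage mismatch that concept matching is meant to address.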
Spoken Document Understanding & Organization (1/2) • Problem • The content of multimedia documents is very often described by the associated speech information • Unlike text documents, whose paragraphs and titles are easy to look through at a glance, multimedia/spoken documents are unstructured and difficult to retrieve/browse
Spoken Document Understanding & Organization (2/2) • Spoken Document Transcription • Multimedia/Spoken Document Segmentation • Summarization for Multimedia/Spoken Documents • Title Generation for Multimedia/Spoken Documents • Topic Analysis and Organization for Multimedia/Spoken Documents
Spoken Document Segmentation (Broadcast News) • Dividing a one-hour News Episode into News Stories • An improved audio segmentation technique integrating BIC and Divide-and-Conquer Approaches • Viterbi search over the Hidden Markov Model of text clusters (Figure label: distance computation)
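The BIC part of the segmentation can be sketched as a change-point test. This is a minimal illustration on one-dimensional features with a single Gaussian per segment; the divide-and-conquer search and the Viterbi step over text clusters are not shown, and `penalty_weight`, `margin`, and the synthetic signal are assumptions.

```python
import math

def variance(xs):
    """Maximum-likelihood variance, floored to avoid log(0)."""
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-8)

def delta_bic(xs, t, penalty_weight=1.0):
    """BIC gain of modeling xs as two Gaussians split at t instead of one."""
    n = len(xs)
    left, right = xs[:t], xs[t:]
    gain = 0.5 * (n * math.log(variance(xs))
                  - len(left) * math.log(variance(left))
                  - len(right) * math.log(variance(right)))
    # A 1-D Gaussian has 2 parameters (mean, variance); the split adds 2 more.
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty

def best_change_point(xs, margin=5):
    """Return (index, score) of the best split, or (None, score) if no split helps."""
    candidates = range(margin, len(xs) - margin)
    t = max(candidates, key=lambda idx: delta_bic(xs, idx))
    score = delta_bic(xs, t)
    return (t, score) if score > 0 else (None, score)

# Synthetic feature track: 30 frames around 0.0, then 30 frames around 5.0.
signal = ([0.1 if i % 2 else -0.1 for i in range(30)]
          + [5.1 if i % 2 else 4.9 for i in range(30)])
change_point, score = best_change_point(signal)
print(change_point, score)
```

A positive score means the two-segment model is preferred despite its extra parameters, so a boundary is hypothesized at that frame.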
Title Generation for Spoken Documents (Broadcast News) • Training Phase • Developing statistical relationships between words in the training documents and their human-generated titles • Training Documents D={dj, j=1,2,…,N} (text form) • Human-generated Titles of Training Documents T={tj, j=1,2,…,N} (text form) • Generation Phase (for New Spoken Documents) • Transcribing into term sequences • Identifying suitable terms, and using them to generate a readable title • New Spoken Documents D={di, i=1,2,…,M} (speech form) • Computer-generated Titles of Spoken Documents T={ti, i=1,2,…,M} (text/speech form)
Topic Analysis and Organization for Spoken Documents (Broadcast News) • Based on Probabilistic Latent Semantic Analysis (PLSA) • Terms (words, syllable pairs, etc.)/documents analyzed by probabilities considering a set of latent topics • Trained by EM algorithm • Related documents don’t have to share common sets of terms, and related terms don’t have to co-exist in the same set of documents • Spoken Documents Clustered by the Latent Topics and Organized in a Two-dimensional Tree Structure, or a Two-layer Map (Figure: Two-dimensional Tree Structure for Organized Topics)
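EM training for PLSA can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the standard formulation P(w|d) = Σ_k P(w|T_k) P(T_k|d) with random initialization; the term-document counts are toy data, the function names are hypothetical, and the tree or map organization built on top of the topics is not shown.

```python
import math
import random

def normalize(values):
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)  # fall back to uniform
    return [v / total for v in values]

def plsa(counts, n_topics, n_iter=50, seed=0):
    """counts[d][w] is the count of term w in document d. Returns (P(T|d), P(w|T))."""
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])
    p_t_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_t = [normalize([rng.random() + 0.1 for _ in range(n_terms)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        new_t_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_t = [[0.0] * n_terms for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_terms):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(T_k | d, w) for this (document, term) pair.
                joint = [p_t_d[d][k] * p_w_t[k][w] for k in range(n_topics)]
                z = sum(joint)
                if z == 0:
                    continue
                for k in range(n_topics):
                    expected = counts[d][w] * joint[k] / z
                    new_t_d[d][k] += expected
                    new_w_t[k][w] += expected
        # M-step: re-estimate both distributions from expected counts.
        p_t_d = [normalize(row) for row in new_t_d]
        p_w_t = [normalize(row) for row in new_w_t]
    return p_t_d, p_w_t

def log_likelihood(counts, p_t_d, p_w_t):
    total = 0.0
    for d, row in enumerate(counts):
        for w, c in enumerate(row):
            if c:
                p = sum(p_t_d[d][k] * p_w_t[k][w] for k in range(len(p_w_t)))
                total += c * math.log(p)
    return total

# Toy counts: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
counts = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]]
p_t_d, p_w_t = plsa(counts, n_topics=2, n_iter=30)
final_ll = log_likelihood(counts, p_t_d, p_w_t)
first_ll = log_likelihood(counts, *plsa(counts, n_topics=2, n_iter=1))
print(p_t_d, final_ll)
```

Documents can then be grouped by their dominant topic in `p_t_d`. With data this clearly separated, the first two and the last two documents would be expected to fall under different topics, although EM only guarantees a local optimum.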
Spoken Dialogues • Analysis and Design Using Quantitative Simulations
Analysis and Design Based on Quantitative Simulations • Problem • Dialogue performance cannot be predicted before the system is online • The effects of different factors, such as the system’s dialogue strategies and the speech recognition and understanding conditions, cannot be quantitatively identified and analyzed • Method • Computer-aided analysis and design approaches based on quantitative simulations • Metrics shown: transaction success rate, slot loss rate, misunderstanding rate
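A quantitative simulation of this kind can be sketched as a small Monte Carlo experiment. Everything specific below is an assumption made for illustration: the slot-filling dialogue model, the per-attempt probabilities, the error-free confirmation step, and the working definitions of the three metrics (a slot is "lost" if nothing usable is recognized, "misunderstood" if a wrong value is accepted). The slides do not give these details.

```python
import random

def simulate(n_dialogues, n_slots, p_correct, p_lost, confirm,
             max_attempts=3, seed=0):
    """Simulate slot-filling dialogues and return the three rates as a dict."""
    rng = random.Random(seed)
    successes = lost_slots = wrong_slots = 0
    # With confirmation, the system re-asks until correct or out of attempts.
    attempts = max_attempts if confirm else 1
    for _dialogue in range(n_dialogues):
        all_correct = True
        for _slot in range(n_slots):
            outcome = "lost"
            for _attempt in range(attempts):
                r = rng.random()
                if r < p_correct:
                    outcome = "correct"
                    break
                outcome = "lost" if r < p_correct + p_lost else "wrong"
            if outcome == "lost":
                lost_slots += 1
            elif outcome == "wrong":
                wrong_slots += 1
            all_correct = all_correct and outcome == "correct"
        successes += all_correct
    total_slots = n_dialogues * n_slots
    return {
        "transaction_success_rate": successes / n_dialogues,
        "slot_loss_rate": lost_slots / total_slots,
        "misunderstanding_rate": wrong_slots / total_slots,
    }

# Compare two dialogue strategies under the same assumed recognition conditions.
no_confirm = simulate(5000, n_slots=3, p_correct=0.8, p_lost=0.1, confirm=False)
with_confirm = simulate(5000, n_slots=3, p_correct=0.8, p_lost=0.1, confirm=True)
print(no_confirm)
print(with_confirm)
```

Sweeping `p_correct` or `max_attempts` in such a simulation gives curves of each metric against recognition accuracy or strategy, which is the kind of analysis that is otherwise unavailable before deployment.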
Demo: Understanding and Organization of Chinese Broadcast News with Interactive Interface
Future Directions • Information Navigation across Multimedia/Spoken Documents • Fast-growing quantities of multimedia/spoken documents are much more difficult to browse than text documents • Better approaches to navigate across huge quantities of multimedia/spoken documents using comprehensive presentation (e.g. topic taxonomy) • Cross-language Information Processing Technologies • Reducing language barriers in a future multilingual world • Seeking international collaboration and resource exchange • Collaboration between the two major non-English languages may be a good direction • Knowledge Discovery and Web Mining • The Web offers live, dynamic, and by far the most complete global knowledge available to humanity • Better approaches to explore Web resources and enhance language processing technologies