An overview of tools for NLP tasks: NLTK, the Stanford Tagger, TreeTagger, the Brill Tagger, TnT, parsers, the SRI Language Modeling Toolkit, text analysis tools, and the back-off N-gram language model.
Tools for Natural Language Processing Applications Guruprasad Saikumar & Kham Nguyen
OUTLINE • Natural Language Toolkit • Part of Speech Taggers • Parsers • Language Modeling • Back-off N-gram Language Model
Natural Language Toolkit • Created as part of a computational linguistics course in the Department of Computer and Information Science, University of Pennsylvania (2001). • The Natural Language Toolkit (NLTK) can be used as: • A teaching tool • An individual study tool • A platform for prototyping and building research systems • NLTK is organized as a flat hierarchy of packages and modules
NLTK contents: • Python modules • Tutorials • Problem sets • Reference documentation • Technical documentation • Current NLTK modules: • Basic operations such as tokenization and tree structures • Tagging • Parsing • Visualization Useful link: http://nltk.sourceforge.net/
Stanford Log-linear Part-of-speech Tagger • By Kristina Toutanova • Maximum entropy based POS tagger • Java implementation • Two trained tagger models for English, using the Penn Treebank tag set Link to download software: http://nlp.stanford.edu/software/tagger.shtml References: • Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), Hong Kong. • Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pages 252-259.
TreeTagger • Developed at the Institute for Computational Linguistics, University of Stuttgart. • Based on a modified version of the ID3 decision tree algorithm, which is used to estimate tag transition probabilities such as p(NN|DET,ADJ) • Taggers available for English, German, French, Italian, Spanish, and Greek Download link: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
Brill Tagger • Developed by Eric Brill • Transformation-based POS tagger • The tagger first assigns each word its most likely tag • Rules are then learned that use contextual cues to correct these initial tags and improve tagging accuracy Download link: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z Reference: • Eric Brill. Some advances in rule-based part of speech tagging. Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA, 1994.
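The two steps on this slide (most likely tag first, contextual rules second) can be sketched in a few lines of Python. This is only a toy illustration: the lexicon `MOST_LIKELY_TAG` and the single rule in `RULES` are hand-written assumptions, whereas the actual tagger learns both from an annotated corpus.

```python
# Toy sketch of transformation-based (Brill-style) tagging.
# The lexicon and the rule below are hypothetical, hand-written examples;
# the real tagger learns them from training data.

# Assumed lexicon: the most likely tag for each known word.
MOST_LIKELY_TAG = {
    "the": "DT", "can": "MD", "rusted": "VBD", "we": "PRP", "wait": "VB",
}

# Assumed contextual rules: (from_tag, to_tag, required_previous_tag).
RULES = [
    ("MD", "NN", "DT"),  # change MD to NN when the previous tag is DT
]

def initial_tags(words, default="NN"):
    """Step 1: give every word its most likely tag (unknown words get `default`)."""
    return [MOST_LIKELY_TAG.get(w.lower(), default) for w in words]

def apply_rules(tags, rules):
    """Step 2: apply each contextual rule, in order, left to right."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

def tag(words):
    return list(zip(words, apply_rules(initial_tags(words), RULES)))

print(tag("The can rusted".split()))
print(tag("We can wait".split()))
```

In the first sentence "can" starts as a modal (its most likely tag) and the rule retags it as a noun because it follows a determiner; in the second sentence the rule does not fire.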
TnT – Trigrams’n’Tags • A statistical POS tagger • Developed by Thorsten Brants, Saarland University. • Implements the Viterbi algorithm for second-order Markov models • Language models are available for German and English. • The tagger can be adapted to new languages. Useful link: http://www.coli.uni-saarland.de/~thorsten/tnt/
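As a rough illustration of Viterbi decoding for a second-order Markov model, the sketch below keeps one best path per pair of adjacent tags and scores each new tag with p(tag | two previous tags) times p(word | tag). It is not TnT's code: the function name, the toy probability tables, and the fixed probability floor for unseen events are all assumptions made for the example.

```python
import math

def viterbi_trigram(words, tags, trans, emit, floor=1e-12):
    """Most probable tag sequence under a second-order (trigram) tag model.

    trans[(t2, t1, t)] = P(t | t2, t1), emit[(t, w)] = P(w | t).
    "<s>" pads the start; `floor` stands in for real smoothing (an assumption).
    """
    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[(previous_tag, current_tag)] = (log probability, tag path so far)
    best = {("<s>", "<s>"): (0.0, [])}
    for w in words:
        new_best = {}
        for (t2, t1), (score, path) in best.items():
            for t in tags:
                s = score + logp(trans, (t2, t1, t)) + logp(emit, (t, w))
                key = (t1, t)
                if key not in new_best or s > new_best[key][0]:
                    new_best[key] = (s, path + [t])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]

# Hypothetical toy model, for illustration only.
TAGS = ["DT", "NN", "VB"]
TRANS = {
    ("<s>", "<s>", "DT"): 0.8, ("<s>", "<s>", "NN"): 0.2,
    ("<s>", "DT", "NN"): 0.9,
    ("DT", "NN", "VB"): 0.8, ("DT", "NN", "NN"): 0.2,
}
EMIT = {
    ("DT", "the"): 0.9, ("NN", "dog"): 0.5,
    ("NN", "barks"): 0.1, ("VB", "barks"): 0.6,
}

print(viterbi_trigram(["the", "dog", "barks"], TAGS, TRANS, EMIT))
```

Tracking tag pairs rather than single tags is what makes the model second-order: the score of the next tag depends on the two tags before it.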
Stanford Parser • Contributions mainly by Dan Klein, with support code and linguistic grammar development by Christopher Manning • Java implementation of a probabilistic natural language parser. • Online parser link: http://josie.stanford.edu:8080/parser/ Download link: http://www-nlp.stanford.edu/downloads/StanfordParser-2005-07-21.tar.gz Reference: Dan Klein and Christopher D. Manning. 2002. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), December 2002.
SRI Language Modeling Toolkit • SRILM was mainly developed by Andreas Stolcke at the Speech Technology and Research Laboratory, SRI International, CA. • SRILM is a collection of C++ libraries, executable programs, and helper scripts • Main application: statistical language modeling for speech recognition • LMs are based on n-gram statistics Reference: Andreas Stolcke. “SRILM - An Extensible Language Modeling Toolkit”, in Proc. Intl. Conf. Spoken Language Processing, Denver, Colorado, September 2002. Download link: http://www.speech.sri.com/projects/srilm/download.html
Text Analysis and Summarization Tools System Quirk • Text analysis • Word list generation • Indexing • Tracker Download link: http://www.mcs.surrey.ac.uk/SystemQ/ MEAD • Summarization and evaluation tool • Some features of the tool: • Multiple-document summarization • Query-based summarization • Various evaluation methods Download link: http://tangra.si.umich.edu/clair/mead/download/MEAD-3.07.tar.gz
Back-off N-gram Language Model Kham Nguyen (kham@ccs.neu.edu) Spring 2006 Northeastern University
Outline • Statistical language modeling • What is an N-gram language model • Back-off N-gram
What is “statistical language modeling”? • A statistical language model (LM) provides a mechanism for computing the probability of a word sequence: P(<s>, w1, w2, …, wn, </s>) • Used in speech recognition • Also used in machine translation, language identification, etc.
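The sequence probability above is usually computed with the chain rule, one word at a time. The form below is the standard decomposition rather than something taken from the slide; it treats the start symbol as given and writes the end symbol as w_{n+1}, which is a notational assumption. An N-gram model then approximates each factor by truncating the history to the last N-1 words.

```latex
\[
P(\langle s\rangle, w_1, \ldots, w_n, \langle/s\rangle)
  = \prod_{i=1}^{n+1} P(w_i \mid \langle s\rangle, w_1, \ldots, w_{i-1}),
  \qquad w_{n+1} = \langle/s\rangle
\]
```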
N-gram Language Model • The simplest form of an N-gram probability is just the relative frequency of the N-gram
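A relative-frequency estimate divides the count of the full N-gram by the count of its history, that is, its first N-1 words. The sketch below shows this for a tiny made-up corpus; the function names and the corpus are assumptions for illustration only.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count N-grams and their (N-1)-word histories over padded sentences."""
    grams, histories = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
        for i in range(n - 1, len(words)):
            history = tuple(words[i - n + 1:i])
            grams[history + (words[i],)] += 1
            histories[history] += 1
    return grams, histories

def mle_prob(word, history, grams, histories):
    """Relative frequency: count(history + word) / count(history)."""
    history = tuple(history)
    if histories[history] == 0:
        return 0.0  # history never seen: no estimate available
    return grams[history + (word,)] / histories[history]

# Hypothetical toy corpus.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
grams, histories = ngram_counts(corpus, 2)

print(mle_prob("cat", ["the"], grams, histories))  # "the cat" occurs in 2 of 3 uses of "the"
print(mle_prob("ran", ["dog"], grams, histories))  # "dog ran" never occurs
```

The second call returns zero even though "dog ran" is a perfectly plausible word pair, which is the problem that smoothing and backing off are meant to address.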
Back-off N-gram Language Model • Unseen N-grams would otherwise get zero probability; a back-off model handles them by “backing off” to lower-order N-grams • For simplicity, abbreviate the history (w1, w2, …, wn-1) as h, and the shortened history (w2, …, wn-1) as h’ • n=1: unigram • n=2: bigram • n=3: trigram
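Back-off weight • A back-off N-gram uses the N-gram estimate for wi|h when it was seen in training, and otherwise a weighted lower-order estimate for wi|h’ • The probability axiom requires that P(wi|h) sum to 1 over all wi • Define 2 disjoint sets of wi’s: • -BO(wi|h): the set of all wi for which wi|h was seen in the training data, and • BO(wi|h): the set of all wi for which wi|h was unseen in the training data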
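The equations for this slide did not survive in the text, so the block below gives the standard back-off formulation as a reconstruction, not the author's exact notation. The symbols P* (a discounted estimate for seen N-grams) and α(h) (the back-off weight) are assumed names, and the seen set written -BO on the slide appears here with an overline.

```latex
% Back-off N-gram: discounted estimate if seen, weighted lower-order estimate otherwise.
\[
P(w_i \mid h) =
\begin{cases}
  P^{*}(w_i \mid h)          & \text{if } w_i \in \overline{BO}(w_i \mid h) \\
  \alpha(h)\, P(w_i \mid h') & \text{if } w_i \in BO(w_i \mid h)
\end{cases}
\]

% Probability axiom.
\[
\sum_{w_i} P(w_i \mid h) = 1
\]

% Split the sum over the two disjoint sets.
\[
\sum_{w_i \in \overline{BO}} P^{*}(w_i \mid h)
  + \alpha(h) \sum_{w_i \in BO} P(w_i \mid h') = 1
\]

% Solve for the back-off weight; the lower-order distribution also sums to 1.
\[
\alpha(h)
  = \frac{1 - \sum_{w_i \in \overline{BO}} P^{*}(w_i \mid h)}
         {\sum_{w_i \in BO} P(w_i \mid h')}
  = \frac{1 - \sum_{w_i \in \overline{BO}} P^{*}(w_i \mid h)}
         {1 - \sum_{w_i \in \overline{BO}} P(w_i \mid h')}
\]
```

The last form is convenient in practice because it only sums over words seen after h, which is usually a much smaller set than the unseen words.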
Back-off weight (cont.)
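A small numerical sketch can make the back-off weight concrete. The code below builds a back-off bigram model in which seen bigrams get a discounted relative frequency and unseen ones get the back-off weight times the unigram probability, with the weight chosen so that the probabilities for each history sum to 1. The absolute discount `D = 0.5`, the function names, and the toy corpus are assumptions; the slides do not name a discounting scheme.

```python
from collections import Counter

D = 0.5  # assumed absolute discount subtracted from every seen bigram count

def train(sentences):
    unigrams, bigrams, histories = Counter(), Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words[1:])             # words that get predicted
        histories.update(words[:-1])           # words that serve as a history
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams, histories

def p_unigram(w, unigrams):
    return unigrams[w] / sum(unigrams.values())

def p_star(w, h, bigrams, histories):
    """Discounted estimate for a seen bigram (h, w)."""
    return (bigrams[(h, w)] - D) / histories[h]

def alpha(h, unigrams, bigrams, histories):
    """Back-off weight: left-over mass divided by the lower-order mass of unseen words.

    Assumes at least one vocabulary word is unseen after h (otherwise the
    denominator is zero and no back-off is needed).
    """
    seen = [w for w in unigrams if bigrams[(h, w)] > 0]
    left_over = 1.0 - sum(p_star(w, h, bigrams, histories) for w in seen)
    return left_over / (1.0 - sum(p_unigram(w, unigrams) for w in seen))

def p_backoff(w, h, unigrams, bigrams, histories):
    if bigrams[(h, w)] > 0:
        return p_star(w, h, bigrams, histories)
    return alpha(h, unigrams, bigrams, histories) * p_unigram(w, unigrams)

# Hypothetical toy corpus.
unigrams, bigrams, histories = train(["the cat sat", "the cat ran", "the dog sat"])

total = sum(p_backoff(w, "the", unigrams, bigrams, histories) for w in unigrams)
print(alpha("the", unigrams, bigrams, histories), total)
```

The printed total confirms the normalization: the discounted mass taken from seen bigrams is exactly what the back-off weight redistributes over the unseen words.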
Perplexity • The “quality” of an LM is typically measured by its perplexity, or “branching factor” • Roughly, the perplexity is the average number of words that can follow a given history; lower is better
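Perplexity is commonly computed as the exponential of the negative average log probability the model assigns to each word of a test text. The sketch below uses that standard definition; the function name and the example probabilities are made up for illustration.

```python
import math

def perplexity(log_probs):
    """exp(-(1/N) * sum of natural-log word probabilities)."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that always spreads its probability evenly over 10 words
# has a branching factor of 10.
uniform = [math.log(1 / 10)] * 5
print(perplexity(uniform))

# With varying per-word probabilities, perplexity is the inverse geometric mean.
mixed = [math.log(p) for p in (0.5, 0.25, 0.125)]
print(perplexity(mixed))
```

The uniform case shows why perplexity is read as a branching factor: a model that is equally unsure among k words at every step has perplexity k.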
N-gram LM for Speech Recognition • The language model is one of the knowledge sources used in automatic speech recognition (ASR) • Almost all state-of-the-art ASR systems use a back-off N-gram LM, typically with N = 3 (a trigram model)