Tools for Natural Language Processing Applications

Guruprasad Saikumar & Kham Nguyen

OUTLINE
• Natural Language Toolkit
• Part of Speech Taggers
• Parsers
• Language Modeling
• Back-off N-gram Language Model

Natural Language Toolkit
• Created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science, University of Pennsylvania.
• The Natural Language Toolkit (NLTK) can be used as a teaching tool, an individual study tool, and a platform for prototyping and building research systems.
• NLTK is organized as a flat hierarchy of packages and modules.

NLTK contents:
• Python modules
• Tutorials
• Problem sets
• Reference documentation
• Technical documentation
Current NLTK modules:
• Basic operations such as tokenization and tree structures
• Tagging
• Parsing
• Visualization
Useful link: http://nltk.sourceforge.net/

Part of Speech Taggers

Stanford Log-linear Part-of-Speech Tagger
• By Kristina Toutanova
• Maximum entropy based POS tagger
• Java implementation
• Two trained tagger models for English, using the Penn Treebank tag set
Download link: http://nlp.stanford.edu/software/tagger.shtml
References:
• Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), Hong Kong.
• Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pages 252-259.

TreeTagger
• Developed at the Institute for Computational Linguistics of the University of Stuttgart.
• Based on a modified version of the ID3 decision tree algorithm.
• The decision tree estimates tag transition probabilities such as p(NN|DET,ADJ).
• Taggers available for English, German, French, Italian, Spanish, and Greek.
Download link: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

Brill Tagger
• Developed by Eric Brill
• Transformation-based POS tagger
• The rule-based tagger first assigns each word its most likely tag
• Rules that use contextual cues are then learned to correct these initial tags and improve tagging accuracy
Download link: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
Reference:
• Eric Brill. 1994. Some Advances in Rule-Based Part of Speech Tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA.
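The two stages described above (most-likely-tag baseline, then contextual correction) can be sketched in a few lines of Python. This is only an illustrative sketch, not the Brill tagger's code: the function names, the toy training data, and the single hand-written rule "change NN to VB after TO" are assumptions, and the rule-learning step itself is omitted.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding tag is prev_tag."""
    return [
        to_tag if i > 0 and tag == from_tag and tags[i - 1] == prev_tag else tag
        for i, tag in enumerate(tags)
    ]

def tag(words, baseline, rules, default="NN"):
    """Stage 1: most likely tag per word. Stage 2: apply the rules in order."""
    tags = [baseline.get(word, default) for word in words]
    for from_tag, to_tag, prev_tag in rules:
        tags = apply_rule(tags, from_tag, to_tag, prev_tag)
    return tags

# Toy data (hypothetical): "race" is a noun twice and a verb once.
training = [
    [("the", "DT"), ("race", "NN")],
    [("a", "DT"), ("race", "NN")],
    [("to", "TO"), ("race", "VB")],
]
baseline = train_baseline(training)
rules = [("NN", "VB", "TO")]  # hand-written here; Brill's tagger learns such rules
print(tag(["to", "race"], baseline, rules))  # ['TO', 'VB']
```

The baseline alone would tag "race" as NN in "to race"; the contextual rule is what repairs it.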

TnT: Trigrams'n'Tags
• A statistical POS tagger
• Developed by Thorsten Brants, Saarland University.
• Implementation of the Viterbi algorithm for second-order Markov models
• Language models available for German and English.
• The tagger can be adapted to new languages.
Useful link: http://www.coli.uni-saarland.de/~thorsten/tnt/
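To make the Viterbi step concrete, here is a minimal sketch in Python. It is deliberately simplified to a first-order model (each tag conditioned on one previous tag), whereas TnT uses a second-order model (two previous tags); the function name, the dictionary-based probability tables, and the toy numbers are all assumptions for illustration and have nothing to do with TnT's actual implementation.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence under a first-order HMM (log space)."""
    def lp(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t] = (log-probability of the best path ending in tag t, backpointer)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + lp(trans_p[p].get(t, 0)), p) for p in tags
            )
            column[t] = (score + lp(emit_p[t].get(words[i], 0)), prev)
        best.append(column)

    # Follow the backpointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

# Toy model (hypothetical probabilities).
tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.8, "NN": 0.1, "VB": 0.1}
trans_p = {
    "DT": {"NN": 0.9, "VB": 0.1},
    "NN": {"VB": 0.6, "NN": 0.3, "DT": 0.1},
    "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 1.0},
    "NN": {"dog": 0.6, "barks": 0.4},
    "VB": {"barks": 0.7, "dog": 0.3},
}
print(viterbi(["the", "dog", "barks"], tags, start_p, trans_p, emit_p))
# ['DT', 'NN', 'VB']
```

Extending this to a second-order model would mean keeping one table entry per pair of adjacent tags rather than per single tag.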

Parsers

Stanford Parser
• Contributions mainly by Dan Klein, with support code and linguistic grammar development by Christopher Manning
• Java implementation of a probabilistic natural language parser
Online parser link: http://josie.stanford.edu:8080/parser/
Download link: http://www-nlp.stanford.edu/downloads/StanfordParser-2005-07-21.tar.gz
Reference:
• Dan Klein and Christopher D. Manning. 2002. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), December 2002.

Language Modeling

SRI Language Modeling Toolkit
• SRILM was mainly developed by Andreas Stolcke, Speech Technology and Research Laboratory, SRI International, CA.
• SRILM is a collection of C++ libraries, executable programs, and helper scripts
• Main application: statistical language modeling for speech recognition
• LMs are based on n-gram statistics
Reference:
• Andreas Stolcke. 2002. SRILM - An Extensible Language Modeling Toolkit. In Proc. Intl. Conf. Spoken Language Processing, Denver, Colorado, September 2002.
Download link: http://www.speech.sri.com/projects/srilm/download.html

Text Analysis and Summarization Tools

System Quirk
• Text analysis
• Word list generation
• Indexing
• Tracker
Download link: http://www.mcs.surrey.ac.uk/SystemQ/

MEAD
• Summarization and evaluation tool
• Features include multiple-document summarization, query-based summarization, and various evaluation methods
Download link: http://tangra.si.umich.edu/clair/mead/download/MEAD-3.07.tar.gz

Back-off N-gram Language Model Kham Nguyen (kham@ccs.neu.edu) Spring 2006 Northeastern University

Outline
• Statistical language modeling
• What is an N-gram language model
• Back-off N-gram

What is “statistical language modeling”?
• A statistical language model (LM) provides a mechanism for computing the probability of a word sequence: P(<s>, w1, w2, …, wn, </s>)
• Used in speech recognition
• Also used in machine translation, language identification, etc.
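As a worked equation, the sentence probability above can be expanded with the chain rule and then approximated by conditioning each word on only its N-1 predecessors. The indexing conventions here are assumptions made for the sketch: the start symbol <s> is taken as given (probability 1), w_{n+1} stands for </s>, and any w_j with j ≤ 0 stands for <s>.

```latex
\begin{align*}
P(\langle s\rangle, w_1, \dots, w_n, \langle/s\rangle)
  &= \prod_{i=1}^{n+1} P(w_i \mid \langle s\rangle, w_1, \dots, w_{i-1}),
  \qquad w_{n+1} = \langle/s\rangle \\
  &\approx \prod_{i=1}^{n+1} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
\end{align*}
```

The first line is exact; the second line is the N-gram approximation.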

N-gram Language Model
• The simplest form of an N-gram probability is just the relative frequency of the N-gram: the count of the N-gram divided by the count of its (N-1)-word history
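A minimal Python sketch of the relative-frequency estimate, P(wn | w1, …, wn-1) = C(w1, …, wn) / C(w1, …, wn-1), follows. The function names, the <s>/</s> padding scheme, and the toy sentences are assumptions made for illustration.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams and their (n-1)-word histories, padding with <s> and </s>."""
    grams, histories = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + list(words) + ["</s>"]
        for i in range(len(padded) - n + 1):
            gram = tuple(padded[i:i + n])
            grams[gram] += 1
            histories[gram[:-1]] += 1
    return grams, histories

def relative_frequency(gram, grams, histories):
    """Relative-frequency estimate of the last word of gram given its history."""
    history = tuple(gram[:-1])
    if histories[history] == 0:
        return 0.0  # history never seen in training
    return grams[tuple(gram)] / histories[history]

sentences = [["the", "cat"], ["the", "dog"]]
grams, histories = ngram_counts(sentences, n=2)
print(relative_frequency(("the", "cat"), grams, histories))   # 0.5
print(relative_frequency(("the", "bird"), grams, histories))  # 0.0
```

The second call shows the weakness of the plain estimate: any N-gram absent from the training data gets probability zero.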

Back-off N-gram Language Model
• Handles N-grams unseen in training by “backing off” to lower-order N-grams
• For simplicity, abbreviate the history (w1, w2, …, wn-1) as h, and the shortened history (w2, …, wn-1) as h’
• n=1: unigram
• n=2: bigram
• n=3: trigram

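Back-off weight
• A back-off N-gram uses the N-gram estimate of wi given h when wi|h was seen in training, and otherwise a back-off weight for h times the lower-order probability of wi given h’
• The probability axiom requires the probabilities of all wi given h to sum to 1
• Define 2 disjoint sets of wi’s:
• ¬BO(wi|h), the complement of BO: the set of all wi for which wi|h was seen in the training data, and
• BO(wi|h): the set of all wi for which wi|h was unseen in the training data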
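The definitions on this slide can be written out as follows. This is a reconstruction under stated assumptions rather than the slide's own equations: P* denotes a discounted estimate for seen N-grams, α(h) denotes the back-off weight of history h, and the set of seen words (the complement of BO) is written with an overline.

```latex
\[
P_{\mathrm{BO}}(w_i \mid h) =
\begin{cases}
P^{*}(w_i \mid h) & \text{if } w_i \in \overline{\mathrm{BO}}(w_i \mid h) \\[4pt]
\alpha(h)\, P_{\mathrm{BO}}(w_i \mid h') & \text{if } w_i \in \mathrm{BO}(w_i \mid h)
\end{cases}
\]
\[
\sum_{w_i} P_{\mathrm{BO}}(w_i \mid h) = 1
\;\Longrightarrow\;
\sum_{w_i \in \overline{\mathrm{BO}}(w_i \mid h)} P^{*}(w_i \mid h)
\;+\; \alpha(h) \sum_{w_i \in \mathrm{BO}(w_i \mid h)} P_{\mathrm{BO}}(w_i \mid h') = 1
\]
```

Because the two sets are disjoint and together cover the vocabulary, the sum over all words splits cleanly into a seen part and an unseen part.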

Back-off weight (cont.)
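Requiring the probabilities for a history h to sum to 1 gives the back-off weight as the probability mass left over by the seen words, divided by the lower-order mass of the unseen words: α(h) = (1 − Σ_seen P*(w|h)) / (1 − Σ_seen P(w|h’)). The Python sketch below computes this for a bigram model backing off to unigrams. Absolute discounting with a fixed discount of 0.5 is an assumption chosen to keep the example short (the slides do not name a discounting method), and all function and variable names are hypothetical.

```python
from collections import Counter

def train_backoff_bigram(sentences, discount=0.5):
    """Return (bigram_prob, unigram_prob, alpha) for a back-off bigram model."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(padded[1:])              # <s> is never predicted
        bigrams.update(zip(padded, padded[1:]))
    total = sum(unigrams.values())
    unigram_prob = {w: c / total for w, c in unigrams.items()}

    history_count = Counter()
    for (h, _), c in bigrams.items():
        history_count[h] += c
    # Discounted estimate P*(w|h) for seen bigrams (absolute discounting, assumed).
    bigram_prob = {(h, w): (c - discount) / history_count[h]
                   for (h, w), c in bigrams.items()}

    alpha = {}
    for h in history_count:
        seen = [w for (h2, w) in bigrams if h2 == h]
        left_over = 1.0 - sum(bigram_prob[(h, w)] for w in seen)
        unseen_mass = 1.0 - sum(unigram_prob[w] for w in seen)
        alpha[h] = left_over / unseen_mass if unseen_mass > 0 else 0.0
    return bigram_prob, unigram_prob, alpha

def backoff_prob(w, h, bigram_prob, unigram_prob, alpha):
    """P(w|h): discounted bigram if seen, else alpha(h) times the unigram."""
    if (h, w) in bigram_prob:
        return bigram_prob[(h, w)]
    return alpha.get(h, 1.0) * unigram_prob.get(w, 0.0)

sentences = [["the", "cat"], ["the", "dog"], ["a", "cat"]]
bigram_prob, unigram_prob, alpha = train_backoff_bigram(sentences)
total = sum(backoff_prob(w, "the", bigram_prob, unigram_prob, alpha)
            for w in unigram_prob)
print(round(total, 6))  # 1.0
```

The final check sums P(w | "the") over the whole vocabulary, confirming that the back-off weight restores a proper probability distribution.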

Perplexity
• The “quality” of an LM is typically measured by its perplexity, or “branching factor”
• Roughly, the perplexity is the average number of words that can follow a history; a lower perplexity indicates a better model
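Perplexity is commonly computed as the exponential of the negative average log-probability the model assigns to each word of a test text, PP = exp(−(1/N) Σ log P(wi | hi)). The Python sketch below assumes the per-word probabilities have already been obtained from some LM; the function name and inputs are hypothetical.

```python
import math

def perplexity(log_probs):
    """Perplexity from a list of natural-log word probabilities log P(w_i | h_i)."""
    if not log_probs:
        raise ValueError("need at least one word probability")
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that always spreads its probability evenly over 10 words
# has a branching factor, and hence a perplexity, of 10.
uniform = [math.log(1 / 10)] * 5
print(round(perplexity(uniform), 6))  # 10.0
```

The uniform case illustrates the “branching factor” reading: the model is as uncertain as if it were choosing among 10 equally likely words at every position.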

N-gram LM for Speech Recognition
• The language model is one of the knowledge sources used in automatic speech recognition (ASR)
• Almost all state-of-the-art ASR systems use a back-off N-gram LM, typically with N = 3 (trigram)
