Natural Language Toolkit

Examples taken from: nltk.sourceforge.net/tutorial/introduction/index.html

Overview
• NLTK is a set of Python modules for carrying out many common natural language tasks.
• Access it at nltk.sourceforge.net.
• There are versions for Windows, OS X, Unix, and Linux. Detailed instructions are under the Installation tab.
• In addition to the toolkit you will need two other modules: tkinter and Numeric. We haven’t been able to get Numeric to install smoothly with Python 2.4 under Windows, only with 2.3.
• You will also want the contrib and data packages.
• Pay attention to what INSTALL.TXT in the data package says about the NLTK_CORPORA path.

Accessing NLTK
• Use the standard Python import command:
    >>> from nltk.corpus import gutenberg
    >>> gutenberg.items()
    ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'blake-songs.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
• Or import the package and use the fully qualified name:
    >>> import nltk.corpus
    >>> nltk.corpus.gutenberg.items()
    ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'blake-songs.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
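The two import forms differ only in which name ends up bound in the current namespace. The sketch below illustrates this with the standard library's os.path in place of nltk.corpus, purely so that it runs without NLTK installed. That substitution is an assumption made for illustration and is not part of the original example.

```python
# Form 1: bind the submodule's short name directly.
from os import path

# Form 2: import the package; the submodule is reached by its full dotted name.
import os.path

# Both names refer to the same module object, so the calls are interchangeable.
same_module = path is os.path
short_form = path.basename("corpora/austen-emma.txt")
long_form = os.path.basename("corpora/austen-emma.txt")

print(same_module)
print(short_form, long_form)
```

The first form keeps later calls short; the second makes it obvious at each call site which package a name comes from.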

Modules
• The NLTK modules include:
  • token: classes for representing and processing individual elements of text, such as words and sentences
  • probability: classes for representing and processing probabilistic information
  • tree: classes for representing and processing hierarchical information over text
  • cfg: classes for representing and processing context-free grammars
  • fsa: finite state automata
  • tagger: tagging each word with a part of speech, a sense, etc.
  • parser: building trees over text (includes chart, chunk and probabilistic parsers)
  • classifier: classifying text into categories (includes feature, featureSelection, maxent, naivebayes)
  • draw: visualizing NLP structures and processes
  • corpus: accessing (tagged) corpus data
• We will cover some of these explicitly as we reach the relevant topics.

One Simple Example
IDLE 1.0.3
>>> from nltk.tokenizer import *
>>> text_token = Token(TEXT='Hello world. This is a test file.')
>>> print text_token
<Hello world. This is a test file.>
>>> WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(text_token)
>>> print text_token
<[<Hello>, <world.>, <This>, <is>, <a>, <test>, <file.>]>
>>> print text_token['TEXT']
Hello world. This is a test file.
>>> print text_token['WORDS']
[<Hello>, <world.>, <This>, <is>, <a>, <test>, <file.>]
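Whitespace tokenization itself is simple enough to sketch in plain Python. The following is a minimal illustration and not NLTK's implementation: the function name whitespace_tokenize is hypothetical, and plain strings stand in for NLTK's Token objects.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace (hypothetical helper, not an NLTK function)."""
    # str.split() with no arguments splits on any run of whitespace and
    # ignores leading and trailing whitespace.
    return text.split()


words = whitespace_tokenize("Hello world. This is a test file.")
print(words)
```

As in the NLTK output above, punctuation stays attached to the neighbouring word ('world.', 'file.') because whitespace is the only delimiter.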

LAB
• Detailed documentation and tutorials are under the Documentation tab at the SourceForge site.
• Work through the “gentle introduction” and “elementary language processing” tutorials for NLTK: nltk.sourceforge.net/tutorial/introduction/index.html
