I256: Applied Natural Language Processing Marti Hearst Sept 25, 2006
Shallow Parsing
• Break text up into non-overlapping contiguous subsets of tokens.
• Also called chunking, partial parsing, light parsing.
• What is it useful for?
  • Entity recognition
    • people, locations, organizations
  • Studying linguistic patterns
    • gave NP
    • gave up NP in NP
    • gave NP NP
    • gave NP to NP
  • Can ignore complex structure when not relevant
A Relationship between Segmenting and Labeling • Tokenization segments the text • Tagging labels the text • Shallow parsing does both simultaneously.
Chunking vs. Full Syntactic Parsing “G.K. Chesterton, author of The Man Who Was Thursday”
Representations for Chunks • IOB tags • Inside, outside, and begin • Why do we need a begin tag?
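One way to answer the question on this slide is with a small sketch. The helpers below (`spans_to_iob` and `iob_to_spans`, hypothetical names that are not part of nltk_lite) convert between chunk spans and IOB tags. With only inside and outside tags, two adjacent chunks would look exactly like one long chunk; the begin tag keeps them apart.

```python
def spans_to_iob(n_tokens, spans, label="NP"):
    """Convert chunk spans [(start, end), ...] (end exclusive) into IOB tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        tags[start] = "B-" + label          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the chunk
    return tags


def iob_to_spans(tags):
    """Recover chunk spans from well-formed IOB tags."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        # An open chunk ends when we see O or a new B tag.
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i))
            start = None
        if tag.startswith("B-"):
            start = i
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# "gave NP NP": two adjacent noun phrases with no token between them.
tokens = ["gave", "Mary", "the", "book"]
two_chunks = spans_to_iob(len(tokens), [(1, 2), (2, 4)])
one_chunk = spans_to_iob(len(tokens), [(1, 4)])

print(two_chunks)  # ['O', 'B-NP', 'B-NP', 'I-NP']
print(one_chunk)   # ['O', 'B-NP', 'I-NP', 'I-NP']


def drop_begin(tags):
    """Collapse B and I into a single 'inside' tag."""
    return ["O" if tag == "O" else "I" for tag in tags]


# Without the begin tag the two analyses become identical.
print(drop_begin(two_chunks) == drop_begin(one_chunk))  # True
```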
Representations for Chunks • Trees • Chunk structure is a two-level tree that spans the entire text, containing both chunks and non-chunks
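A rough sketch of the two-level tree idea, using plain Python tuples rather than the nltk_lite Tree class. The function names and the tuple layout are assumptions made for illustration: the root holds a mix of chunk subtrees and bare (word, tag) leaves, and reading the leaves left to right gives back the whole text.

```python
def iob_to_tree(tagged, root="S"):
    """Build a two-level chunk tree from (word, pos, iob) triples.

    The tree is a (label, children) pair. Each child is either a
    (word, pos) leaf for a non-chunk token, or a (chunk_label, leaves)
    subtree. Input is assumed to be well-formed IOB.
    """
    children = []
    in_chunk = False
    for word, pos, iob in tagged:
        if iob == "O":
            children.append((word, pos))                # non-chunk token
            in_chunk = False
        elif iob.startswith("B-") or not in_chunk:
            children.append((iob[2:], [(word, pos)]))   # open a new chunk
            in_chunk = True
        else:
            children[-1][1].append((word, pos))         # extend the open chunk
    return (root, children)


def leaves(tree):
    """Flatten the tree back into the full token sequence."""
    out = []
    for child in tree[1]:
        if isinstance(child[1], list):   # chunk subtree
            out.extend(child[1])
        else:                            # bare token
            out.append(child)
    return out


sentence = [
    ("the", "DT", "B-NP"), ("little", "JJ", "I-NP"), ("cat", "NN", "I-NP"),
    ("sat", "VBD", "O"), ("on", "IN", "O"),
    ("the", "DT", "B-NP"), ("mat", "NN", "I-NP"),
]
tree = iob_to_tree(sentence)
print(tree)
```

Both the chunks (the two NP subtrees) and the non-chunks ("sat", "on") hang directly off the root, which is what makes the tree exactly two levels deep.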
CONLL Collection • From the shared-task competition of the 2000 Conference on Computational Natural Language Learning (CoNLL-2000) • Goal: create machine learning methods to improve on the chunking task
CONLL Collection
• Data in IOB format from the Wall Street Journal (WSJ):
  • Word POS-tag IOB-tag
  • Training set: 8936 sentences
  • Test set: 2012 sentences
• Tags from the Brill tagger
  • Penn Treebank Tags
• Evaluation measure: F-score
  • 2*precision*recall / (recall+precision)
  • Baseline: select the chunk tag most frequently associated with each POS tag, F = 77.07
  • Best score in the contest was F = 94.13
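The F-score formula and the baseline described on this slide can be sketched in a few lines. Everything below is illustrative: the function names are hypothetical, chunks are assumed to be (start, end, label) tuples, and the toy data is made up, so the numbers have nothing to do with the contest scores.

```python
from collections import Counter, defaultdict


def f_score(gold_chunks, guessed_chunks):
    """F = 2 * precision * recall / (recall + precision) over sets of chunks."""
    gold, guessed = set(gold_chunks), set(guessed_chunks)
    correct = len(gold & guessed)       # a chunk counts only if span and label match
    if correct == 0:
        return 0.0
    precision = correct / len(guessed)
    recall = correct / len(gold)
    return 2 * precision * recall / (recall + precision)


def train_baseline(tagged_sentences):
    """Map each POS tag to the IOB chunk tag it most often carries in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for _word, pos, iob in sentence:
            counts[pos][iob] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def predict(model, pos_tags):
    """Tag each POS with its most frequent chunk tag; unseen POS tags get 'O'."""
    return [model.get(pos, "O") for pos in pos_tags]


# Toy evaluation: two of three guessed chunks are exactly right.
gold = {(0, 3, "NP"), (3, 4, "VP"), (5, 7, "NP")}
guessed = {(0, 3, "NP"), (3, 4, "VP"), (6, 7, "NP")}
score = f_score(gold, guessed)
print(round(score, 3))  # 0.667

# Toy baseline: (word, POS, IOB) triples invented for this sketch.
training = [
    [("the", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("sat", "VBD", "B-VP")],
    [("dogs", "NNS", "B-NP"), ("chase", "VBP", "B-VP"),
     ("the", "DT", "B-NP"), ("cat", "NN", "I-NP")],
    [("cat", "NN", "B-NP"), ("food", "NN", "I-NP")],
]
model = train_baseline(training)
print(predict(model, ["DT", "NN", "VBD"]))  # ['B-NP', 'I-NP', 'B-VP']
```

Note that precision and recall here are computed over whole chunks, not over individual tags, so a chunk with one wrong boundary counts as both a miss and a false alarm.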
nltk_lite and CONLL2000 Note that raw hides the IOB format
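To make the hidden format concrete, here is a minimal sketch that parses three-column "Word POS-tag IOB-tag" text directly. It does not use nltk_lite; `read_conll` and `raw_words` are hypothetical helpers, the sample is a short illustrative snippet in the style of the data, and `raw_words` is only a guess at what a words-only view looks like.

```python
SAMPLE = """\
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
. . O
"""


def read_conll(text):
    """Parse 'word POS-tag IOB-tag' lines; blank lines separate sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, iob = line.split()
        current.append((word, pos, iob))
    if current:
        sentences.append(current)
    return sentences


def raw_words(sentence):
    """Keep only the words, dropping the POS and IOB columns."""
    return [word for word, _pos, _iob in sentence]


sentences = read_conll(SAMPLE)
print(sentences[0][:2])          # [('He', 'PRP', 'B-NP'), ('reckons', 'VBZ', 'B-VP')]
print(raw_words(sentences[0]))   # ['He', 'reckons', 'the', 'current', 'account', 'deficit', '.']
```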
nltk_lite and CONLL2000 pp() stands for “pretty print” and applies to the Tree data structure.
Chunking with Regular Expressions
• This time we write regexes over TAGS rather than words
  • <DT><JJ>?<NN>
  • <NN.*>
  • <JJ|NN>+
• Compile them with parse.ChunkRule()
  • rule = parse.ChunkRule('<DT|NN>+')
  • chunkparser = parse.RegexpChunk([rule], chunk_node='NP')
• Resulting object is of type Tree
  • Top-level node called S
  • Can change this label if you want, in the third argument to RegexpChunk
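To show what "regexes over tags" means mechanically, here is a standalone sketch in plain Python. It is not how nltk_lite implements ChunkRule; `tag_pattern_to_regex` and `regexp_chunk` are hypothetical names, and the approach (join the tags into one string, run an ordinary regex over it, map matches back to tokens) is just one plausible way to do it.

```python
import re


def tag_pattern_to_regex(pattern):
    """Translate a tag pattern such as '<DT><JJ>?<NN.*>' into a Python regex
    that runs over a tag string of the form '<DT><JJ><NN>'."""
    def convert(match):
        # Inside a tag, '.' must not run across a tag boundary.
        inner = match.group(1).replace(".", "[^<>]")
        return "(?:<(?:" + inner + ")>)"
    return re.compile(re.sub(r"<([^<>]+)>", convert, pattern))


def regexp_chunk(tagged, pattern, label="NP"):
    """Group (word, tag) pairs whose tag sequence matches `pattern`
    into (label, [tokens]) chunks; other tokens stay as they are."""
    regex = tag_pattern_to_regex(pattern)
    tag_string = "".join("<%s>" % tag for _word, tag in tagged)

    # Map character offsets in tag_string back to token indices.
    offsets, pos = {}, 0
    for i, (_word, tag) in enumerate(tagged):
        offsets[pos] = i
        pos += len(tag) + 2
    offsets[pos] = len(tagged)

    result, done = [], 0
    for m in regex.finditer(tag_string):
        if m.start() == m.end():
            continue                                  # ignore empty matches
        start, end = offsets[m.start()], offsets[m.end()]
        result.extend(tagged[done:start])             # tokens outside any chunk
        result.append((label, tagged[start:end]))     # the matched chunk
        done = end
    result.extend(tagged[done:])
    return result


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
chunked = regexp_chunk(sentence, "<DT><JJ>?<NN>")
print(chunked)
```

Because the pattern is written over whole tags, `<NN>` does not accidentally match inside `<NNS>`; to cover all noun tags you would write `<NN.*>` instead.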
Chunking with Regular Expressions • Rule application is sensitive to order
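A small sketch of why order matters, under the assumption that a token chunked by an earlier rule is no longer available to later rules. The code is plain Python rather than nltk_lite, `chunk_with_rules` is a hypothetical helper, and the rules are written directly as Python regexes over a tag string (so `(<NN>)+` plays the role of the tag pattern `<NN>+`).

```python
import re


def chunk_with_rules(tagged, rules, label="NP"):
    """Apply chunk rules in order; tokens chunked by an earlier rule are
    unavailable to later rules."""
    chunk_of = [None] * len(tagged)     # chunk id per token, None if unchunked
    next_id = 0
    for rule in rules:
        # Already-chunked tokens become '{}' so no tag pattern can match them.
        units = ["<%s>" % tag if chunk_of[i] is None else "{}"
                 for i, (_word, tag) in enumerate(tagged)]
        starts, pos = {}, 0
        for i, unit in enumerate(units):
            starts[pos] = i
            pos += len(unit)
        starts[pos] = len(tagged)
        for m in re.finditer(rule, "".join(units)):
            if m.end() > m.start():
                for i in range(starts[m.start()], starts[m.end()]):
                    chunk_of[i] = next_id
                next_id += 1

    # Assemble the result: chunks as (label, [tokens]), the rest as tokens.
    result = []
    for i, token in enumerate(tagged):
        if chunk_of[i] is None:
            result.append(token)
        elif i > 0 and chunk_of[i - 1] == chunk_of[i]:
            result[-1][1].append(token)
        else:
            result.append((label, [token]))
    return result


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN")]
full_np = r"(<DT>)?(<JJ>)*(<NN>)"     # optional determiner, adjectives, noun
bare_noun = r"(<NN>)+"                # one or more nouns

full_first = chunk_with_rules(sentence, [full_np, bare_noun])
noun_first = chunk_with_rules(sentence, [bare_noun, full_np])
print(full_first)   # [('NP', [('the', 'DT'), ('little', 'JJ'), ('cat', 'NN')])]
print(noun_first)   # [('the', 'DT'), ('little', 'JJ'), ('NP', [('cat', 'NN')])]
```

With the broader rule first, the whole phrase becomes one chunk. With the bare-noun rule first, "cat" is claimed on its own and the broader rule can no longer find a noun to anchor on.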
Chinking • Specify what does not go into a chunk. • Analogous to defining punctuation as whatever is neither alphanumeric nor whitespace. • Can be more difficult to think about.
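As a rough illustration of chinking, the sketch below starts with every token inside one chunk and then removes the tokens whose tags match a chink pattern; whatever is left over forms the chunks. This is a simplified per-token version written in plain Python, not the nltk_lite chink rule, and `chink` is a hypothetical helper name.

```python
import re


def chink(tagged, chink_tags, label="NP"):
    """Start with every token in one chunk, then remove (chink) tokens whose
    tag fully matches the regex `chink_tags`; what remains forms the chunks."""
    result, current = [], []
    for word, tag in tagged:
        if re.fullmatch(chink_tags, tag):
            if current:                      # close the chunk before the chink
                result.append((label, current))
                current = []
            result.append((word, tag))       # chinked token sits outside any chunk
        else:
            current.append((word, tag))
    if current:
        result.append((label, current))
    return result


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# Say what is NOT in a noun phrase: verbs and prepositions.
chinked = chink(sentence, r"VB.*|IN")
print(chinked)
```

Here two tags to exclude are easier to list than every tag sequence that can form a noun phrase, though the result depends on remembering every tag that should be left out.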
Practice Regexp Chunking • Write rules to be able to produce this kind of chunking:
Next Time • Evaluating Shallow Parsing • Begin Text Summarization