HW7 Extracting Arguments for % Ang Sun asun@cs.nyu.edu March 25, 2012
Outline • File Format • Training • Generating Training Examples • Extracting Features • Training of MaxEnt Models • Decoding • Scoring
File Format • Statistics Canada said service-industry <ARG1> output </ARG1> in August <SUPPORT> rose </SUPPORT> 0.4 <PRED class="PARTITIVE-QUANT"> % </PRED> from July .
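As a concrete illustration, the tags can be pulled out of each line with a small parser. This is a minimal sketch in Python, assuming one annotated sentence per line; parse_line and the regular expression are hypothetical helpers for illustration, not part of the assignment code.

```python
import re

# Matches <ARG1>...</ARG1>, <SUPPORT>...</SUPPORT>, and <PRED class="...">...</PRED>
TAG_RE = re.compile(r'<(ARG1|SUPPORT|PRED)(?:\s+class="([^"]*)")?>\s*(.*?)\s*</\1>')

def parse_line(line):
    """Return the whitespace tokens of the sentence and a dict of annotated spans."""
    spans = {}
    for tag, pred_class, text in TAG_RE.findall(line):
        spans[tag] = {"text": text, "class": pred_class or None}
    # Strip the tags, then tokenize on whitespace for feature extraction
    tokens = re.sub(r'</?\w+[^>]*>', ' ', line).split()
    return tokens, spans

tokens, spans = parse_line(
    'Statistics Canada said service-industry <ARG1> output </ARG1> in August '
    '<SUPPORT> rose </SUPPORT> 0.4 <PRED class="PARTITIVE-QUANT"> % </PRED> from July .')
# spans["ARG1"]["text"] == "output"; spans["PRED"]["class"] == "PARTITIVE-QUANT"
```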
Generating Training Examples • Positive Example • Only one positive example per sentence • The token annotated as ARG1
Generating Training Examples • Negative Examples • Two methods! • Method 1: consider any token that has one of the following POSs (with counts): NN 1150, NNS 905, NNP 205, JJ 25, PRP 24, CD 21, DT 16, NNPS 13, VBG 2, FW 1, IN 1, RB 1, VBZ 1, WDT 1, WP 1 • Too many negative examples!
Generating Training Examples • Negative Examples • Two methods! • Method 2: only consider head tokens
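Putting the two pieces together, example generation for one sentence might look like the sketch below (Python; the function name and signature are hypothetical). Method 1's POS filter is shown, and the single ARG1 token supplies the positive example; Method 2 would replace the POS filter with a check that the token is the head of its chunk.

```python
# POS tags accepted as ARG1 candidates under Method 1
CANDIDATE_POS_TAGS = {"NN", "NNS", "NNP", "JJ", "PRP", "CD", "DT", "NNPS",
                      "VBG", "FW", "IN", "RB", "VBZ", "WDT", "WP"}

def generate_examples(tokens, pos_tags, arg1_index, pred_index):
    """Return (token_index, label) pairs for one training sentence."""
    examples = [(arg1_index, "Y")]            # the single positive example: the ARG1 token
    for i, pos in enumerate(pos_tags):
        if i in (arg1_index, pred_index):
            continue
        if pos in CANDIDATE_POS_TAGS:         # Method 1: keep tokens with these POS tags
            examples.append((i, "N"))
    return examples
```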
Extracting Features • Features extracted for the example sentence above (candidate token = output, PRED = %):
• f: candToken=output
• f: tokenBeforeCand=service-industry
• f: tokenAfterCand=in
• f: tokensBetweenCandPRED=in_August_rose_0.4
• f: numberOfTokensBetween=4
• f: existVerbBetweenCandPred=true
• f: existSUPPORTBetweenCandPred=true
• f: candTokenPOS=NN
• f: posBeforeCand=NN
• f: posAfterCand=IN
• f: possBetweenCandPRED=IN_NNP_VBD_CD
• f: BIOChunkChain=I-NP_B-PP_B-NP_B-VP_B-NP_I-NP
• f: chunkChain=NP_PP_NP_VP_NP
• f: candPredInSameNP=False
• f: candPredInSameVP=False
• f: candPredInSamePP=False
• f: shortestPathBetweenCandPred=NP_NP-SBJ_S_VP_NP-EXT
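The sketch below computes a handful of these features from the tokens and POS tags (Python; feature names follow the slides, the signature is hypothetical). It assumes the candidate token appears before the PRED token, as in the example sentence; the chunk- and parse-based features would need the chunk tags and parse tree as additional inputs.

```python
def extract_features(tokens, pos_tags, cand, pred, support=None):
    """Sketch of a few of the features listed above for one candidate token."""
    between_tokens = tokens[cand + 1:pred]
    between_pos = pos_tags[cand + 1:pred]
    feats = {
        "candToken": tokens[cand],
        "tokenBeforeCand": tokens[cand - 1] if cand > 0 else "BOS",
        "tokenAfterCand": tokens[cand + 1] if cand + 1 < len(tokens) else "EOS",
        "tokensBetweenCandPRED": "_".join(between_tokens),
        "numberOfTokensBetween": len(between_tokens),
        "existVerbBetweenCandPred": any(p.startswith("VB") for p in between_pos),
        "existSUPPORTBetweenCandPred": support is not None and cand < support < pred,
        "candTokenPOS": pos_tags[cand],
        "posBeforeCand": pos_tags[cand - 1] if cand > 0 else "BOS",
        "posAfterCand": pos_tags[cand + 1] if cand + 1 < len(tokens) else "EOS",
        "possBetweenCandPRED": "_".join(between_pos),
    }
    return feats
```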
Training of MaxEnt Model • Each training example is one line of features ending with its class label • candToken=output . . . . . class=Y • candToken=Canada . . . . . class=N • Put all examples in one file, the training file • Use the MaxEnt wrapper or the program you wrote in HW5 to train your relation extraction model
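A sketch of writing the training file in the format shown above (one example per line, features first, class=Y/N last). The function and file names are illustrative; your MaxEnt wrapper from HW5 consumes the resulting file unchanged.

```python
def write_training_file(examples, path="arg1_train.dat"):
    """examples: iterable of (feature_dict, label) pairs, label in {"Y", "N"}."""
    with open(path, "w") as out:
        for feats, label in examples:
            cols = ["%s=%s" % (k, v) for k, v in feats.items()]
            out.write(" ".join(cols) + " class=%s\n" % label)
```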
Decoding • For each sentence • Generate testing examples as you did for training • One feature line per example (without the class=Y/N label) • Apply your trained model to each of the testing examples • Choose the example with the highest probability returned by your model as the ARG1 • So there must be exactly one ARG1 for each sentence
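Decoding then reduces to an argmax over the candidate examples of one sentence. In the sketch below, model.prob_of_Y is a hypothetical stand-in for however your trained MaxEnt model returns the probability of the Y class for one feature line.

```python
def decode_sentence(candidates, model):
    """candidates: list of (token_index, feature_line); return the predicted ARG1 index."""
    best_index, best_prob = None, -1.0
    for token_index, feature_line in candidates:
        p = model.prob_of_Y(feature_line)      # hypothetical: P(class=Y | features)
        if p > best_prob:
            best_index, best_prob = token_index, p
    return best_index                          # exactly one ARG1 per sentence
```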
Scoring • Since you are required to tag exactly one ARG1 for each sentence • Your system will be evaluated by accuracy • Accuracy = #correct_ARG1s / #sentences
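Equivalently, as a small sketch (assuming one gold and one predicted ARG1 token index per sentence):

```python
def accuracy(gold_arg1s, predicted_arg1s):
    """Accuracy = #correct_ARG1s / #sentences."""
    correct = sum(1 for g, p in zip(gold_arg1s, predicted_arg1s) if g == p)
    return correct / len(gold_arg1s)
```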