Conversion of Penn Treebank Data to Text
Penn TreeBank Project “A Bank of Linguistic Trees” (as of 11/1992)
• University of Pennsylvania, LINC Laboratory
• 4.5 million words of American English
• Annotation of naturally occurring text for linguistic structure
Tree Linguistic Components
• Tokenization
  • Punctuation, words, etc. are treated as separate tokens
  • Children’s → Children ’s
• Part-of-speech (POS) tagging
  • Text is first assigned POS tags automatically
  • Human annotators correct the first-pass POS tags
• Bracketing
  • Fidditch, a deterministic parser (Hindle 1983, 1989)
  • Two-stage parsing process made explicit with brackets
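The tokenization convention above (a clitic such as 's split off from Children, punctuation kept as its own token) can be illustrated with a small sketch. This is not the Treebank project's tokenizer; the pattern and the `tokenize` name are assumptions made for illustration, and abbreviations such as S.P.C.A. or numbers such as 67,000 would need extra rules.

```python
import re

# Hypothetical token pattern (an assumption, not the Treebank tool):
# a clitic, a word stem before n't, a plain word, or one punctuation mark.
_TOKEN = re.compile(r"n't\b|'(?:s|re|ve|ll|d|m)\b|\w+(?=n't\b)|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    return _TOKEN.findall(text)

print(tokenize("Children's"))  # ['Children', "'s"]
```

Keeping the clitic as a separate token is what lets the tagger assign it its own POS tag.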
Penn TreeBank: Brown Corpus (as of 11/1992)
• POS tags (tokens): 1,172,041
• Skeletal parsing (tokens): 1,172,041
You know you’re in trouble when …
“0. You will always have a certain amount of error. Sometimes there is just no way to find the head of a phrase, because it is tagged or parsed completely incorrectly. (no big surprise, that)”
Robert MacIntyre, Programmer/Data Manager, Penn Treebank Project
robertm@unagi.cis.upenn.edu
ftp://ftp.cis.upenn.edu/pub/treebank/doc/faq.cd2
Tree Conversion: Clean Case
( END_OF_TEXT_UNIT )
( END_OF_TEXT_UNIT )
( END_OF_TEXT_UNIT )
( (`` ``)
  (S
    (S
      (NP (PRP I) )
      (VP (VBP leave)
        (NP (DT this) (NN church) )
        (PP (IN with)
          (NP (DT a) (NN feeling)
            (SBAR (IN that)
              (S
                (NP (DT a) (JJ great) (NN weight) )
                (AUX (VBZ has) )
                (VP (VBN been)
                  (VP (VBN lifted)
                    (PP (IN off)
                      (NP (PRP$ my) (NN heart) ))))))))))
    (, ,)
    (S
      (NP (PRP I) )
      (AUX (VBP have) )
      (VP
        (VP (VBN left)
          (NP (PRP$ my) (NN grudge) )
          (PP (IN at)
            (NP (DT the) (NN altar) )))
        (CC and)
        (VP (VBN forgiven)
          (NP (PRP$ my) (NN neighbor) )))))
  ('' '') (. .) )
( END_OF_TEXT_UNIT )
cb08_42 ``I leave this church with a feeling that a great weight has been lifted off my heart, I have left my grudge at the altar and forgiven my neighbor''.
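A clean tree like the one above can be turned back into its text line by reading off the (TAG word) leaves and re-attaching punctuation. The following is a minimal sketch under stated assumptions: the function names and the spacing rules (`NO_SPACE_BEFORE`, `NO_SPACE_AFTER`) are illustrative choices, not the conversion scripts used in the project, and the sample tree is an abbreviated fragment of the slide's example.

```python
import re

# Tags whose token attaches to the preceding word (assumed rule set).
NO_SPACE_BEFORE = {",", ".", ":", "''", "POS"}
# Tags whose token attaches to the following word (assumed rule set).
NO_SPACE_AFTER = {"``", "$"}

def leaves(tree):
    """Return (tag, word) pairs for every leaf such as (NN church)."""
    return re.findall(r"\(([^\s()]+) ([^\s()]+)\)", tree)

def to_text(tree):
    """Join the leaves of a bracketed tree into a plain text string."""
    text, prev_tag = "", None
    for tag, word in leaves(tree):
        if tag == "-NONE-":  # null elements carry no surface text
            continue
        if text and tag not in NO_SPACE_BEFORE and prev_tag not in NO_SPACE_AFTER:
            text += " "
        text += word
        prev_tag = tag
    return text

# Abbreviated fragment of the clean-case tree, for illustration only.
sample = "( (`` ``) (S (NP (PRP I) ) (VP (VBP leave) (NP (DT this) (NN church) ))) ('' '') (. .) )"
print(to_text(sample))  # ``I leave this church''.
```

Markers such as ( END_OF_TEXT_UNIT ) do not match the leaf pattern, so they are skipped without special handling.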
Tree Conversion: Problematic Case

              (NP (DT the) (NNS Women) )
              (POS 's) (NN S.P.C.A.) )))
          (: ;) (: ;)
          (NP
            (NP ($ $) (CD 15,000) )
            (S
              (NP (-NONE- T) )
              (AUX (TO to) )
              (VP (VB pay)
                (NP
                  (NP (CD six) (NNS policemen) )

( (S
    (NP (PRP He) )
    (VP (VBD reported)
      (SBAR (IN that)
        (S
          (NP
            (NP (DT the) (NN city) )
            (POS 's) (NNS contributions)
            (PP (IN for)
              (NP (NN animal) (NN care) )))
          (VP (VBD included)
            (NP
              (NP ($ $) (CD 67,000)
                (PP (TO to)
                  (NP
                    (NP (DT the) (NNS Women) )
                    (POS 's) (NN S.P.C.A.) )))
              (: ;) (: ;)
              (NP
                (NP ($ $) (CD 15,000) )
                (S
                  (NP (-NONE- T) )
                  (AUX (TO to) )
                  (VP (VB pay)
                    (NP
                      (NP (CD six) (NNS policemen) )
                      (VP (VBN assigned)
                        (PP (IN as)
                          (NP (NN dog) (NNS catchers) )))))))
              (CC and)
              (NP
                (NP ($ $) (CD 15,000) )
                (S
                  (NP (-NONE- T) )
                  (AUX (TO to) )
                  (VP (VB investigate)
                    (NP (NN dog) (NNS bites) ))))))))))
  (. .) )
( END_OF_TEXT_UNIT )
ca09_46 He reported that the city's contributions for animal care included $67,000 to the Women's S.P.C.A.;; $15,000 to pay six policemen assigned as dog catchers and $15,000 to investigate dog bites.
Summary of Problems Encountered
• Typing errors
• Punctuation duplication in data
• Special notation for delimiter characters
  • RRB, LRB, RSB, LSB, RCB, LCB
• Special null elements
  • ( -NONE- ) * 0 T NIL
** Conventions for final output need to consider these lessons
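Several of the problems listed above can be handled in one cleaning pass over the (tag, word) leaves: drop -NONE- null elements, map the bracket codes back to their characters, and collapse punctuation that appears twice in a row. The sketch below assumes the leaves have already been extracted as pairs; the `clean` function and the choice of which tags count as punctuation are assumptions for illustration, and collapsing repeats would be wrong for any corpus where doubled punctuation is intentional.

```python
# Bracket codes: Left/Right Round, Square, and Curly Bracket.
BRACKETS = {"LRB": "(", "RRB": ")", "LSB": "[", "RSB": "]", "LCB": "{", "RCB": "}"}
# Tags treated as punctuation when checking for duplicates (assumed set).
PUNCT_TAGS = {",", ".", ":"}

def clean(pairs):
    """Return (tag, word) pairs with nulls dropped and delimiters normalized."""
    out = []
    for tag, word in pairs:
        if tag == "-NONE-":  # null element: *, 0, T, NIL
            continue
        # Accept both the bare code (LRB) and the hyphenated form (-LRB-).
        word = BRACKETS.get(word.strip("-"), word)
        if out and tag in PUNCT_TAGS and out[-1] == (tag, word):
            continue  # same punctuation token twice in a row
        out.append((tag, word))
    return out

print(clean([(":", ";"), (":", ";"), ("-NONE-", "T"), ("TO", "to")]))
# [(':', ';'), ('TO', 'to')]
```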
Future Recommendations
• Put POS tree data into a proper database
  • Increases confidence in correctness of data
  • Minimizes error
  • Spend more effort upfront *once* to clean data
  • SQL queries are more reusable than (write-only) Perl scripts
    • Script quality varies with each graduate student's ability
• If the DB option is not available
  • Avoid duplication of data in final output
  • Avoid text delimiters that also exist as data tokens (“ ‘ , \s )
  • Use thoughtful labeling conventions
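As one possible reading of the database recommendation, the tagged tokens could be loaded into a relational table and queried with SQL. The sketch below uses Python's built-in sqlite3 module; the `token` table, its columns, and the helper names are hypothetical, not a schema proposed in the slides.

```python
import sqlite3

def load(conn, sentence_id, pairs):
    """Store the (tag, word) leaves of one sentence, one row per token."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS token ("
        " sentence_id TEXT, position INTEGER, tag TEXT, word TEXT,"
        " PRIMARY KEY (sentence_id, position))"
    )
    conn.executemany(
        "INSERT INTO token VALUES (?, ?, ?, ?)",
        [(sentence_id, i, tag, word) for i, (tag, word) in enumerate(pairs)],
    )

def words_with_tag(conn, tag):
    """Return all words carrying the given POS tag, in corpus order."""
    rows = conn.execute(
        "SELECT word FROM token WHERE tag = ? ORDER BY sentence_id, position",
        (tag,),
    )
    return [word for (word,) in rows]

conn = sqlite3.connect(":memory:")
load(conn, "ca09_46", [("CD", "six"), ("NNS", "policemen"), ("VBN", "assigned"),
                       ("IN", "as"), ("NN", "dog"), ("NNS", "catchers")])
print(words_with_tag(conn, "NNS"))  # ['policemen', 'catchers']
```

The primary key on (sentence_id, position) makes the database reject a sentence that is loaded twice, which addresses the duplication concern at load time. Because words are stored as column values, no text delimiter can collide with a data token.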