
Corpus Annotation with Linked Open Data

Learn about inline and stand-off annotation, the NLP Interchange Format (NIF), CoNLL-RDF, and the W3C recommendation for web annotation. See examples and tips for efficient data annotation using RDF standards.

fmarion


Presentation Transcript


  1. Corpus Annotation with Linked Open Data. John P. McCrae and Thierry Declerck

  2. Summary
     • Inline and stand-off annotation
     • Web Annotation / Open Annotation
     • NLP Interchange Format
     • CoNLL-RDF

  3. Why annotate?
     • Ontologies capture facts about concepts, not the usage of words.
     • Lexicons capture facts about patterns and systems of usage.
     • Sometimes we wish to capture data about a specific usage.

  4. Inline annotation
     Typically done with XML:
       <div type="essay">
         <head>An Essay on Summer</head>
         <p>Summer school in <date when="1990">MCMXC</date> was never easy; it went by too quickly and left us wanting more.</p>
         <p>But, as my friend <name type="person">Peter</name> said with his inimitable <foreign xml:lang="fr">je ne sais quoi</foreign>, <said>It never pays to think too hard</said>. Or, as I would rather put it, <quote xml:lang="es">Que sera, sera</quote>.</p>
       </div>
     Pros:
     • Easy and quick to do
     Cons:
     • Limited expressivity
     • Complicates the source document
     • Annotations cannot be added later

  5. Stand-off Annotation [Diagram: Annotations 1 to 4 are kept in a separate annotation file and point into the source document.]
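The difference between the inline and stand-off styles can be sketched in a few lines of Python. The snippet below is only an illustration, not something from the slides: it takes the second paragraph of the inline example, strips the tags, and records each element as a character-offset span over the plain text. The function name `to_standoff` and the dictionary layout are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Second paragraph of the inline example from the slides.
INLINE = (
    '<p>But, as my friend <name type="person">Peter</name> said with his '
    'inimitable <foreign xml:lang="fr">je ne sais quoi</foreign>, '
    '<said>It never pays to think too hard</said>. Or, as I would rather '
    'put it, <quote xml:lang="es">Que sera, sera</quote>.</p>'
)


def to_standoff(xml_text):
    """Return (plain_text, annotations); offsets index into plain_text."""
    root = ET.fromstring(xml_text)
    parts = []          # text chunks in document order
    annotations = []    # one dict per element (hypothetical layout)
    offset = 0

    def walk(elem):
        nonlocal offset
        start = offset
        if elem.text:
            parts.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            walk(child)
            # Tail text belongs to the parent, so it is added after the child.
            if child.tail:
                parts.append(child.tail)
                offset += len(child.tail)
        annotations.append(
            {"tag": elem.tag, "attrib": dict(elem.attrib),
             "start": start, "end": offset}
        )

    walk(root)
    return "".join(parts), annotations


text, annotations = to_standoff(INLINE)
for a in annotations:
    print(a["tag"], a["start"], a["end"], repr(text[a["start"]:a["end"]]))
```

The source text is left untouched and the annotation list can live in a separate file, which is why further annotations can be added later without editing the document.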

  6. Web Annotation. An annotation recommendation from the W3C: https://www.w3.org/TR/annotation-model/

  7. Web Annotation: Target and Body
     • body
       • the element containing the annotation
       • object property: oa:hasBody (any RDF object)
       • datatype property: oa:bodyValue (strings)
     • target
       • the element being annotated
       • any RDF object, including
         • oa:Selector (more in a second)

  8. Selector Types
     • oa:FragmentSelector
       • Uses the IRI fragment specification defined by the representation's media type.
     • oa:TextQuoteSelector
       • Describes a range of text by copying it, and including some of the text immediately before (a prefix) and after (a suffix) it to distinguish it.
     • oa:TextPositionSelector
       • Describes a range of text by recording the start and end positions.
     • oa:DataPositionSelector
       • Describes a range of data by recording the start and end positions of the selection.
     • oa:SvgSelector
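To make the two text selectors concrete, here is a minimal Python sketch of how they might be resolved against a plain-text document. The function names and the sample sentence are invented for illustration; the offset convention (zero-based start, end not included) follows the Web Annotation model, which happens to match Python slicing.

```python
def resolve_text_position(text, start, end):
    """TextPositionSelector: start is included, end is not."""
    return text[start:end]


def resolve_text_quote(text, exact, prefix="", suffix=""):
    """TextQuoteSelector: return every (start, end) span of `exact`
    that is preceded by `prefix` and followed by `suffix`."""
    needle = prefix + exact + suffix
    spans = []
    i = text.find(needle)
    while i != -1:
        start = i + len(prefix)
        spans.append((start, start + len(exact)))
        i = text.find(needle, i + 1)
    return spans


# Hypothetical sample text, not taken from the corpus cited in the slides.
sample = "James Baker met Mr. Baker in Washington."

print(resolve_text_position(sample, 0, 11))
print(resolve_text_quote(sample, "Baker"))
print(resolve_text_quote(sample, "Baker", prefix="Mr. "))
```

Without a prefix or suffix the quote selector matches every occurrence of the string, while adding context narrows it to a single span.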

  9. Web Annotation Example
       <http://example.org/name_example> a oa:Annotation ;
         oa:hasBody [
           a oa:TextualBody ;
           dc11:format "text/plain"^^xsd:string ;
           rdf:value "PERSON"^^xsd:string
         ] ;
         oa:hasTarget [
           oa:hasSelector [
             a oa:TextQuoteSelector ;
             oa:exact "James Baker"^^xsd:string
           ] ;
           oa:hasSource <https://catalog.ldc.upenn.edu/.../06/wsj_0655.name>
         ] .
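The same annotation can also be written in the JSON-LD serialization of the Web Annotation model. The slides only show Turtle, so the Python sketch below is an illustration under that assumption; `make_annotation` is a hypothetical helper, and the source URL is copied from the slide with its elided path left as it is.

```python
import json


def make_annotation(anno_id, source, exact, label):
    """Build a Web Annotation (JSON-LD form) tagging a quoted string."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "body": {
            "type": "TextualBody",
            "value": label,
            "format": "text/plain",
        },
        "target": {
            "source": source,
            "selector": {"type": "TextQuoteSelector", "exact": exact},
        },
    }


annotation = make_annotation(
    "http://example.org/name_example",
    "https://catalog.ldc.upenn.edu/.../06/wsj_0655.name",  # path elided in the slides
    "James Baker",
    "PERSON",
)
print(json.dumps(annotation, indent=2))
```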

  10. Web Annotation Example [Graph diagram of the annotation on the previous slide: name_example (oa:Annotation) has a body (oa:TextualBody, format text/plain, value PERSON) and a target whose source is https://catalog.ldc.upenn.edu/.../06/wsj_0655.name and whose selector (oa:TextQuoteSelector) has the exact value "James Baker".]

  11. Web Annotation Summary
      • relatively good uptake
      • reification
        • annotation as an n:m relation between bodies and targets
        • with metadata
      • powerful
        • annotate all instances of a string at once using a …
      • very verbose
        • the previous example uses 10 triples
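The triple count can be checked by writing the Turtle example out as explicit subject, predicate, object tuples. This is a plain-Python sketch with no RDF library; the blank-node labels (`_:body`, `_:target`, `_:selector`) are made up for readability.

```python
# Each tuple is one RDF triple of the Turtle example; prefixed names are
# kept as plain strings, and "_:..." labels stand in for blank nodes.
TRIPLES = [
    ("ex:name_example", "rdf:type", "oa:Annotation"),
    ("ex:name_example", "oa:hasBody", "_:body"),
    ("_:body", "rdf:type", "oa:TextualBody"),
    ("_:body", "dc11:format", "text/plain"),
    ("_:body", "rdf:value", "PERSON"),
    ("ex:name_example", "oa:hasTarget", "_:target"),
    ("_:target", "oa:hasSelector", "_:selector"),
    ("_:selector", "rdf:type", "oa:TextQuoteSelector"),
    ("_:selector", "oa:exact", "James Baker"),
    ("_:target", "oa:hasSource",
     "https://catalog.ldc.upenn.edu/.../06/wsj_0655.name"),
]

# Only two triples carry the actual payload (the label and the quoted text);
# the rest is structure, which is where the verbosity comes from.
payload = [t for t in TRIPLES if t[1] in ("rdf:value", "oa:exact")]
print(len(TRIPLES), "triples,", len(payload), "of them payload")
```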

  12. NLP Interchange Format
      • String URIs
        • e.g., in a web document
        • can be used directly as the object of oa:hasTarget
      • a simple ontology of linguistic data structures
        • for selected, typical NLP annotations
        • not covering everything you will ever need for linguistic annotation ;)

  13. RFC 5147
      Allows URIs to refer to fragments of a text:
      • Character offsets: https://catalog.ldc.upenn.edu/docs/LDC95T7/raw/06/wsj_0655.txt#char=19,30
      • Line offsets: https://catalog.ldc.upenn.edu/docs/LDC95T7/raw/06/wsj_0655.txt#line=0
      • Integrity checks: https://.../wsj_0655.txt#char=19,30;md5=67f60186fe687bb898ab7faed17dd96a
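As a rough illustration of how such fragment identifiers could be consumed, the Python sketch below splits a URI into its base, the `char=` or `line=` range, and any integrity checks, then slices a text accordingly. It is a simplified reading of RFC 5147 rather than a complete implementation, and the helper names and the sample sentence are assumptions made for the example.

```python
import hashlib
from urllib.parse import urldefrag


def parse_text_fragment(uri):
    """Return (base, scheme, start, end, checks) for a char=/line= fragment."""
    base, fragment = urldefrag(uri)
    selector, *check_parts = fragment.split(";")
    scheme, _, value = selector.partition("=")
    if scheme not in ("char", "line"):
        raise ValueError(f"not a char= or line= fragment: {fragment!r}")
    if "," in value:
        start_s, end_s = value.split(",", 1)
        start = int(start_s) if start_s else 0     # ",N" starts at the beginning
        end = int(end_s) if end_s else None        # "N," runs to the end
    else:
        start = end = int(value)                   # a single position
    checks = dict(part.partition("=")[::2] for part in check_parts)
    return base, scheme, start, end, checks


def extract(text, scheme, start, end):
    """Positions sit between characters (or lines), counted from 0."""
    if scheme == "char":
        return text[start:end]
    return "".join(text.splitlines(keepends=True)[start:end])


def verify_md5(data, checks):
    """Compare the MD5 of the raw bytes with an md5= check, if one is given."""
    expected = checks.get("md5")
    if expected is None:
        return True
    # The check may carry a ",charset" suffix; only the digest is compared here.
    return hashlib.md5(data).hexdigest() == expected.split(",")[0].lower()


# Hypothetical sample text; the corpus file from the slides is not reproduced.
sample = "The quick brown fox jumps over the lazy dog"
print(extract(sample, "char", 4, 9))
```

A consumer would typically fetch the base URI, run the integrity check on the retrieved bytes, and only then trust the offsets.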

  14. NLP Interchange Format

  15. Web Annotation + NIF [Graph diagram: name_example (oa:Annotation) has a body (oa:TextualBody, format text/plain, value PERSON) and a target that is a nif:String identified directly by the URI https://.../wsj_0655.name#char=2,22.]

  16. NLP Interchange Format
      • Slightly simpler method of reference
      • Saves some triples
        • but is still very verbose
      • Less standardised and less widely supported than plain Web Annotation

  17. CoNLL-RDF
      CoNLL is a family of formats widely used in NLP:
      • tab-separated values
      • one word per line
      • one column per annotation type
      • sentences separated by empty lines
      • conventions for most types of word-based linguistic annotation

  18. CoNLL Example
      ID   WORD      LEMMA     POS_COARSE  POS    FEATS           HEAD  EDGE
      1_1  Sie       sie       P           PPER   nom|pl|*|3      2     SB
      1_2  dürfen    dürfen    V           VMFIN  pl|3|pres|ind   0     --
      1_3  eine      ein       A           ART    acc|sg|fem      4     NK
      1_4  Kopie     Kopie     N           NN     acc|sg|fem      12    OA
      1_5  der       der       A           ART    gen|sg|fem      6     NK
      1_6  Software  Software  N           NN     gen|sg|fem      4     AG
      1_7  auf       auf       A           APPR   _               4     MNR
      1_8  dem       der       A           ART    dat|sg|masc     9     NK
      Column groups: ID; word and lemma (WORD, LEMMA); part of speech (POS_COARSE, POS); inflection (FEATS); dependency structure (HEAD, EDGE).
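Reading this kind of file takes only a few lines. The sketch below is an assumption-laden illustration: the column names are borrowed from the labels on the following slides, the parser expects real tab characters between fields, and it treats blank lines as sentence boundaries as described above.

```python
COLUMNS = ["ID", "WORD", "LEMMA", "POS_COARSE", "POS", "FEATS", "HEAD", "EDGE"]

# First three rows of the slide's example, tab-separated.
SAMPLE = (
    "1_1\tSie\tsie\tP\tPPER\tnom|pl|*|3\t2\tSB\n"
    "1_2\tdürfen\tdürfen\tV\tVMFIN\tpl|3|pres|ind\t0\t--\n"
    "1_3\teine\tein\tA\tART\tacc|sg|fem\t4\tNK\n"
)


def parse_conll(text, columns=COLUMNS):
    """Return a list of sentences; each sentence is a list of row dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(columns, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


for row in parse_conll(SAMPLE)[0]:
    print(row["ID"], row["WORD"], row["POS"], "head:", row["HEAD"], row["EDGE"])
```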

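  19. CoNLL as RDF (simple) [Graph diagram: node 1_1 carries the literal properties WORD "Sie", LEMMA "sie", POS_COARSE "P", POS "PPER", FEATS "nom|pl|*|3", HEAD "2" and EDGE "SB", and is linked to node 1_2 by nif:nextWord.]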

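  20. CoNLL as RDF (better) [Graph diagram: node 1_1 carries WORD "Sie", LEMMA "sie", POS_COARSE "P", POS "PPER" and FEATS "nom|pl|*|3"; node 1_2 carries WORD "dürfen" and LEMMA "dürfen"; 1_1 is linked to 1_2 by nif:nextWord and by an SB edge in place of the HEAD and EDGE literals.]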
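The step from the simple to the better encoding can be sketched as a small conversion: each annotation column becomes a literal-valued property, consecutive words are chained with nif:nextWord, and the HEAD and EDGE columns are folded into one edge pointing at the head word. The `conll:` prefix, the `:s` URI scheme and the function names below are placeholders chosen for this sketch, not the vocabulary of any particular tool.

```python
# First two rows of the slide's example as dictionaries.
ROWS = [
    {"ID": "1_1", "WORD": "Sie", "LEMMA": "sie", "POS_COARSE": "P",
     "POS": "PPER", "FEATS": "nom|pl|*|3", "HEAD": "2", "EDGE": "SB"},
    {"ID": "1_2", "WORD": "dürfen", "LEMMA": "dürfen", "POS_COARSE": "V",
     "POS": "VMFIN", "FEATS": "pl|3|pres|ind", "HEAD": "0", "EDGE": "--"},
]

LITERAL_COLUMNS = ("WORD", "LEMMA", "POS_COARSE", "POS", "FEATS")


def word_uri(word_id):
    """Hypothetical URI scheme for word nodes."""
    return ":s" + word_id


def conll_to_triples(rows):
    """Return (subject, predicate, object) tuples for one sentence."""
    triples = []
    for i, row in enumerate(rows):
        subj = word_uri(row["ID"])
        for col in LITERAL_COLUMNS:
            triples.append((subj, "conll:" + col, row[col]))  # object is a literal
        if i + 1 < len(rows):
            triples.append((subj, "nif:nextWord", word_uri(rows[i + 1]["ID"])))
        if row["HEAD"] != "0":  # HEAD 0 marks the root, which has no head
            sentence = row["ID"].split("_")[0]
            head = word_uri(sentence + "_" + row["HEAD"])
            triples.append((subj, "conll:" + row["EDGE"], head))  # object is a node
    return triples


for triple in conll_to_triples(ROWS):
    print(triple)
```

Turning the dependency into a node-to-node edge is what makes the structure traversable as a graph rather than a set of string values.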

  21. CoNLL as an RDF Tree [Tree diagram of the dependency structure over the nodes 1_1 (Sie), 1_2 (dürfen), 1_3 (eine), 1_4 (Kopie), 1_5 (der), 1_6 (Software) and 1_12 (installieren).] Easy to query with SPARQL.
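The kind of question a tree query answers, such as "which words depend, directly or indirectly, on a given head?", can be imitated in plain Python. This is only a stand-in for a SPARQL query over the RDF tree, not SPARQL itself; the `HEADS` mapping is derived from the HEAD column of the CoNLL example, and `descendants` is a hypothetical helper.

```python
# Dependent -> head, taken from the HEAD column of the CoNLL example
# (the root 1_2, whose HEAD is 0, has no entry).
HEADS = {
    "1_1": "1_2",
    "1_3": "1_4",
    "1_4": "1_12",
    "1_5": "1_6",
    "1_6": "1_4",
    "1_7": "1_4",
    "1_8": "1_9",
}


def descendants(node, heads):
    """Return all words that depend directly or transitively on `node`."""
    children = {}
    for dep, head in heads.items():
        children.setdefault(head, []).append(dep)
    found, stack = [], [node]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return sorted(found)


print(descendants("1_4", HEADS))
```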

  22. Conclusion
      RDF is a powerful method of representing corpus annotations, but it is:
      • not well adopted by many major projects
      • sometimes verbose and hard to read
      • limited in tool support
      This should change over the next few years.
