
Modern Information Retrieval




  1. Modern Information Retrieval Chap. 06: Text and Multimedia Languages and Properties (Introduction, Metadata and Text) 6.1, 6.2, 6.3

  2. Introduction • Text • main form of communicating knowledge. • Document • loosely defined; denotes a single unit of information. • can be any physical unit • a file • an email • a Web page

  3. Introduction • Document • Syntax and structure • Semantics • Information about itself

  4. Introduction • Document Syntax • Implicit, or expressed in a language (e.g., TeX) • Powerful languages: easier to parse, difficult to convert to other formats. • Open languages are better (interchange) • Semantics of texts in natural language are not easy for a computer to understand • Trend: languages that provide information on structure, format and semantics while being readable by humans and computers

  5. Introduction • New applications are pushing for formats in which information can be represented independently of style. • Style: defined by the author, but the reader may decide part of it • Style can include treatment of other media

  6. Metadata • “Data about the data” • e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc. • Descriptive Metadata • Author, source, length • Dublin Core Metadata Element Set • Semantic Metadata • Characterizes the subject matter within the document contents • MEDLINE

  7. Metadata • Metadata information on Web documents • cataloging, content rating, property rights, digital signatures • New standard: Resource Description Framework • description of Web resources to facilitate automated processing of information • nodes and attached attribute/value pairs • Metadescription of non-textual objects • keywords can be used to search the objects

  8. RDF Model • A model is a collection of statements • Statement := (predicate,subject,object) • Predicate is a resource • Subject is a resource • Object is either a resource or a literal • [Figure: a statement drawn as an edge labeled Predicate from Subject to Object]

  9. Example shown in triples view

  10. RDF model and natural language • Subject. In grammar, this is the noun or noun phrase that is the doer of the action. In the sentence “The company sells batteries,” the subject is “the company.” • Predicate. In grammar, this is the part of a sentence that modifies the subject and includes the verb phrase. In our sentence, the predicate is the phrase “sells” • Object. In grammar this is a noun that is acted upon by the verb. In our sentence, the object is the noun “batteries.”

  11. XML vs. RDF • RDF is not just an XML dialect. • XML: • Has a tree structure data model. • Only nodes are labeled. • RDF: • Has a graph structure data model. • Both edges (properties) and nodes (subjects/objects) are labeled.

  12. Linking Statements • The subject of one statement can be the object of another • Such collections of statements form a directed, labeled graph • [Figure: graph with nodes Ganji, CE, Sharif and http://ce.sharif.edu; edge labels studentOF, departmentOF, hasHomePage]

  13. RDF Graph: ‘anonymous’ nodes • [Figure: graph with nodes Person, PersonName, Literal and Person12345; edge labels person.name, first, last, value; literal values Jonathan and Borden]
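
The linked-statement idea can be sketched in a few lines of Python. This is only an illustrative sketch: the edge directions (Ganji studentOF CE, CE departmentOF Sharif, CE hasHomePage http://ce.sharif.edu) are assumed from the labels on the slide and the later RDF/XML example, and the helper name `reachable` is hypothetical.

```python
from collections import defaultdict

# Each statement is a (predicate, subject, object) tuple, following the
# ordering given on the RDF Model slide. Edge directions are assumed.
statements = [
    ("studentOF", "Ganji", "CE"),
    ("departmentOF", "CE", "Sharif"),
    ("hasHomePage", "CE", "http://ce.sharif.edu"),
]

# Index the statements as a directed, labeled graph:
# subject -> list of (predicate, object) outgoing edges.
graph = defaultdict(list)
for predicate, subject, obj in statements:
    graph[subject].append((predicate, obj))


def reachable(start):
    """Return every node reachable from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


print(reachable("Ganji"))
```

Because CE is the object of the first statement and the subject of the other two, everything in the model becomes reachable from Ganji.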

  14. How can RDF be implemented • Usually RDF/XML syntax • However other notations are possible • e.g. Notation3: • Buddy Belden owns a business. • The business has a Web site accessible at http://www.c2i2.com/~budstv. • Buddy is the father of Lynne. • <#Buddy> <#owns> <#business>. • <#business> <#has-website> <http://www.c2i2.com/~budstv>. • <#Buddy> <#father-of> <#Lynne>.

  15. Converting N3 to RDF • Jena toolkit can do such conversion

  16. XML Syntax for RDF • RDF has an XML syntax that has a specific meaning: • Every Description element describes a resource • Every attribute or nested element inside a Description is a property of that resource • We can refer to resources by using URIs <rdf:Description about="some.uri/person/ganji"> <studentOf resource="some.uri/Sharif/CE"/> </rdf:Description> <rdf:Description about="some.uri/Sharif/CE"> <hasHomePage>http://ce.sharif.edu</hasHomePage> <departmentOf resource="some.uri/~Sharif"/> </rdf:Description>

  17. RDF type • RDF predefined property • Its value: a resource that represents a category or class • Its subject: an instance of that category or class • [Figure: example using the prefix ex: for the URI http://www.example.org/terms]

  18. Containers • Containers are collections • they allow grouping of resources (or literal values) • It is possible to make statements about the container (as a whole) or about its members individually • It is also possible to create collections based on URI patterns • for example, all files in a particular web site

  19. RDF containers • Bag: (A resource having type rdf:Bag) • Represents an unordered list of resources or literals • Duplicated values are permitted • Sequence: (A resource having type rdf:Seq) • Represents an ordered list of resources or literals • Duplicated values are permitted • Alternatives: (A resource having type rdf:Alt) • Represents a group of resources or literals that are alternatives

  20. Sequence example • [Figure: the resource http://www.w3.org/TR/REC-rdf-syntax has a dc:Creator node of rdf:Type rdf:Seq, whose members are rdf:_1 “Ora Lassila” and rdf:_2 “Ralph Swick”]

  21. Bag example

  22. RDF Schema (RDFS) • RDF gives a formalism for metadata annotation, and a way to write it down in XML, but it does not give any special meaning to vocabulary such as subClassOf or type • RDF Schema allows you to define vocabulary terms and the relations between those terms • it gives “extra meaning” to particular RDF predicates and resources • this “extra meaning”, or semantics, specifies how a term should be interpreted

  23. Core Classes & Properties • Core Classes: rdfs:Resource, rdfs:Literal, rdf:XMLLiteral, rdfs:Class, rdf:Property • Core Properties: rdf:type, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range, rdfs:label, rdfs:comment

  24. RDFS Examples <Person,type,Class> <hasColleague,type,Property> <Professor,subClassOf,Person> <Carole,type,Professor> <hasColleague,range,Person> <hasColleague,domain,Person>
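
To make the "extra meaning" of subClassOf concrete, here is a minimal sketch of one RDFS-style inference rule applied to the triples above: if x has type C and C is a subClassOf D, then x also has type D. The triples are written in the slide's <subject,predicate,object> order; the function name `infer_types` is hypothetical, and a real RDFS reasoner applies many more rules than this one.

```python
# The example triples from the slide, in (subject, predicate, object) order.
triples = {
    ("Person", "type", "Class"),
    ("hasColleague", "type", "Property"),
    ("Professor", "subClassOf", "Person"),
    ("Carole", "type", "Professor"),
    ("hasColleague", "range", "Person"),
    ("hasColleague", "domain", "Person"),
}


def infer_types(facts):
    """Apply (x type C), (C subClassOf D) => (x type D) until nothing changes."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        new = {
            (x, "type", d)
            for (x, p1, c) in inferred if p1 == "type"
            for (c2, p2, d) in inferred if p2 == "subClassOf" and c2 == c
        }
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


closure = infer_types(triples)
print(("Carole", "type", "Person") in closure)
```

The only new fact derived here is that Carole is a Person, which is never stated explicitly in the input.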

  25. RDF/RDFS “Liberality” • No distinction between classes and instances (individuals) <Species,type,Class> <Lion,type,Species> <Leo,type,Lion> • Properties can themselves have properties <hasDaughter,subPropertyOf,hasChild> <hasDaughter,type,familyProperty> • No distinction between language constructors and ontology vocabulary, so constructors can be applied to themselves/each other <type,range,Class> <Property,type,Class> <type,subPropertyOf,subClassOf>

  26. Problems with RDFS • RDFS too weak to describe resources in sufficient detail • No localised range and domain constraints • Can’t say that the range of hasChild is person when applied to persons and elephant when applied to elephants • No existence/cardinality constraints • Can’t say that all instances of person have a mother that is also a person, or that persons have exactly 2 parents • No transitive, inverse or symmetrical properties • Can’t say that isPartOf is a transitive property, that hasPart is the inverse of isPartOf or that touches is symmetrical • … • Difficult to provide reasoning support • No “native” reasoners for non-standard semantics • May be possible to reason via FO axiomatisation

  27. RDF(S) tools • Read RDF data • Parsers: Jena, Redland, SWI-Prolog • Validators: W3C RDF validation service • Editors: IsaViz, RDF Author, RDFEd, InferEd • Store RDF data (XML format, triples or relational/OO DB) • Sesame, RSSDB, RDFLib • Use RDF data (applications, RSS news, etc.) • Manipulate RDF data (inference, query, etc.) • Jena RDQL, etc. • Example: SELECT ?person, ?knows WHERE (?x <http://xmlns.com/foap/knows> ?z), (?x <http://xmlns.com/foap/name> ?person), (?z <http://xmlns.com/foap/name> ?knows)
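
The RDQL example is essentially a conjunction of triple patterns with shared variables. The sketch below shows how such a query could be evaluated over an in-memory list of triples. It is not how Jena implements RDQL; the functions `match` and `query`, the resource names `#a` and `#b`, and the names Alice and Bob are all hypothetical illustration data.

```python
KNOWS = "http://xmlns.com/foap/knows"  # property URIs copied from the slide
NAME = "http://xmlns.com/foap/name"

# Hypothetical data in (subject, predicate, object) order.
triples = [
    ("#a", KNOWS, "#b"),
    ("#a", NAME, "Alice"),
    ("#b", NAME, "Bob"),
]


def match(pattern, triple, bindings):
    """Unify one pattern with one triple; return extended bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, data):
    """Return all variable bindings that satisfy every pattern."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for bindings in solutions:
            for triple in data:
                candidate = match(pattern, triple, bindings)
                if candidate is not None:
                    extended.append(candidate)
        solutions = extended
    return solutions


patterns = [("?x", KNOWS, "?z"), ("?x", NAME, "?person"), ("?z", NAME, "?knows")]
rows = [(b["?person"], b["?knows"]) for b in query(patterns, triples)]
print(rows)
```

Each pattern narrows the set of candidate bindings, which mirrors how the three clauses of the WHERE part are joined on ?x and ?z.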

  28. RDF Validators • RDF Validation Service • http://www.w3.org/RDF/Validator/ • In general all the RDF parsers do some kind of validation

  29. References • RDF Resource Guide: • http://www.ilrt.bris.ac.uk/discovery/rdf/resources/ • http://www.w3.org/RDF • http://www.w3.org/RDF/Validator/

  30. Text • Text coding in bits • EBCDIC, ASCII • Initially, 7 bits. Later, 8 bits • Unicode • 16 bits, to accommodate oriental languages

  31. Text • Formats • No single format exists • IR system should retrieve information from different formats • Past: IR systems convert the documents • Today: IR systems use filters

  32. Text • Formats • Formats for document interchange (RTF) • Formats for displaying (PDF, PostScript) • Formats for encoding email (MIME) • Compressed files • uuencode/uudecode, binhex

  33. Text • Information Theory • Amount of information is related to the distribution of symbols in the document. • Entropy: $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$, where $\sigma$ is the number of symbols in the alphabet and $p_i$ is the probability of symbol $i$ • Definition of entropy depends on the probabilities of each symbol. • Text models are used to obtain those probabilities

  34. Text • Example - Entropy • 001001011011

  35. Text • Example - Entropy • 111111111111
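
The two example strings can be compared with a short calculation. The sketch below assumes the simplest text model, in which each symbol's probability is its relative frequency in the string itself (a zero-order model); the function name `entropy` is chosen for illustration.

```python
from collections import Counter
from math import log2


def entropy(text):
    """Entropy in bits per symbol, using relative frequencies as probabilities."""
    n = len(text)
    probabilities = [count / n for count in Counter(text).values()]
    # Sum of p * log2(1/p), which equals -sum of p * log2(p).
    return sum(p * log2(1 / p) for p in probabilities)


print(entropy("001001011011"))  # six 0s and six 1s
print(entropy("111111111111"))  # a single repeated symbol
```

Under this model the first string, with zeros and ones equally frequent, carries 1 bit per symbol, while the second, which is fully predictable, carries 0 bits.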

  36. Text • Modeling Natural Language • Symbols: separate words or belong to words • Symbols are not uniformly distributed • binomial model • Dependency on previous symbols • k-order Markovian model • We can take words as symbols

  37. Text • Modeling Natural Language • Word distribution inside documents • Zipf's Law: the i-th most frequent word appears $1/i^{\theta}$ times as often as the most frequent word; hence, in a text of $n$ words with a vocabulary of $V$ words, the i-th most frequent word appears $n / (i^{\theta} H_V(\theta))$ times, where $H_V(\theta) = \sum_{j=1}^{V} 1/j^{\theta}$ • Real data fits better with $\theta$ between 1.5 and 2.0

  38. Text • Modeling Natural Language • Example - word distribution (Zipf’s Law) • V=1000, $\theta = 2$ • most frequent word: n=300 • 2nd most frequent: n=76 • 3rd most frequent: n=33 • 4th most frequent: n=19

  39. Text • Modeling Natural Language • Number of distinct words • Heaps’ Law: $V = K n^{\beta} = O(n^{\beta})$, where $n$ is the size of the text in words • The set of different words in a language is bounded by a constant, but that bound is too high to be useful in practice

  40. Text • Modeling Natural Language • Heaps’ Law example • K between 10 and 100, $\beta$ is less than 1 • example: n=400000, $\beta = 0.5$ • K=25, V=15811 • K=35, V=22135

  41. Text • Modeling Natural Language • Length of the words • defines total space needed for vocabulary • Heaps’ Law implies that the length of vocabulary words increases logarithmically with text size. • In practice, a finite-state model is used • space has p=0.2 • space cannot appear twice in a row • there are 26 letters
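
As a rough check of the example, the sketch below divides the count of the most frequent word by the rank raised to the exponent, assuming the exponent is 2 as on the slide. The function name `zipf_counts` is hypothetical, and the slide does not show how its own figures were rounded or normalized, so small differences are to be expected.

```python
def zipf_counts(top_count, theta, ranks):
    """Predicted occurrences of the word at each rank under Zipf's Law."""
    return [top_count / rank ** theta for rank in ranks]


counts = zipf_counts(top_count=300, theta=2, ranks=range(1, 5))
print([round(c, 2) for c in counts])
```

The predicted counts for ranks 2 to 4 (75, about 33.3 and 18.75) are close to the 76, 33 and 19 listed above.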
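
The figures in this example follow from the usual statement of Heaps' Law, in which the vocabulary size is K times n raised to the power beta. The sketch below assumes that form; the function name `heaps_vocabulary` is hypothetical, and the result is truncated to an integer to match the values on the slide.

```python
def heaps_vocabulary(n, k, beta):
    """Estimated number of distinct words in a text of n words."""
    return k * n ** beta


for k in (25, 35):
    print(k, int(heaps_vocabulary(400000, k, 0.5)))
```

Since beta is below 1, the vocabulary grows sublinearly: multiplying the text size by 4 only doubles the estimate when beta is 0.5.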

  42. Text • Similarity Models • Distance Function • Should be symmetric and satisfy the triangle inequality • Hamming Distance • number of positions that have different characters • example: reverse, receive

  43. Text • Similarity Models • Edit (Levenshtein) Distance • minimum number of operations needed to make strings equal • example: survey, surgery • superior for modeling syntactic errors • extensions: weights, transpositions, etc.

  44. Text • Similarity Models • Longest Common Subsequence (LCS) • example: survey, surgery • LCS: surey • Documents: lines as symbols (diff in Unix) • time consuming
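
A minimal sketch of the Hamming distance, applied to the two words on the slide. It assumes the distance is only defined for strings of equal length; the function name `hamming` is chosen for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("reverse", "receive"))
```

The two words differ at the third, fifth and sixth positions, giving a distance of 3.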
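
The edit distance is usually computed by dynamic programming. The sketch below is one common formulation with unit costs for insertion, deletion and substitution, keeping only one row of the table at a time; the function name `edit_distance` is chosen for illustration, and the weighted or transposition extensions mentioned above are not covered.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("survey", "surgery"))
```

For the slide's pair the distance is 2: substitute v with g, then insert an r before the final y.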
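
The LCS can also be found by dynamic programming. The sketch below stores the subsequence itself in each table cell, which is simple but memory hungry, and it illustrates why the method is called time consuming: the table has one cell per pair of positions. The function name `lcs` is chosen for illustration.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b."""
    # table[i][j] holds an LCS of a[:i] and b[:j].
    table = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            if ca == cb:
                table[i + 1][j + 1] = table[i][j] + ca
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j], key=len)
    return table[-1][-1]


print(lcs("survey", "surgery"))
```

The same routine works on lists of lines if the cells hold lists instead of strings, which is the idea behind line-based comparison tools such as diff.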

  45. Conclusions • Text is the main form of communicating knowledge. • Documents have syntax, structure and semantics • Metadata: information about data • Formats of text • Modeling Natural Language • Entropy • Distribution of symbols • Similarity
