Metadata Mining
Masoud Makrehchi
Supervisor: Prof. Mohamed Kamel
May 2004

Outline
• Metadata Mining
• Proposed Approach
• Experimental Results
• Future Research

Metadata
• Metadata definition: data about data, for example a library catalogue
• Metadata applications:
  • cataloguing (items and collections)
  • resource discovery
  • e-commerce and digital signatures
  • intelligent software agents
  • content rating
  • intellectual property rights (IP rights)
  • Semantic Web
  • learning objects (LO): LOM standards such as IEEE LOM, DC, SCORM, CANCORE

Metadata
[Figure: examples of metadata: LO metadata, map legend, RDF, library catalogue]


Metadata Specifications
[Figure: conceptual data architecture]

Metadata Specifications
• More structured (usually semi-structured)
• Low-dimensional (~400,000 vs. 15,791 terms)
• More homogeneous than raw data
• Less noisy (stopwords)
• Less time-varying


Motivations
• No access to the raw data
• The raw data is inconvenient for mining (heterogeneous and non-text formats)
• Diversity of metadata standards and the need to merge different metadata repositories
• An attempt to use background knowledge in the mining process

Applications
• Either an alternative or a complementary approach, depending on whether the raw data is accessible
• Metadata enrichment (keyword extraction and automatic title generation), a step towards automatic metadata generation
• Semi-structured data mining, such as XML, RDF, OWL and other semantically tagged scripts
• Semi-automatic ontology extraction
• Information integration based on metadata (LO aggregation and integration)


Assumptions
• There is a corpus or a metadata collection, usually labeled
• No natural language processing (NLP) and no computational linguistics
• No semantic analysis or processing
• Metadata is treated as a text document, not as a graph, tree or other data structure
• No dictionaries, thesauruses, or any other global vocabularies
• Only frequency-based measures are used

Problem Statement
• Mine metadata as a text document using classical document mining techniques, while taking into account the background knowledge, which is represented by the proposed data model, and using a fuzzy approach to capture expert knowledge

Metadata Representation
• Metadata as a text document (semi-structured format)
• Using only statistical measures
• Frequency-based measures (the location and the order of words don't matter)

Metadata Representation
[Figure labels: the title's objective semantic; the whole document's semantic about the title; the title's subjective semantic]

Metadata Representation
• The meaning of a metadata record is broken into a set of objective semantics, which can be easily extracted using a parser (to extract the contents and attributes of every tag)
• Adopting the document vector space model (to use all classical document mining techniques)
• Metadata representation model = a set of objective semantics + Salton's vector space model (the multiple-partition document vector space model)
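The parsing step described on this slide can be sketched roughly as follows. This is only an illustration, not the author's parser: it assumes the metadata record is well-formed XML, treats every leaf tag as one partition, and ignores tag attributes for brevity. The function name extract_partitions and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def extract_partitions(xml_text):
    """Return {tag: Counter of lower-cased tokens} for every leaf element."""
    root = ET.fromstring(xml_text)
    partitions = {}
    for elem in root.iter():
        text = (elem.text or "").strip()
        # A leaf element with text is treated as one partition (assumption).
        if len(elem) == 0 and text:
            tokens = text.lower().split()
            partitions.setdefault(elem.tag, Counter()).update(tokens)
    return partitions


# Hypothetical record, made up for illustration only.
record = """<record>
  <title>Fuzzy term relations</title>
  <author>A. Author</author>
  <abstract>Term relations from fuzzy sets of classes</abstract>
</record>"""

parts = extract_partitions(record)
for tag, counts in parts.items():
    print(tag, dict(counts))
```

Each partition is a bag of words, so the classical vector space machinery can be applied to every tag separately.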

Metadata Representation
[Figure labels: concepts about the domain (discourse); expert; document; background knowledge (subjective); background knowledge (objective); partitions: title, author, subject, abstract]

Metadata Representation
• Metadata as a text document
  • The mining process reduces to classical document mining.
  • Metadata mining then offers no advantage, because the background knowledge is ignored.

Metadata Representation
• Metadata as a multiple-partition text document
  • The mining process still reduces to classical document mining.
  • The background knowledge is taken into account.

Metadata Representation
• Vector space model: fuzzy representation
[Figure labels: vocabulary, d_i]

Metadata Representation
• Multi-partition vector space model
  • Used to model the objective semantic (a portion of the background knowledge)
[Figure labels: vocabulary, d_i]

Metadata Representation
• Converting to the standard vector space model

Metadata Representation
• Weight of each partition
  • To be determined by an expert, for example W_abstract = 1.0 and W_title = 1.5
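One plausible way to convert a multi-partition document into a standard vector using the expert weights is a weighted sum of the partition vectors. The slides do not preserve the actual conversion formula, so the weighted sum below is an assumption, and the name flatten and the sample counts are hypothetical. Only the two weights come from the slide.

```python
from collections import Counter


def flatten(partitions, weights, default_weight=1.0):
    """Merge per-partition term counts into one vector, scaled by partition weight."""
    vector = Counter()
    for name, counts in partitions.items():
        w = weights.get(name, default_weight)
        for term, freq in counts.items():
            vector[term] += w * freq
    return dict(vector)


# Hypothetical term counts for one document.
partitions = {
    "title": Counter({"fuzzy": 1, "mining": 1}),
    "abstract": Counter({"mining": 2, "metadata": 1}),
}
weights = {"abstract": 1.0, "title": 1.5}  # example weights from the slide

vec = flatten(partitions, weights)
print(vec)
```

A term that occurs in the title therefore counts 1.5 times as much as the same term in the abstract.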

Metadata Representation
• Membership degree of each term in every partition
  • By expert (considering the vocabulary)
  • Statistical (considering the collection)
    • Absolute frequency-based measures (like tf-idf)
    • Relative frequency-based (geometric) measures (location of each term in the partition)
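As an illustration of a statistical membership degree, the sketch below computes tf-idf within one partition (for example all titles in the collection) and scales it to [0, 1] per document. The slide only says "like tf-idf", so the exact weighting, the max-normalisation, and the name tfidf_membership are assumptions.

```python
import math
from collections import Counter


def tfidf_membership(partition_docs):
    """Per-document membership degrees: tf * log(N / df), divided by the document's maximum."""
    n = len(partition_docs)
    df = Counter()
    for tokens in partition_docs:
        df.update(set(tokens))
    memberships = []
    for tokens in partition_docs:
        tf = Counter(tokens)
        scores = {t: f * math.log(n / df[t]) for t, f in tf.items()}
        peak = max(scores.values(), default=0.0)
        memberships.append(
            {t: (s / peak if peak > 0 else 0.0) for t, s in scores.items()}
        )
    return memberships


# Hypothetical title partition of three documents.
titles = [["fuzzy", "mining"], ["text", "mining"], ["fuzzy", "sets"]]
member = tfidf_membership(titles)
print(member[1])
```

In the second title, "text" is unique to that document and gets full membership, while "mining" also occurs elsewhere and gets a lower degree.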

Frequency Measures
• Types of frequency measures
  • Within partition
  • Within document: document-term frequency (like tf)
  • Within class: class-term frequency (like term significance)
  • Within collection: collection-term frequency (like the mean of term significances)
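The three levels of counting listed above can be sketched with plain term counts. This is a simplification: the slides relate the class and collection levels to "term significance", whose formula is not preserved, so raw counts stand in for it here. The function name and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def frequency_measures(labeled_docs):
    """Return (per-document, per-class, collection-wide) term counts."""
    doc_tf = []
    class_tf = defaultdict(Counter)
    collection_tf = Counter()
    for label, tokens in labeled_docs:
        counts = Counter(tokens)
        doc_tf.append(counts)            # within document
        class_tf[label].update(counts)   # within class
        collection_tf.update(counts)     # within collection
    return doc_tf, dict(class_tf), collection_tf


# Hypothetical labeled documents: (class label, tokens).
docs = [
    ("ai", ["fuzzy", "set", "fuzzy"]),
    ("ai", ["fuzzy", "agent"]),
    ("db", ["query", "set"]),
]
doc_tf, class_tf, collection_tf = frequency_measures(docs)
print(class_tf["ai"]["fuzzy"], collection_tf["set"])
```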

Class-Term Matrix
• Document-term matrix
  • The matrix is very large (thousands of documents in the collection and millions of terms in the vocabulary)
  • The matrix is sparse: usually only a small number of its elements are non-zero (Zipf's law)
  • The matrix is dual with respect to terms and documents

Class-Term Matrix
• Class-term matrix
  • The matrix is large (tens of classes and millions of terms in the vocabulary)
  • The matrix is less sparse
  • The matrix is dual with respect to terms and classes
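The claim that the class-term matrix is less sparse than the document-term matrix can be checked on a toy collection: aggregating documents into classes merges their non-zero entries into fewer rows. The data and the helper name density below are hypothetical.

```python
from collections import defaultdict


def density(rows, vocabulary_size):
    """Fraction of non-zero cells in a matrix stored as a list of sparse rows."""
    if not rows or vocabulary_size == 0:
        return 0.0
    return sum(len(row) for row in rows) / (len(rows) * vocabulary_size)


# Hypothetical labeled documents: (class label, tokens).
labeled_docs = [
    ("ai", ["fuzzy", "set"]),
    ("ai", ["neural", "net"]),
    ("db", ["query", "index"]),
    ("db", ["query", "plan"]),
]
vocabulary = {t for _, tokens in labeled_docs for t in tokens}

# One sparse row per document, then one sparse row per class.
doc_rows = [set(tokens) for _, tokens in labeled_docs]
class_rows = defaultdict(set)
for label, tokens in labeled_docs:
    class_rows[label].update(tokens)

doc_density = density(doc_rows, len(vocabulary))
class_density = density(list(class_rows.values()), len(vocabulary))
print(doc_density, class_density)
```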

Class-Term Matrix
• Significance factor: represents the relationship between a term and a class
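The formula for the significance factor did not survive in this transcript. As a stand-in, the sketch below assumes that the significance of a term in a class is the share of the term's occurrences that fall in that class, which gives a value in [0, 1]. Both this definition and the name class_term_matrix are assumptions, not the author's definition.

```python
from collections import Counter


def class_term_matrix(class_tf):
    """Map class -> term -> share of the term's occurrences found in that class."""
    totals = Counter()
    for counts in class_tf.values():
        totals.update(counts)
    return {
        cls: {term: n / totals[term] for term, n in counts.items()}
        for cls, counts in class_tf.items()
    }


# Hypothetical class-term frequencies.
class_tf = {
    "ai": {"fuzzy": 3, "model": 1},
    "db": {"query": 4, "model": 3},
}
matrix = class_term_matrix(class_tf)
print(matrix["ai"])
print(matrix["db"])
```

Under this assumption a term found in only one class has significance 1.0 there, and a term spread evenly over all classes has low significance in each.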

Concept Terminology
• Terminology
  • All terms that occur in a class (or concept)
  • A fuzzy set over all terms in the vocabulary

Term Definition
• Definition
  • All concepts (classes) to which the term belongs
  • A fuzzy set over all concepts (classes)
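Read against a class-term matrix, the two notions are a row and a column of the same table: the terminology of a concept is the row for that class, and the definition of a term is the column for that term. A minimal sketch, assuming the matrix is stored as a nested dictionary of membership degrees (the values and helper names below are hypothetical):

```python
def terminology(matrix, cls):
    """Fuzzy set of terms for one class: a row of the class-term matrix."""
    return dict(matrix.get(cls, {}))


def term_definition(matrix, term):
    """Fuzzy set of classes for one term: a column of the class-term matrix."""
    return {cls: row[term] for cls, row in matrix.items() if term in row}


# Hypothetical membership degrees.
matrix = {
    "ai": {"fuzzy": 1.0, "model": 0.25},
    "db": {"query": 1.0, "model": 0.75},
}
print(terminology(matrix, "ai"))
print(term_definition(matrix, "model"))
```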

Term Relationships

Fuzzy Similarity
• For example, “sum” and “product” are similar if they partially (fuzzily) co-occur.
• Similarity is a bi-directional relationship.
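The slide's own similarity formula is not preserved. One common symmetric choice, used here purely as an assumption, is the fuzzy Jaccard ratio between two term definitions (fuzzy sets of classes): the sum of the minima divided by the sum of the maxima. The membership values below are made up.

```python
def fuzzy_similarity(a, b):
    """Fuzzy Jaccard: sum of min memberships over sum of max memberships."""
    keys = set(a) | set(b)
    overlap = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    union = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return overlap / union if union > 0 else 0.0


# Hypothetical term definitions (class -> membership degree).
sum_def = {"math": 0.75, "cs": 0.25}
product_def = {"math": 0.5, "cs": 0.5}

print(fuzzy_similarity(sum_def, product_def))
print(fuzzy_similarity(product_def, sum_def))
```

Swapping the arguments gives the same value, which matches the bi-directional character of similarity.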

Fuzzy Inclusion
• For example, “web” includes “world” if “world” occurs wherever “web” appears.
• Inclusion is a one-directional relationship.
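A one-directional measure can be sketched with the standard fuzzy subsethood ratio: the degree to which term b is present wherever term a appears is the sum of the minima divided by the total membership of a. This is an assumed stand-in for the slide's lost formula, and the membership values are hypothetical.

```python
def fuzzy_inclusion(a, b):
    """Degree to which b is present wherever a appears: sum(min(a, b)) / sum(a)."""
    total = sum(a.values())
    if total == 0:
        return 0.0
    return sum(min(mu, b.get(k, 0.0)) for k, mu in a.items()) / total


# Hypothetical term definitions (class -> membership degree).
web = {"internet": 0.5}
world = {"internet": 0.5, "geography": 0.5}

print(fuzzy_inclusion(web, world))   # "web" includes "world"
print(fuzzy_inclusion(world, web))   # the reverse direction
```

The two directions differ here because "world" also appears in a class where "web" does not.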

Fuzzy Document Model
• Fuzzy document representation
  • A fuzzy set over all terms in the vocabulary
  • Obtaining the keyword set for the document: either by a threshold on the fuzzy set or by term categorization
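The threshold option amounts to an alpha-cut of the document's fuzzy set. In the sketch below the membership degree is assumed to be the term frequency divided by the highest term frequency in the document. That choice, the threshold value, and the function names are illustrative assumptions.

```python
from collections import Counter


def fuzzy_document(tokens):
    """Membership degree = term frequency / highest term frequency in the document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    peak = max(counts.values())
    return {term: n / peak for term, n in counts.items()}


def keyword_set(doc_fuzzy, threshold):
    """Alpha-cut: keep the terms whose membership reaches the threshold."""
    return {term for term, mu in doc_fuzzy.items() if mu >= threshold}


# Hypothetical document.
tokens = ["fuzzy"] * 4 + ["set"] * 2 + ["the"]
doc = fuzzy_document(tokens)
keywords = keyword_set(doc, threshold=0.5)
print(sorted(keywords))
```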

Term Categories
• Term categorization: categorizing all terms into three main groups
  • Features: the most frequent terms within a class
  • Keywords: the more frequent terms within some documents belonging to a given class
  • Stopwords: the more frequent terms in all classes
[Figure labels: collection, class, document; stopwords, features, keywords]
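A rough rule-based version of this categorization is sketched below. It uses the share of a class's documents that contain a term as the frequency measure: a term common in every class is a stopword, a term common in at least one class is a feature, and the rest are keywords. The measure, the 0.6 cut-off, and all names are assumptions made for illustration; the slides do not give the actual decision rule.

```python
from collections import defaultdict


def categorize_terms(labeled_docs, high=0.6):
    """Label each term as 'stopword', 'feature' or 'keyword' (illustrative rule)."""
    n_docs = defaultdict(int)
    df = defaultdict(lambda: defaultdict(int))
    for label, tokens in labeled_docs:
        n_docs[label] += 1
        for term in set(tokens):
            df[label][term] += 1
    classes = list(n_docs)
    vocabulary = {term for cls in classes for term in df[cls]}
    categories = {}
    for term in vocabulary:
        # Share of each class's documents that contain the term.
        cover = [df[cls].get(term, 0) / n_docs[cls] for cls in classes]
        if all(v >= high for v in cover):
            categories[term] = "stopword"   # frequent in all classes
        elif any(v >= high for v in cover):
            categories[term] = "feature"    # frequent within some class
        else:
            categories[term] = "keyword"    # frequent only in a few documents
    return categories


# Hypothetical labeled documents.
docs = [
    ("ai", ["paper", "fuzzy", "logic"]),
    ("ai", ["paper", "fuzzy", "agent"]),
    ("db", ["paper", "query", "index"]),
    ("db", ["paper", "query", "join"]),
]
cats = categorize_terms(docs)
print(cats["paper"], cats["fuzzy"], cats["logic"])
```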

Domain-specific Stopwords
• Stopwords: terms that do not contribute to the meaning of the document
  • General stopwords (a, an, the, in, …)
  • Domain-specific stopwords
    • Politics: government, state
    • Medicine: patient
    • Education: learner, instructor
    • Social sciences: society
    • Anthropology: human

Class-Collection Map
• Introducing the Class-Collection Map to visualize the location of each term category
[Figure: features, stopwords and keywords plotted by within-class frequency (vertical axis) against within-collection frequency (horizontal axis)]
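The coordinates for such a map can be computed directly from class-term counts. In this sketch each term of a chosen class gets a point (within-collection relative frequency, within-class relative frequency). The use of relative frequencies is an assumption, and the counts and the name class_collection_map are hypothetical.

```python
from collections import Counter


def class_collection_map(class_tf, cls):
    """Map each term of a class to (collection frequency, class frequency), both relative."""
    collection = Counter()
    for counts in class_tf.values():
        collection.update(counts)
    collection_total = sum(collection.values())
    class_total = sum(class_tf[cls].values())
    return {
        term: (collection[term] / collection_total, n / class_total)
        for term, n in class_tf[cls].items()
    }


# Hypothetical class-term frequencies.
class_tf = {
    "ai": {"the": 4, "fuzzy": 4},
    "db": {"the": 4, "query": 4},
}
points = class_collection_map(class_tf, "ai")
for term, (x, y) in points.items():
    print(term, x, y)
```

The resulting points can be passed to any plotting tool to draw the map for one class.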

Contributions
• A model for metadata representation: the multiple-partition document vector space model
• A class-term model, representing the relationship between classes and the vocabulary
• New fuzzy representations for documents, terms and concepts (word definition and terminology)

Contributions
• A class-collection map to visualize the distribution of terms
• Representing the keyword set of a document by a fuzzy model
• New definitions for fuzzy similarity, fuzzy inclusion, and fuzzy term-relation based on the fuzzy term definition
• A framework for extracting domain-specific stopwords


Data Set
• CiteSeer computer science directory (http://citeseer.ist.psu.edu/directory.html)
• ~400,000 terms (vocabulary size)
• 17 classes
• 2,912 documents
• Instead of the raw data (in PDF or PS), we collected the BibTeX data (a kind of metadata or catalogue) and the abstracts of the articles.

