Metadata Mining Masoud Makrehchi Supervisor: Prof. Mohamed Kamel May 2004
Outline • Metadata Mining • Proposed Approach • Experimental Results • Future Research
Metadata • Metadata definition: data about data, for example a library catalogue • Metadata applications: • cataloguing (items and collections) • resource discovery • e-commerce and digital signatures • intelligent software agents • content rating • intellectual property rights (IP rights) • Semantic Web • Learning Objects (LO): metadata standards such as IEEE LOM, DC, SCORM, CANCORE
Metadata (figure, examples of metadata): LO metadata, map legend, RDF, library catalogue
Metadata Specifications (figure: conceptual data architecture)
Metadata Specifications • More structured (usually semi-structured) • Low-dimensional (~400,000 vs. 15,791 terms) • More homogeneous than raw data • Less noisy (stopwords) • Less time-varying
Motivations • No access to the raw data • The raw data is inconvenient for mining (heterogeneous formats and non-text formats) • Diversity of metadata standards and the need to merge different metadata repositories • An attempt to use background knowledge in the mining process
Applications • An alternative or a complementary approach, depending on whether the raw data is accessible • Metadata enrichment (keyword extraction and automatic title generation), towards automatic metadata generation • Semi-structured data mining, such as XML, RDF, OWL, and other semantically tagged scripts • Semi-automatic ontology extraction • Information integration based on metadata (LO aggregation and integration)
Outline • Metadata Mining • Proposed Approach • Experimental Results • Future Research
Assumptions • There is a corpus or a metadata collection, usually labeled • No natural language processing (NLP) and no computational linguistics • No semantic analysis or processing • Metadata is treated as a text document, not as a graph, tree, or other data structure • No dictionaries, thesauri, or other global vocabularies • Only frequency-based measures are used
Problem Statement • Mining metadata as a text document using classical document mining techniques, taking into account the background knowledge represented by the proposed data model, and using a fuzzy approach to capture expert knowledge
Metadata Representation • Metadata as a text document (semi-structured format) • Using only statistical measures • frequency-based measures (the location and the order of words don’t matter)
Metadata Representation (figure label: the title’s objective semantics)
Metadata Representation (figure label: the whole document’s semantics about the title)
Metadata Representation (figure labels: the title’s objective semantics; the title’s subjective semantics)
Metadata Representation • The meaning of a metadata record is broken into a set of objective semantics, which can be easily extracted using a parser (to extract the contents and attributes of every tag) • Adopting the document vector space model (so that all classical document mining techniques can be used) • Metadata representation model = a set of objective semantics + Salton’s vector space model (the multiple-partition document vector space model)
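The parsing step can be sketched as follows. This is a minimal illustration, assuming the metadata record is well-formed XML and that each top-level tag is one partition; the record, its tag names, and the helper names are hypothetical and do not come from the slides.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical metadata record; the tag names are illustrative only.
RECORD = """<record>
  <title>Fuzzy metadata mining</title>
  <author>M. Makrehchi</author>
  <abstract>Mining metadata as a text document using fuzzy sets.</abstract>
</record>"""


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens; word order is discarded."""
    return re.findall(r"[a-z]+", text.lower())


def partition_vectors(xml_text):
    """Return one term-frequency vector per top-level tag (partition)."""
    root = ET.fromstring(xml_text)
    return {child.tag: Counter(tokenize(child.text or "")) for child in root}


vectors = partition_vectors(RECORD)
```

Each value in `vectors` is an ordinary bag-of-words vector, so classical document mining techniques apply to every partition separately.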
Metadata Representation (figure labels: concepts about the domain (discourse); background knowledge; expert; document)
Metadata Representation (figure labels: concepts about the domain (discourse); background knowledge, subjective; background knowledge, objective; expert; abstract; author; subject; title)
Metadata Representation • Metadata as a text document: the mining process is translated into classical document mining. Metadata mining has no advantage here because the background knowledge is ignored.
Metadata Representation • Metadata as a multiple-partition text document: the mining process is translated into classical document mining, while the background knowledge is taken into account.
Metadata Representation • Vector space model: fuzzy representation (figure labels: vocabulary, d_i)
Metadata Representation • Multi-partition vector space model, used to model the objective semantics (a portion of the background knowledge) (figure labels: vocabulary, d_i)
Metadata Representation • Multi-Partition Vector Space Model
Metadata Representation • Converting to standard vector space model
Metadata Representation • Weight of each partition • To be determined by an expert, for example: W_abstract = 1.0, W_title = 1.5
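One plausible way to convert the multi-partition model into a standard vector is a weighted sum of the partition vectors. The slides do not preserve the conversion formula, so the weighted sum below is an assumption, as are the function name and the default weight for unlisted partitions; only the two example weights come from the slide.

```python
from collections import Counter

# Example partition weights from the slide; an expert would set these.
WEIGHTS = {"abstract": 1.0, "title": 1.5}


def combine(partitions, weights):
    """Collapse per-partition term vectors into one weighted term vector."""
    combined = Counter()
    for name, vector in partitions.items():
        w = weights.get(name, 1.0)  # assumption: unlisted partitions weigh 1.0
        for term, value in vector.items():
            combined[term] += w * value
    return dict(combined)


# Hypothetical term frequencies for one document.
doc = {
    "title": {"metadata": 1, "mining": 1},
    "abstract": {"metadata": 2, "fuzzy": 1},
}
vector = combine(doc, WEIGHTS)
```

With these weights, a term in the title counts 1.5 times as much as the same term in the abstract.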
Metadata Representation • Membership degree of each term in every partition • By an expert (considering the vocabulary) • Statistical (considering the collection): • absolute frequency-based measures (like tf-idf) • relative frequency-based (geometric) measures (the location of each term in the partition)
Frequency Measures • Types of Frequency Measures • Within partition • Within document: by document-term frequency (like tf) • Within class: by class-term frequency (like term significance) • Within collection: by collection-term frequency (like mean of term significances)
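The document, class, and collection levels can be illustrated with raw counts. This is only a sketch: the labeled collection is hypothetical, the within-partition level is omitted, and the slides' "term significance" normalization is not reproduced because its formula is not given.

```python
from collections import Counter, defaultdict

# Hypothetical labeled collection: (class label, tokenized document).
DOCS = [
    ("ai", ["fuzzy", "set", "fuzzy"]),
    ("ai", ["fuzzy", "logic"]),
    ("db", ["query", "set"]),
]


def frequency_tables(docs):
    """Count each term within documents, within classes, and in the collection."""
    doc_tf = [Counter(tokens) for _, tokens in docs]  # within document
    class_tf = defaultdict(Counter)                   # within class
    collection_tf = Counter()                         # within collection
    for (label, _), tf in zip(docs, doc_tf):
        class_tf[label].update(tf)
        collection_tf.update(tf)
    return doc_tf, dict(class_tf), collection_tf


doc_tf, class_tf, collection_tf = frequency_tables(DOCS)
```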
Class-Term Matrix • Document-Term Matrix • The matrix is very large (thousands of documents in the collection and millions of terms in the vocabulary) • The matrix is sparse: usually only a small number of its elements are nonzero (Zipf's law) • The matrix is dual with respect to terms and documents
Class-Term Matrix • Class-Term Matrix • The matrix is large (tens of classes and millions of terms in the vocabulary) • The matrix is less sparse • The matrix is dual with respect to terms and classes
Class-Term Matrix • Significance factor: represents the relationship between a term and a class
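A class-term matrix can be sketched as a nested dictionary. The slides do not preserve the significance-factor formula, so the version below assumes one simple choice: the class-term frequency divided by the count of the most frequent term in that class, which keeps values in [0, 1]. The collection and function name are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical labeled collection: (class label, tokenized document).
DOCS = [
    ("ai", ["fuzzy", "set", "fuzzy"]),
    ("ai", ["fuzzy", "logic"]),
    ("db", ["query", "set"]),
]


def class_term_matrix(docs):
    """Rows are classes, columns are terms, values lie in [0, 1]."""
    counts = defaultdict(Counter)
    for label, tokens in docs:
        counts[label].update(tokens)
    matrix = {}
    for label, tf in counts.items():
        peak = max(tf.values())
        # Assumed significance factor: frequency scaled by the class maximum.
        matrix[label] = {term: n / peak for term, n in tf.items()}
    return matrix


matrix = class_term_matrix(DOCS)
```

Because there are far fewer classes than documents, each row aggregates many documents, which is why this matrix is less sparse than the document-term matrix.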
Concept Terminology • Terminology • All terms which occur in a class (or concept) • A fuzzy set of all terms in the vocabulary
Term Definition • Definition • All concepts (classes) to which the term belongs • A fuzzy set of all concepts (classes)
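Terminology and definition are dual views of the same class-term matrix: a row gives a class's terminology, and a column gives a term's definition. The sketch below assumes the matrix already holds membership degrees in [0, 1]; the values and function names are hypothetical.

```python
# Hypothetical class-term matrix of membership degrees in [0, 1].
MATRIX = {
    "ai": {"fuzzy": 1.0, "set": 0.4, "query": 0.0},
    "db": {"fuzzy": 0.1, "set": 0.6, "query": 1.0},
}


def terminology(matrix, label):
    """Fuzzy set over the vocabulary: one row of the matrix."""
    return dict(matrix[label])


def definition(matrix, term):
    """Fuzzy set over the classes: one column of the matrix."""
    return {label: row.get(term, 0.0) for label, row in matrix.items()}


ai_terms = terminology(MATRIX, "ai")
set_classes = definition(MATRIX, "set")
```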
Term Relationships
Fuzzy Similarity • For example, “sum” and “product” are similar if they partially (fuzzily) co-occur. Similarity is a bidirectional relationship.
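The similarity formula itself is not preserved in the slides. As a stand-in, the sketch below uses a common fuzzy Jaccard measure (sum of minima over sum of maxima) applied to two term definitions; the measure, the membership values, and the names are all assumptions.

```python
def fuzzy_similarity(a, b):
    """Fuzzy Jaccard similarity of two fuzzy sets given as dicts."""
    keys = set(a) | set(b)
    inter = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    union = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return inter / union if union else 0.0


# Hypothetical term definitions: membership of each term in each class.
SUM_DEF = {"math": 0.9, "db": 0.3}
PRODUCT_DEF = {"math": 0.8, "db": 0.1}

similarity = fuzzy_similarity(SUM_DEF, PRODUCT_DEF)
```

Swapping the arguments gives the same value, which matches the bidirectional nature of similarity.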
Fuzzy Inclusion • For example, “web” includes “world” if “world” occurs wherever “web” appears. Inclusion is a one-directional relationship.
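The inclusion formula is likewise missing from the slides. The sketch below assumes a standard fuzzy subsethood measure: the overlap of the two fuzzy sets divided by the size of the first. The membership values and names are hypothetical.

```python
def fuzzy_inclusion(a, b):
    """Degree to which fuzzy set `a` is contained in fuzzy set `b`."""
    total = sum(a.values())
    if total == 0:
        return 0.0
    overlap = sum(min(mu, b.get(key, 0.0)) for key, mu in a.items())
    return overlap / total


# Hypothetical term definitions: "world" occurs wherever "web" does.
WEB = {"net": 0.8, "ai": 0.0}
WORLD = {"net": 0.9, "ai": 0.6}

web_in_world = fuzzy_inclusion(WEB, WORLD)
world_in_web = fuzzy_inclusion(WORLD, WEB)
```

Here `web_in_world` is larger than `world_in_web`, which shows why inclusion, unlike similarity, depends on the direction.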
Fuzzy Document Model • Fuzzy document representation • A fuzzy set of all terms in the vocabulary • Obtaining the keyword set for the document: either by a threshold on the fuzzy set or by term categorization
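The threshold option can be sketched as an alpha-cut on the document's fuzzy set. The membership values and the default threshold of 0.5 are illustrative assumptions; the slides do not specify a value.

```python
def keyword_set(doc_fuzzy, threshold=0.5):
    """Return the terms whose membership degree reaches the threshold."""
    return {term for term, mu in doc_fuzzy.items() if mu >= threshold}


# Hypothetical fuzzy representation of one document.
DOC = {"fuzzy": 0.9, "metadata": 0.7, "the": 0.1}

keywords = keyword_set(DOC)
```

Raising the threshold yields a smaller, more selective keyword set.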
Term Categories • Term categorization: categorizing all terms into three main groups • Features: most frequent terms within a class • Keywords: more frequent terms within some documents belonging to a given class • Stopwords: more frequent terms in all classes (figure labels: collection, class, document; stopwords, features, keywords)
Domain-specific Stopwords • Stopwords: terms that do not contribute to the meaning of the document • General stopwords (a, an, the, in, …) • Domain-specific stopwords, for example: • Politics: government, state • Medicine: patient • Education: learner, instructor • Social sciences: society • Anthropology: human
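One simple way to surface domain-specific stopwords is to look for terms that are frequent in every class of the domain. The sketch below is an assumption, not the framework from the slides: it flags a term when it appears in at least `min_share` of the documents of each class. The collection, the threshold, and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical medical collection: (class label, tokenized document).
DOCS = [
    ("cardiology", ["patient", "heart", "valve"]),
    ("cardiology", ["patient", "artery"]),
    ("oncology", ["patient", "tumor"]),
    ("oncology", ["patient", "biopsy", "tumor"]),
]


def domain_stopwords(docs, min_share=0.8):
    """Terms found in at least `min_share` of the documents of every class."""
    sizes = defaultdict(int)
    doc_freq = defaultdict(lambda: defaultdict(int))
    for label, tokens in docs:
        sizes[label] += 1
        for term in set(tokens):  # count each term once per document
            doc_freq[label][term] += 1
    vocabulary = {term for label in doc_freq for term in doc_freq[label]}
    return {
        term
        for term in vocabulary
        if all(doc_freq[label].get(term, 0) / sizes[label] >= min_share
               for label in sizes)
    }


stopwords = domain_stopwords(DOCS)
```

In this toy collection only "patient" qualifies; "tumor" is frequent in one class but absent from the other, so it behaves like a feature rather than a stopword.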
Class-Collection Map • Introducing the class-collection map to visualize the location of each term category (figure: x-axis, within-collection frequency; y-axis, within-class frequency; regions labeled Features, Stopwords, and Keywords)
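The coordinates of such a map can be computed as two relative frequencies per term. This is a minimal sketch with a hypothetical collection; normalizing by total token counts is an assumption, and plotting is left out.

```python
from collections import Counter

# Hypothetical labeled collection: (class label, tokenized document).
DOCS = [
    ("ai", ["fuzzy", "fuzzy", "system"]),
    ("db", ["query", "system", "system"]),
]


def class_collection_map(docs, label):
    """Map each term of `label` to (within-collection, within-class) frequency."""
    in_class = Counter()
    in_collection = Counter()
    for doc_label, tokens in docs:
        in_collection.update(tokens)
        if doc_label == label:
            in_class.update(tokens)
    class_total = sum(in_class.values())
    collection_total = sum(in_collection.values())
    return {
        term: (in_collection[term] / collection_total, n / class_total)
        for term, n in in_class.items()
    }


points = class_collection_map(DOCS, "ai")
```

Each value is an (x, y) pair, so the result can be passed directly to a scatter plot to see where terms fall on the map.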
Contributions • A model for metadata representation: the multiple-partition document vector space model • A class-term model, representing the relationship between classes and the vocabulary • New fuzzy representations for documents, terms, and concepts (word definition and terminology)
Contributions • Class-collection map to visualize the distribution of terms • Representing the keyword set of a document by a fuzzy model • New definitions for fuzzy similarity, fuzzy inclusion, and fuzzy term-relation based on fuzzy term definition • A framework for extracting domain-specific stopwords
Outline • Problem Statement: Metadata Mining • Proposed Approach • Experimental Results • Future Research
Data Set • Citeseer computer science directory (http://citeseer.ist.psu.edu/directory.html) • ~400,000 terms (vocabulary size) • 17 classes • 2,912 documents • Instead of the raw data (in PDF or PS), we collected BibTeX data (a kind of metadata or catalogue) and the abstracts of the articles.