
Data Mining I

Data Mining I. Jagdish Gangolly, State University of New York at Albany. Topics: What is data mining? Data mining primitives. Task-relevant data. Kinds of knowledge to be mined. Background knowledge. Interestingness measures. Visualisation of discovered patterns. Query language.


Presentation Transcript


  1. Data Mining I Jagdish Gangolly State University of New York at Albany Acc 522 Fall 2001 (Jagdish S. Gangolly)

  2. Data Mining • What is data mining? • Data mining primitives • Task-relevant data • Kinds of knowledge to be mined • Background knowledge • Interestingness measures • Visualisation of discovered patterns • Query language

  3. Data Mining • Concept description (descriptive data mining) • Data generalisation • Data cube (OLAP) approach (offline pre-computation) • Attribute-oriented induction approach (online aggregation) • Presentation of generalisation • Descriptive statistical measures and displays

  4. What is Data mining? • Discovery of knowledge from databases • A set of data mining primitives to facilitate such discovery (what data, what kinds of knowledge, measures to be evaluated, how the knowledge is to be visualised) • A query language for the user to interactively visualise knowledge mined

  5. Data mining primitives I • Task-relevant data: attributes relevant for the study of the problem at hand • Kinds of knowledge to be mined: characterisation, discrimination, association, classification, clustering, evolution, … • Background knowledge: knowledge about the domain of the problem (concept hierarchies, beliefs about the relationships, expected patterns of data, …)

  6. Data mining primitives II • Interestingness measures: support measures (prevalence of the rule pattern) and confidence measures (strength of the implication of the rule) • Visualisation of discovered patterns: rules, tables, charts, graphs, decision trees, cubes, …

  7. Task-relevant Data Steps: • Derivation of the initial relation through database queries (data retrieval operations), i.e. obtaining a minable view • Data cleaning & transformation of the initial relation to facilitate mining • Data mining
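
The three steps on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the `sales` table, its columns, and its rows are hypothetical, and the "mining" step is a trivial per-item count standing in for a real algorithm.

```python
import sqlite3

# Hypothetical sales table; every name and value here is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (customer TEXT, age INTEGER, income REAL, item TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("ann", 34, 52000.0, "laptop"),
        ("bob", None, 41000.0, "phone"),   # missing age
        ("cy", 45, 87000.0, " Laptop "),   # inconsistent formatting
        ("dee", 29, 39000.0, "phone"),
    ],
)

# Step 1: derive the initial relation (the minable view) with a query
# that keeps only the attributes relevant to the task.
initial_relation = conn.execute(
    "SELECT age, income, item FROM sales ORDER BY rowid"
).fetchall()

# Step 2: clean and transform: drop rows with missing values and
# normalise the item names.
cleaned = [
    (age, income, item.strip().lower())
    for age, income, item in initial_relation
    if age is not None and income is not None
]

# Step 3: a trivial stand-in for mining: count purchases per item.
counts = {}
for _, _, item in cleaned:
    counts[item] = counts.get(item, 0) + 1

print(cleaned)
print(counts)
```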

  8. Kinds of knowledge to be mined • Kinds of knowledge & templates (meta-patterns, meta-rules, meta-queries) • Association, an example: age(X: customer, W) ∧ income(X, Y) ⇒ buys(X, Z) • Classification • Discrimination • Clustering • Evolution analysis

  9. Background knowledge • Knowledge from the problem domain, usually in the form of: concept hierarchies (rolling up or drilling down); schema hierarchies (lattices); set-grouping hierarchies (successive sub-grouping of attributes); rule-based hierarchies
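
As a minimal sketch of rolling up along a concept hierarchy, the example below aggregates counts from city to state to country. The hierarchy, the helper `roll_up`, and the sales figures are all hypothetical and chosen only to make the idea concrete.

```python
from collections import Counter

# Hypothetical location hierarchy: city -> state/province -> country.
CITY_TO_STATE = {
    "Albany": "New York",
    "Buffalo": "New York",
    "Chicago": "Illinois",
    "Toronto": "Ontario",
}
STATE_TO_COUNTRY = {"New York": "USA", "Illinois": "USA", "Ontario": "Canada"}


def roll_up(counts, mapping):
    """Aggregate counts one level up the hierarchy described by `mapping`."""
    rolled = Counter()
    for key, n in counts.items():
        rolled[mapping[key]] += n
    return rolled


# Illustrative counts at the lowest level of the hierarchy.
sales_by_city = Counter({"Albany": 5, "Buffalo": 3, "Chicago": 7, "Toronto": 4})

sales_by_state = roll_up(sales_by_city, CITY_TO_STATE)
sales_by_country = roll_up(sales_by_state, STATE_TO_COUNTRY)

print(dict(sales_by_state))
print(dict(sales_by_country))
```

Drilling down is the reverse direction: returning from `sales_by_country` to the more detailed `sales_by_state` or `sales_by_city` view.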

  10. Interestingness measures I • Simplicity: the more complex the structure, the more difficult it is to interpret, and so the less interesting it is likely to be (rule length, …) • Certainty: validity, trustworthiness, measured by $\text{confidence}(A \Rightarrow B) = \dfrac{\#\,\text{tuples containing both } A \text{ and } B}{\#\,\text{tuples containing } A}$, sometimes called the “certainty factor”
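
The confidence ratio on this slide can be computed directly by counting tuples. The sketch below assumes each tuple is represented as a Python set of items; the function name `confidence` and the sample transactions are illustrative, not taken from the course material.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of tuples containing A that also contain B, for the rule A => B."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_a = [t for t in transactions if antecedent <= t]
    if not with_a:
        return 0.0  # convention chosen here: no tuple contains A
    with_both = [t for t in with_a if consequent <= t]
    return len(with_both) / len(with_a)


# Illustrative data: three tuples contain "computer", two of those also "software".
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
]

conf = confidence(transactions, {"computer"}, {"software"})
print(conf)  # 2 of the 3 tuples with "computer" also contain "software"
```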

  11. Interestingness measures II • Utility: support is the percentage of task-relevant data tuples for which the pattern is true: $\text{support}(A \Rightarrow B) = \dfrac{\#\,\text{tuples containing both } A \text{ and } B}{\text{total } \#\,\text{tuples}}$
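
Support can be counted the same way, but against all task-relevant tuples rather than only those containing A. As before, this is a sketch: tuples are assumed to be Python sets, and the function name `support` and the data are hypothetical.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all tuples that contain both A and B, for the rule A => B."""
    items = set(antecedent) | set(consequent)
    if not transactions:
        return 0.0  # convention chosen here: empty data set
    matching = sum(1 for t in transactions if items <= t)
    return matching / len(transactions)


# Illustrative data: two of the four tuples contain both items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
]

supp = support(transactions, {"computer"}, {"software"})
print(supp)  # 0.5
```

A rule is usually reported only when both its support and its confidence clear user-chosen thresholds.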

  12. Visualisation of discovered patterns • Hierarchies • Tables • Pie/bar charts • Dot/box plots • …

  13. Descriptive Data Mining (Concept Description & Characterisation) • Concept description: description of data generalised at multiple levels of abstraction • Concept characterisation: concise and succinct summarisation of a given collection of data • Concept comparison: discrimination

  14. Data Generalisation • Abstraction of task-relevant data to a high conceptual level from a database containing data at relatively low conceptual levels • Data cube (OLAP) approach (offline pre-computation) (Figs 2.1 & 2.2, pages 46 & 47) • Attribute-oriented induction approach (online aggregation) • Presentation of generalisation (Tables 5.3 & 5.4 on p. 191, and Figs 5.2, 5.3, & 5.4 on pages 192 & 193)
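
A rough sketch of the attribute-oriented induction idea: drop attributes that cannot be generalised, replace the remaining values with higher-level concepts, then merge identical generalised tuples while accumulating counts. The student records, the city-to-country mapping, and the five-year age bands are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical low-level records.
students = [
    {"name": "Ann", "age": 19, "city": "Albany"},
    {"name": "Bob", "age": 22, "city": "Buffalo"},
    {"name": "Cy", "age": 27, "city": "Toronto"},
    {"name": "Dee", "age": 20, "city": "Albany"},
]

# Hypothetical concept hierarchy for the city attribute.
CITY_TO_COUNTRY = {"Albany": "USA", "Buffalo": "USA", "Toronto": "Canada"}


def age_range(age):
    """Generalise an exact age to a five-year band such as '20-24'."""
    low = (age // 5) * 5
    return f"{low}-{low + 4}"


def generalise(record):
    """Drop `name` (attribute removal) and generalise age and city."""
    return (age_range(record["age"]), CITY_TO_COUNTRY[record["city"]])


# Merge identical generalised tuples, accumulating a count for each.
generalised = Counter(generalise(r) for r in students)

for (ages, country), count in sorted(generalised.items()):
    print(ages, country, count)
```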

  15. Descriptive Statistical Measures and Displays I • Measures of central tendency: mean, weighted mean (weights signifying importance or occurrence frequency), median, mode • Measures of dispersion: quartiles, outliers, boxplots
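
The measures listed on this slide are available in Python's standard `statistics` module (Python 3.8 or later for `quantiles` and `multimode`). The data values below are made up, `weighted_mean` is a small helper written for this sketch, and flagging values more than 1.5 × IQR beyond the quartiles is one common boxplot convention rather than something the slide specifies.

```python
import statistics

# Illustrative data set.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)  # all most-frequent values

# Quartiles; the exact cut points depend on the interpolation method used.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Common boxplot rule (an assumption here): outliers lie beyond 1.5 * IQR.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(mean, median, modes)
print(q1, q3, iqr, outliers)
print(weighted_mean([80, 90], [1, 3]))
```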

  16. Descriptive Statistical Measures and Displays II • Displays: histograms (Fig 5.6, page 214); bar charts; quantile plot (Fig 5.7, page 215); quantile-quantile plot (Fig 5.8, page 216); scatter plot (Fig 5.9, page 216); loess curve (Fig 5.10, page 217)
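
To make two of these displays concrete, the sketch below computes the points that a quantile plot and a quantile-quantile plot would draw, without any plotting library. It assumes the common convention of pairing the i-th smallest of n values with the fraction (i - 0.5) / n; the function names and sample data are hypothetical.

```python
def quantile_points(values):
    """Return (f, x) pairs for a quantile plot, with f = (i - 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


def qq_points(sample_a, sample_b):
    """Pair corresponding quantiles of two equally sized samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles samples of equal size")
    return list(zip(sorted(sample_a), sorted(sample_b)))


# Illustrative data.
prices_branch_1 = [40, 10, 30, 20]
prices_branch_2 = [25, 45, 15, 35]

print(quantile_points(prices_branch_1))
print(qq_points(prices_branch_1, prices_branch_2))
```

Plotting the first list gives the quantile plot for one sample. Plotting the second against the line y = x shows whether one sample's values tend to sit above the other's.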
