
Data mining in Wikipedia

2011-09-26, Sinchoo Kim


Presentation Transcript


  1. Data mining in Wikipedia 2011-09-26 Sinchoo Kim

  2. Terms • Data: unorganized and unprocessed facts • Information: data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions • Knowledge: application of data and information; answers "how" questions

  3. KDD • KDD (Knowledge Discovery in Databases): the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data • Steps, in order: (1) Selection (2) Preprocessing (3) Transformation (4) Data Mining (5) Interpretation/Evaluation
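The five KDD steps can be read as a chain of functions, each consuming the output of the previous one. The following is a minimal sketch only: the records, the field names, and the trivial "pattern" (a mean salary) are hypothetical stand-ins, not part of the original slides.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"customer": "a", "salary": 35000, "offer": True},
    {"customer": "b", "salary": None, "offer": True},
    {"customer": "c", "salary": 82000, "offer": False},
    {"customer": "d", "salary": 51000, "offer": True},
]

def select(records):
    """Selection: keep only the records relevant to the question."""
    return [r for r in records if r["offer"]]

def preprocess(records):
    """Preprocessing: drop records with missing values."""
    return [r for r in records if r["salary"] is not None]

def transform(records):
    """Transformation: derive a feature suited to mining (salary in thousands)."""
    return [{**r, "salary_k": r["salary"] / 1000} for r in records]

def mine(records):
    """Data mining: a trivial stand-in 'pattern' (the mean salary)."""
    return {"mean_salary_k": mean(r["salary_k"] for r in records)}

def evaluate(pattern):
    """Interpretation/Evaluation: decide whether the pattern is worth reporting.
    The threshold of 50 is an arbitrary assumption of this sketch."""
    return pattern["mean_salary_k"] < 50

def kdd(records):
    """Run the five steps in order and return the pattern and its evaluation."""
    pattern = mine(transform(preprocess(select(records))))
    return pattern, evaluate(pattern)

pattern, interesting = kdd(RAW)
print(pattern, interesting)
```

In a real project the mining step would be a clustering, classification, or rule-learning algorithm; the point here is only the order of the stages.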

  4. Data mining • Definition: the analysis step of the Knowledge Discovery in Databases process; discovering previously unknown patterns • Example: home equity loan

  5. Case : Home equity loan • Select the subset of customer records for customers who have received a home equity loan offer

  6. Case : Home equity loan • Find rules to predict whether or not a customer would respond to a home equity loan offer • Example rule: IF (Salary < 40k) and (numChildren > 0) and (ageChild1 > 18 and ageChild1 < 22) THEN YES
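The example rule on this slide translates directly into a predicate over one customer record. This is a sketch under assumptions: the function and parameter names are hypothetical, "40k" is read as 40,000 in an unspecified currency, and a missing first-child age is represented as `None`.

```python
def likely_responder(salary, num_children, age_child1):
    """Return True when the slide's rule predicts a response (THEN YES).

    age_child1 is None when the customer has no children
    (a convention assumed by this sketch, not stated on the slide).
    """
    return (
        salary < 40_000
        and num_children > 0
        and age_child1 is not None
        and 18 < age_child1 < 22
    )

print(likely_responder(35_000, 2, 19))    # rule fires
print(likely_responder(35_000, 0, None))  # no children, rule does not fire
```

A mining tool would learn many such rules from past responses; this one only encodes the single rule shown.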

  7. Case : Home equity loan • Group customers into clusters and investigate the clusters (figure labels: Group 1, Group 2, Group 3, Group 4)

  8. Case : Home equity loan • Evaluate results • Many “uninteresting” clusters • One interesting cluster! Customers with both business and personal accounts; unusually high percentage of likely respondents
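One way to make "interesting cluster" concrete is to compare each cluster's response rate with the overall rate. The sketch below assumes the clustering has already been done; the data, the group labels as dictionary keys, and the 1.5x "lift" threshold are all hypothetical choices for illustration.

```python
from collections import defaultdict

# Hypothetical (cluster label, responded to offer) pairs.
CUSTOMERS = [
    ("Group 1", False), ("Group 1", False), ("Group 1", True), ("Group 1", False),
    ("Group 2", False), ("Group 2", False), ("Group 2", False),
    ("Group 3", True), ("Group 3", True), ("Group 3", False),
    ("Group 4", False), ("Group 4", True), ("Group 4", False), ("Group 4", False),
]

def response_rates(customers):
    """Fraction of responders within each cluster."""
    counts = defaultdict(lambda: [0, 0])  # label -> [responders, total]
    for label, responded in customers:
        counts[label][0] += int(responded)
        counts[label][1] += 1
    return {label: hits / total for label, (hits, total) in counts.items()}

def interesting_clusters(customers, lift=1.5):
    """Clusters whose response rate is at least `lift` times the overall rate."""
    overall = sum(responded for _, responded in customers) / len(customers)
    rates = response_rates(customers)
    return sorted(label for label, rate in rates.items() if rate >= lift * overall)

print(response_rates(CUSTOMERS))
print(interesting_clusters(CUSTOMERS))
```

With this made-up data only one group clears the threshold, mirroring the slide's "many uninteresting clusters, one interesting cluster" outcome.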

  9. Common classes of tasks • Association rule learning: searches for relationships between variables • Clustering: discovers groups and structures in the data that are in some way similar • Anomaly detection: identifies unusual data records
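For association rule learning, two standard measures of a rule "A implies B" are support (how often A and B occur together) and confidence (how often B occurs when A does). The sketch below computes both on a hypothetical list of shopping baskets; the items and function names are illustrative assumptions.

```python
# Hypothetical transactions: each set is one shopping basket.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= basket for basket in transactions)

def support(itemset, transactions):
    """Fraction of transactions containing the whole itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

print(support({"bread", "milk"}, TRANSACTIONS))
print(confidence({"bread"}, {"milk"}, TRANSACTIONS))
```

Here "bread implies milk" holds in 3 of the 4 baskets that contain bread. Practical rule miners such as Apriori search for all itemsets above chosen support and confidence thresholds rather than checking one rule at a time.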

  10. Common classes of tasks • Classification: generalizes known structure to apply to new data • Regression: finds a function that models the data with the least error • Summarization: provides a more compact representation of the data set
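The regression bullet can be illustrated with the simplest case, fitting a straight line by ordinary least squares. This is a sketch assuming a single input variable and squared error as the error measure; the slide itself does not specify either, and the data points are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.

    Minimizes the sum of squared vertical distances between
    the line and the data points.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

Because the sample points lie exactly on a line, the fit recovers slope 2 and intercept 1; with noisy data the same formulas give the line of least squared error.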

  11. Notable uses • Business • Customer management • Marketing: identifying purchase patterns • Human resources: identifying the characteristics of the most successful employees • Decision-making support: integrated-circuit production lines

  12. Notable uses • Science and engineering • Human genetics: relation between genetics and diseases • Electrical power engineering: detecting abnormal conditions and estimating the nature of the abnormalities

  13. Notable uses • Visual data mining • Large data sets have been generated, collected, and stored • Finds trends and information hidden in those data sets

  14. Issues • Reliable data sets • Overfitting: finding patterns in the training set that are not present in the general data set
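Overfitting can be shown with a tiny experiment: a model that memorizes the training set, noise included, scores perfectly on it but poorly on new data, while a simpler rule does the opposite. The data, the two noisy labels, and both models below are hypothetical and chosen only to make the effect visible.

```python
# Hypothetical 1-D data: the true label is 1 when x >= 5.
# Two training labels, (3.0, 1) and (8.0, 0), are deliberately noisy.
TRAIN = [(1.0, 0), (2.0, 0), (3.0, 1), (6.0, 1), (7.0, 1), (8.0, 0)]
TEST = [(1.2, 0), (2.9, 0), (6.4, 1), (8.1, 1)]

def nearest_neighbor(x):
    """Memorizing model: copy the label of the closest training point."""
    return min(TRAIN, key=lambda point: abs(point[0] - x))[1]

def threshold_rule(x):
    """Simple model: one cut at x = 5, ignoring individual noisy points."""
    return int(x >= 5.0)

def accuracy(model, data):
    """Fraction of points in `data` that the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

for name, model in [("nearest neighbor", nearest_neighbor),
                    ("threshold rule", threshold_rule)]:
    print(name, accuracy(model, TRAIN), accuracy(model, TEST))
```

The memorizing model reproduces the noisy labels near x = 3 and x = 8, which is exactly the kind of training-set pattern that does not hold in the general data. Evaluating on held-out data is what exposes it.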

  15. Issues • Privacy concerns and ethics • The term "data mining" itself has no ethical implications • Compiling data can enable anyone with access to the newly compiled data set to identify specific individuals, especially when the data were originally anonymous
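The re-identification risk can be sketched as a join between an "anonymous" data set and a second, named data set on fields they share. Everything below is hypothetical: the records, the names, and the choice of ZIP code, birth year, and sex as linking fields are assumptions made for illustration.

```python
# Hypothetical "anonymous" records: no names, but some descriptive fields remain.
ANONYMIZED = [
    {"zip": "02138", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical public list that includes names and the same descriptive fields.
PUBLIC = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
]

LINK_FIELDS = ("zip", "birth_year", "sex")

def reidentify(anonymized, public):
    """Return {name: diagnosis} for records matching exactly one named person."""
    index = {}
    for person in public:
        key = tuple(person[field] for field in LINK_FIELDS)
        index.setdefault(key, []).append(person["name"])
    matches = {}
    for record in anonymized:
        names = index.get(tuple(record[field] for field in LINK_FIELDS), [])
        if len(names) == 1:  # a unique match identifies the individual
            matches[names[0]] = record["diagnosis"]
    return matches

print(reidentify(ANONYMIZED, PUBLIC))
```

Neither data set reveals a named person's diagnosis on its own; the combination does whenever the shared fields single out one person. The second record stays ambiguous only because two people share the same field values.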
