
Data Mining Term Project

Data Mining Term Project. Machine Learning with WEKA: Weka Explorer Tutorial for Version 3.4.3. Svetlana S. Aksenova, Department of Computer Science, California State University, Sacramento, Fall 2004. Machine learning methods for data mining.


Presentation Transcript


  1. Data Mining Term Project: Machine Learning with WEKA. Weka Explorer Tutorial for Version 3.4.3. Svetlana S. Aksenova, Department of Computer Science, California State University, Sacramento, Fall 2004

  2. Machine learning methods for data mining • Use techniques from computer science, statistics and probability, and data visualization to search for patterns and relationships in large data sets • Allow large amounts of data to be analyzed automatically • The results of the analysis can be used to make predictions faster and more accurately • The results of the analysis can be used to make decisions faster and more accurately

  3. About WEKA • Developed by the University of Waikato in New Zealand • Open source software issued under the GNU General Public License • WEKA is a data mining system written in Java • Implements data mining algorithms • Compatible with most computer platforms • Algorithms are applied to a dataset through either the command line or the graphical user interface

  4. Introduction to the Tutorial • Created to help in the learning process • Consists of 8 parts: • Introduction • Launching WEKA • Preprocessing Data • Building “Classifiers” • Clustering Data • Finding Associations • Attribute Selection • Data Visualization

  5. Launching WEKA GUI Chooser – the Main Menu

  6. Preprocessing • Data can be read from a: • Local filesystem (in ARFF, CSV, C4.5, or binary format) • URL • SQL database (using JDBC) • File conversion • Preprocessing window • Preprocessing tools - “filters”

  7. File Conversion CSV ARFF Excel
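
The CSV-to-ARFF step on this slide can be illustrated with a small sketch. It is not WEKA's own converter: the function name `csv_to_arff`, the type-guessing rule (a column is numeric only if every value parses as a number), and the sample rows are assumptions made for illustration.

```python
import csv
import io


def csv_to_arff(csv_text, relation="data"):
    """Convert CSV text (first row is the header) into ARFF text.

    Assumption for this sketch: a column is declared numeric if every
    value parses as a float; otherwise it becomes a nominal attribute
    that lists its distinct values.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        if all(is_number(v) for v in column):
            lines.append("@attribute " + name + " numeric")
        else:
            nominal = "{" + ",".join(sorted(set(column))) + "}"
            lines.append("@attribute " + name + " " + nominal)
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Made-up sample rows, not the contents of any file named in the slides.
sample = """outlook,temperature,play
sunny,85,no
overcast,83,yes
rainy,70,yes
"""
print(csv_to_arff(sample, relation="weather"))
```

The header section (`@relation`, `@attribute`) followed by `@data` is the part a plain CSV file lacks, which is why a conversion step is needed before a CSV export from a spreadsheet can be treated as ARFF.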

  8. Open File (from the local filesystem)

  9. Open File (from a website) http://gaia.ecs.csus.edu/~aksenovs/weather.arff

  10. Preprocessing Window

  11. Setting Filters • WEKA contains filters for discretization, normalization, resampling, attribute selection, transformation and combination of attributes. • Some techniques, such as association rule mining, can only be performed on categorical data.
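
To make the discretization idea concrete, here is a minimal sketch of equal-width binning, one common way of turning a numeric attribute into a categorical one. It is not WEKA's filter implementation; the function name `equal_width_bins`, the `bin1`, `bin2`, ... labels, and the sample values are assumptions for illustration.

```python
def equal_width_bins(values, bins=3):
    """Map each numeric value to a categorical label bin1..binN.

    The range [min, max] is split into `bins` intervals of equal width;
    the maximum value is placed in the last bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    labels = []
    for v in values:
        if width == 0:
            index = 0  # all values identical: a single bin suffices
        else:
            index = min(int((v - lo) / width), bins - 1)
        labels.append("bin" + str(index + 1))
    return labels


# Made-up numeric readings used only to exercise the function.
temperatures = [64, 65, 68, 70, 72, 75, 80, 83, 85]
for value, label in zip(temperatures, equal_width_bins(temperatures, bins=3)):
    print(value, label)
```

After a step like this, every attribute is categorical, which is the precondition the slide mentions for techniques such as association rule mining.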

  12. Filter Configuration Options • Right-click on the filter

  13. Building “Classifiers” • Choosing a classifier J48 (C4.5)

  14. Setting Test Options

  15. Output the Result • The weather data in “weather.arff” is used for classification

  16. Analyzing Results

  17. Visualizing Results

  18. Tree Visualizer

  19. Error Visualizer

  20. Error Visualizer (cont’d)

  21. Exercise • Given at the end of the section • Classification Exercise: Use the ID3 algorithm to classify the weather data from the “weather.arff” file. Perform initial preprocessing and create a version of the initial dataset in which all numeric attributes are converted to categorical data.

  22. Clustering Data • The clustering schemes available in WEKA are k-means, EM, Cobweb, X-means, and FarthestFirst. • The customer data in “customers.arff” is used for clustering

  23. Clustering Data (cont’d) • Choosing clustering scheme • k-means • 5 clusters • Setting test options • Analyzing results
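
The k-means scheme chosen above can be sketched in a few lines. This is a plain illustration of the standard assign-then-update loop, not WEKA's implementation; seeding the centroids with the first k points, the function name `k_means`, and the sample points are assumptions. It needs Python 3.8 or later for `math.dist`.

```python
import math


def k_means(points, k, iterations=100):
    """Cluster points (tuples of numbers) into k groups.

    Assumption for this sketch: the first k points seed the centroids,
    which keeps the run deterministic.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment


# Made-up 2-D points forming two obvious groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, assignment = k_means(points, k=2)
print(centroids)
print(assignment)
```

The number of clusters is an input to the algorithm rather than something it discovers, which is why the slide fixes it (at 5) before the run.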

  24. Visualizing Results

  25. Results of Clustering in ARFF File

  26. Exercise • Given at the end of the section • Clustering Exercise: Use the k-means algorithm to cluster the bank data from the “bank.arff” file. Perform initial preprocessing and create a version of the initial data set in which the ID field is removed and the "children" attribute is converted to categorical data.

  27. Finding Associations • Apriori • works only with discrete data • identifies statistical dependencies between groups of attributes • the example uses grocery store data from the “grocery.arff” file with a minimum confidence of 40% and a minimum support of 30% • Setting test options • Analyzing Results
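
The support and confidence thresholds mentioned above can be made concrete with a short sketch. It computes only the two measures for a given rule, not the full Apriori candidate search; the function names and the basket contents are made up for illustration.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    base = count(transactions, antecedent)
    both = count(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0


# Made-up baskets, not the contents of any file named in the slides.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
# Keep the rule bread -> milk only if it clears both thresholds.
keep = rule_support >= 0.30 and rule_confidence >= 0.40
print(rule_support, rule_confidence, keep)
```

Support filters out item combinations that are too rare to matter, while confidence measures how reliably the rule holds once the antecedent is present.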

  28. Exercise • Given at the end of the section • Association Rules Exercise: Use the Apriori algorithm to generate association rules for the Iris data from the “iris.arff” file. Perform initial preprocessing and create a version of the initial data set in which the numeric attributes are converted to categorical data.

  29. Attribute Selection • searches through all possible combinations of attributes • finds which subset of attributes works best for prediction • contains two parts: • a search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking • an evaluation method: correlation-based, wrapper, information gain, chi-squared • the example uses weather data from the “weather.arff” file
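
One of the pairings listed above, information gain as the evaluation method with ranking as the search method, can be sketched as follows. This is a plain illustration of the textbook formula, not WEKA's evaluator; the function names and the toy rows are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in class entropy obtained by splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def rank_attributes(rows, attributes, target):
    """Return the attributes sorted from most to least informative."""
    return sorted(attributes,
                  key=lambda a: information_gain(rows, a, target),
                  reverse=True)


# Made-up rows: "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
print(rank_attributes(rows, ["windy", "outlook"], "play"))
```

Ranking scores each attribute on its own, so it is cheap but ignores interactions between attributes; the subset searches in the list (best-first, exhaustive, genetic) trade more computation for the ability to evaluate attributes in combination.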

  30. Attribute Selection (cont’d)

  31. Data Visualization • visualize a 2-D plot of the current working relation • determine difficulty of the learning problem

  32. Data Visualization (cont’d)

  33. Selecting Instances • A group of points on the graph can be selected in four ways: • Select Instance • Rectangle • Polygon • Polyline

  34. Select Instance

  35. Rectangle

  36. Polygon

  37. Polyline

  38. Why should we use WEKA? • You can solve a machine learning problem with minimal programming • WEKA includes: • reading of data • implementation of filtering • result evaluation

  39. Performance • Has not been evaluated in this project • Can it process large ARFF files (GB)? • An answer has been found in “wekalist” • Large files can be processed with schemes that are either incrementally trainable or can be made incrementally trainable.

  40. Future Work • Has not been done due to time constraints • ‘Simple CLI’ provides a simple command-line interface and allows direct execution of Weka commands. • ‘KnowledgeFlow’ is a Java-Beans-based interface for setting up and running machine learning experiments.

  41. References 1. I. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, 2000. 2. R. Kirkby, WEKA Explorer User Guide for version 3-3-4, University of Waikato, 2002. 3. Weka Machine Learning Project, http://www.cs.waikato.ac.nz/~ml/index.html. 4. E. Frank, Machine Learning with WEKA, University of Waikato, New Zealand. 5. B. Mobasher, Data Preparation and Mining with WEKA, http://maua.cs.depaul.edu/~classes/ect584/WEKA/association_rules.html, DePaul University, 2003. 6. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.
