Data Mining Term Project

Data Mining Term Project: Machine Learning with WEKA. Weka Explorer Tutorial for Version 3.4.3. Svetlana S. Aksenova, Department of Computer Science, California State University, Sacramento. Fall 2004

Machine learning methods for data mining • Use techniques from computer science, statistics and probability, and data visualization to search for patterns and relationships in large data sets • Allow large amounts of data to be analyzed automatically • The results of the analysis support faster and more accurate predictions • The results of the analysis support faster and more accurate decisions

About WEKA • Developed by the University of Waikato in New Zealand • Open source software issued under the GNU General Public License • A data mining system written in Java • Implements data mining algorithms • Compatible with most computer platforms • Algorithms are applied to a dataset through either the command line or the graphical user interface

Introduction to the Tutorial • Created to help in the learning process • Consists of 8 parts: • Introduction • Launching WEKA • Preprocessing Data • Building “Classifiers” • Clustering Data • Finding Associations • Attribute Selection • Data Visualization

Launching WEKA GUI Chooser – the Main Menu

Preprocessing • Data can be read from a • Local filesystem (in ARFF, CSV, C4.5, binary formats) • URL • SQL database (using JDBC) • File conversion • Preprocessing window • Preprocessing tools - “filters”

File Conversion CSV ARFF Excel
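As a rough illustration of the CSV-to-ARFF step, the sketch below builds an ARFF header from a CSV header row and copies the data rows underneath it. It is not WEKA's own converter: the function name `csv_to_arff`, the relation name, and the three sample rows are made up for this example, and the type detection is deliberately naive (a column is treated as numeric only if every value parses as a number; anything else becomes a nominal attribute).

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row first) into ARFF text.

    Hypothetical helper for illustration only. It ignores missing values
    and does not quote names or values that contain spaces.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list distinct values in order of appearance.
            distinct = list(dict.fromkeys(values))
            lines.append("@attribute %s {%s}" % (name, ",".join(distinct)))
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Made-up sample rows in the style of a weather dataset.
sample_csv = "outlook,temperature,play\nsunny,85,no\novercast,83,yes\nrainy,70,yes\n"
arff_text = csv_to_arff(sample_csv, "weather_sample")
print(arff_text)
```

The output starts with an `@relation` line, then one `@attribute` line per column, then `@data` followed by the unchanged CSV rows, which is the layout an ARFF file uses.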

Open File (from the local filesystem)

Open File (from a website) http://gaia.ecs.csus.edu/~aksenovs/weather.arff

Preprocessing Window

Setting Filters • WEKA contains filters for discretization, normalization, resampling, attribute selection, transformation and combination of attributes. • Some techniques, such as association rule mining, can only be performed on categorical data.
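To make the idea of discretization concrete, here is a minimal sketch of equal-width binning, one common way to turn a numeric attribute into a categorical one. It is an assumption-laden illustration, not WEKA's filter: the function name `equal_width_bins`, the `binN` labels, and the sample temperatures are invented for this example.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to one of n_bins equal-width interval labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything falls in one bin
        else:
            # The maximum value would land in bin n_bins, so clamp it.
            index = min(int((v - lo) / width), n_bins - 1)
        labels.append("bin%d" % (index + 1))
    return labels


# Made-up temperature readings.
temperatures = [64, 70, 75, 85]
labels = equal_width_bins(temperatures, 3)
print(labels)
```

After a step like this, the attribute holds a small set of labels instead of numbers, which is the form that techniques restricted to categorical data require.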

Filter Configuration Options • Right-click on the filter

Building “Classifiers” • Choosing a classifier J48 (C4.5)

Setting Test Options

Output the Result • The weather data in “weather.arff” was used for classification

Analyzing Results

Visualizing Results

Tree Visualizer

Error Visualizer

Error Visualizer (cont’d)

Exercise • Given at the end of the section • Classification Exercise: Use the ID3 algorithm to classify the weather data from the “weather.arff” file. Perform initial preprocessing and create a version of the initial dataset in which all numeric attributes are converted to categorical data.
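ID3 works on categorical attributes and chooses the attribute to split on by information gain, which is why the numeric attributes have to be converted first. The sketch below computes that measure on a tiny made-up table; the four rows are hypothetical and are not the contents of “weather.arff”, and the function names are chosen for this example only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in class entropy obtained by splitting rows on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy rows, for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy table `outlook` separates the classes perfectly while `windy` tells nothing about them, so a tree builder using this criterion would split on `outlook` first.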

Clustering Data • The clustering schemes available in WEKA are k-Means, EM, Cobweb, X-means, and FarthestFirst • Customer data from “customers.arff” was used for clustering

Clustering Data (cont’d) • Choosing clustering scheme • k-means • 5 clusters • Setting test options • Analyzing results
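For readers who want to see what the k-means scheme does with the data, here is a bare-bones sketch of the algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. It is not WEKA's implementation. In particular, it seeds the centroids with the first k points (a simplifying assumption), runs a fixed number of iterations, and uses six invented 2-D points with k = 2 rather than the customer data and 5 clusters used in the tutorial.

```python
def k_means(points, k, iterations=10):
    """Plain k-means on tuples of numbers; returns (centroids, assignment)."""
    # Naive initialization (assumption for this sketch): the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid by squared distance.
        assignment = []
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignment.append(distances.index(min(distances)))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment


# Invented points forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, assignment = k_means(points, k=2)
print(centroids, assignment)
```

The returned `assignment` list plays the same role as the cluster label that gets attached to each instance when clustering results are saved.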

Visualizing Results

Results of Clustering in ARFF File

Exercise • Given at the end of the section • Clustering Exercise: Apply the k-means algorithm to the bank data from the “bank.arff” file. Perform initial preprocessing and create a version of the initial data set in which the ID field is removed and the "children" attribute is converted to categorical data.

Finding Associations • Apriori • works only with discrete data • identifies statistical dependencies between groups of attributes • was run on the grocery store data from the “grocery.arff” file with minimum confidence 40% and minimum support 30% • Setting test options • Analyzing Results
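The two thresholds mentioned above, support and confidence, can be computed directly from a list of transactions. The sketch below shows only these two measures, not Apriori's candidate-generation procedure; the five transactions are invented and are not taken from “grocery.arff”, and the function names are chosen for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Invented shopping baskets, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})

# Keep the rule "bread => milk" only if it clears both thresholds.
keeps_rule = rule_support >= 0.30 and rule_confidence >= 0.40
print(rule_support, rule_confidence, keeps_rule)
```

Here bread and milk appear together in 3 of 5 baskets, and 3 of the 4 baskets containing bread also contain milk, so the rule clears thresholds of 30% support and 40% confidence.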

Exercise • Given at the end of the section • Association Rules Exercise: Use the Apriori algorithm to generate association rules for the Iris data from the “iris.arff” file. Perform initial preprocessing and create a version of the initial data set in which the numeric attributes are converted to categorical data.

Attribute Selection • searches through possible combinations of attributes • finds which subset of attributes works best for prediction • contains two parts: • a search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking • an evaluation method: correlation-based, wrapper, information gain, chi-squared • the weather data from the “weather.arff” file was used
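The split into a search method and an evaluation method can be sketched as a search routine that takes the evaluator as a parameter. The example below implements greedy forward selection, one of the search methods listed above, under stated assumptions: the `toy_score` evaluator and its `usefulness` numbers are entirely hypothetical stand-ins for a real evaluation method such as a wrapper or correlation-based measure, and none of the names correspond to WEKA classes.

```python
def forward_selection(attributes, score):
    """Greedy forward selection: repeatedly add the attribute that raises
    the subset score the most, and stop when no addition improves it."""
    selected = []
    best_score = score(selected)
    improved = True
    while improved:
        improved = False
        best_candidate = None
        for candidate in attributes:
            if candidate in selected:
                continue
            trial = score(selected + [candidate])
            if trial > best_score:
                best_score, best_candidate = trial, candidate
                improved = True
        if improved:
            selected.append(best_candidate)
    return selected, best_score


# Hypothetical per-attribute usefulness values, invented for this sketch.
usefulness = {"outlook": 0.5, "humidity": 0.3, "windy": 0.1, "temperature": 0.0}


def toy_score(subset):
    # Stand-in evaluator: reward useful attributes, penalize subset size.
    return sum(usefulness[a] for a in subset) - 0.15 * len(subset)


selected, final_score = forward_selection(list(usefulness), toy_score)
print(selected, final_score)
```

Swapping in a different `score` function changes the evaluation method without touching the search, which mirrors how the two parts are chosen independently.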

Attribute Selection (cont’d)

Data Visualization • visualize a 2-D plot of the current working relation • determine difficulty of the learning problem

Data Visualization (cont’d)

Selecting Instances • A group of points on the graph can be selected in four ways: • Select Instance • Rectangle • Polygon • Polyline

Select Instance

Rectangle

Polygon

Polyline

Why should we use WEKA? • You can solve a machine learning problem with a minimum of programming • WEKA includes • reading of data, • implementation of filtering, • result evaluation

Performance • Has not been evaluated in this project • Can it process large ARFF files (GB)? • An answer has been found in “wekalist” • It can, for schemes that are either incrementally trainable or can be made to be

Future Work • Has not been done due to time constraints • ‘Simple CLI’ provides a simple command-line interface and allows direct execution of Weka commands. • ‘KnowledgeFlow’ is a Java-Beans-based interface for setting up and running machine learning experiments.

References 1. I. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, 2000. 2. R. Kirkby, WEKA Explorer User Guide for Version 3-3-4, University of Waikato, 2002. 3. Weka Machine Learning Project, http://www.cs.waikato.ac.nz/~ml/index.html. 4. E. Frank, Machine Learning with WEKA, University of Waikato, New Zealand. 5. B. Mobasher, Data Preparation and Mining with WEKA, http://maua.cs.depaul.edu/~classes/ect584/WEKA/association_rules.html, DePaul University, 2003. 6. M. H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2002.
