What Is Data Mining? By Saurabh Jain
General Concept of Data Mining • Most organizations have accumulated a great deal of data, but what they really want is information. • Data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
General Concept of Data Mining (cont’d) • Data mining uses sophisticated statistical analysis and modeling techniques to uncover patterns and relationships hidden in an organization's databases. • Data mining is one of several terms for this activity; others include knowledge discovery, knowledge extraction, data archaeology, information harvesting, and even data dredging.
How Does Data Mining Work? • Algorithms: Associations, Classifications, Sequential discovery, Clustering • Technologies: Neural networks, Decision trees, Rule induction, Data visualization
Algorithms 1. Associations - Identifies items that occur together in a given event or record. - Often used in market analysis to uncover rules hidden between attributes. Ex. “When people buy a hammer, they also buy nails 50% of the time.”
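A minimal Python sketch of how a figure like the 50% in the hammer-and-nails example could be computed. The baskets are made-up illustration data, and `rule_confidence` is a hypothetical helper name, not part of any library.

```python
def rule_confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)


# Made-up transactions: one set of items per purchase.
baskets = [
    {"hammer", "nails"},
    {"hammer", "tape"},
    {"nails", "glue"},
    {"hammer", "nails", "glue"},
    {"hammer"},
]

confidence = rule_confidence(baskets, "hammer", "nails")
print(f"hammer -> nails: {confidence:.0%}")  # prints "hammer -> nails: 50%"
```

Four of the five baskets contain a hammer, and two of those four also contain nails, which gives the 50%.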
Algorithms (cont’d) 2. Classifications This is used to classify database records into a number of predefined classes based on certain criteria. Ex. “Customers with excellent credit history have a debt/equity ratio of less than 10%”
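As a rough illustration of classification, the sketch below learns a single cut-off on the debt/equity ratio from labelled records and then assigns new records to one of two predefined classes. The records, the class names, and the helpers `best_threshold` and `classify` are assumptions made for this example; a real classifier would normally use many attributes.

```python
def best_threshold(records):
    """Pick the debt ratio cut-off that misclassifies the fewest training records.

    Records at or below the cut-off are predicted "excellent", the rest "poor".
    """
    candidates = sorted({ratio for ratio, _ in records})

    def errors(cut):
        return sum(
            (label == "excellent") != (ratio <= cut)
            for ratio, label in records
        )

    return min(candidates, key=errors)


def classify(ratio, cut):
    """Assign a record to one of the two predefined classes."""
    return "excellent" if ratio <= cut else "poor"


# Made-up labelled records: (debt/equity ratio, credit history class).
training = [
    (0.04, "excellent"),
    (0.07, "excellent"),
    (0.09, "excellent"),
    (0.15, "poor"),
    (0.22, "poor"),
    (0.30, "poor"),
]

cut = best_threshold(training)
print(cut, classify(0.05, cut), classify(0.25, cut))
```

On this toy data the learned cut-off sits just under 10%, which mirrors the shape of the rule in the slide's example.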
Algorithms (cont’d) 3. Sequential Discovery This helps identify patterns in time series. Ex. “60% of customers buy TVs followed by 8mm camcorders.”
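One way to arrive at a statement like the TV-then-camcorder example is to look at each customer's purchases in time order. The sketch below assumes made-up purchase histories and a hypothetical helper `followed_by_rate`.

```python
def followed_by_rate(histories, first, then):
    """Among customers who bought `first`, the share who bought `then` afterwards."""
    bought_first = 0
    bought_then_later = 0
    for purchases in histories.values():
        if first not in purchases:
            continue
        bought_first += 1
        first_index = purchases.index(first)
        # Only purchases made after the first item count as "followed by".
        if then in purchases[first_index + 1:]:
            bought_then_later += 1
    return bought_then_later / bought_first if bought_first else 0.0


# Made-up purchase histories, each list ordered by time.
histories = {
    "c1": ["tv", "camcorder"],
    "c2": ["tv", "stereo", "camcorder"],
    "c3": ["camcorder", "tv"],  # camcorder came first, so it does not count
    "c4": ["tv"],
    "c5": ["tv", "camcorder", "tripod"],
    "c6": ["stereo"],
}

rate = followed_by_rate(histories, "tv", "camcorder")
print(f"{rate:.0%} of TV buyers later bought a camcorder")
```

Unlike an association, the order matters here: customer c3 bought both items but is not counted, because the camcorder was bought before the TV.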
Algorithms (cont’d) 4. Clustering This is used to segment the database into different clusters, based on a set of attributes. Ex. Segmenting customers to understand the target market; the resulting clusters can then be used to classify new data.
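The slides do not name a particular clustering method. As one common choice, the sketch below uses plain k-means on two made-up customer attributes (age and annual spend in thousands). The starting centroids are picked by hand to keep the run deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # New centroid = mean of the cluster; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up customers: (age, annual spend in thousands).
customers = [
    (22, 1.0), (25, 1.5), (24, 1.2),
    (50, 8.0), (55, 9.0), (52, 8.5),
]

centroids, clusters = kmeans(customers, centroids=[customers[0], customers[3]])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

A new customer could then be assigned to whichever centroid is nearest, which is how the segments can be reused to classify new data.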
Technologies 1. Neural networks The network is trained on a training dataset and then used to make predictions.
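To make the train-then-predict cycle concrete, here is a sketch of the smallest possible network, a single neuron (a perceptron), learning the logical OR function. This is a stand-in chosen for brevity, not the kind of multi-layer network a real data mining tool would use. The function names and the learning rate are assumptions.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs is positive, otherwise 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train(samples, epochs=20, learning_rate=1.0):
    """Perceptron rule: nudge the weights whenever a prediction is wrong."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Training dataset: inputs and the desired output (logical OR).
training = [
    ((0, 0), 0),
    ((0, 1), 1),
    ((1, 0), 1),
    ((1, 1), 1),
]

weights, bias = train(training)
predictions = [predict(weights, bias, inputs) for inputs, _ in training]
print(predictions)  # prints [0, 1, 1, 1]
```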
Technologies (cont’d) 2. Decision trees • A way of representing a series of rules that lead to a value or class. • The tree segregates the data based on the values of the variables.
Technologies (cont’d) 2. Decision trees (cont’d) - Some tree algorithms cannot use continuous data directly; the values must first be grouped into ranges - Often used more for model understanding than for prediction
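The idea of segregating data by the value of a variable can be sketched as a one-level tree. The code below picks the attribute whose branches are purest and reads the result back as rules. The loan records are made up, and counting misclassified rows is a simplification; real tree algorithms usually rely on measures such as entropy or the Gini index.

```python
from collections import Counter, defaultdict


def split_by(records, attribute):
    """Group the class labels by the value each record has for `attribute`."""
    groups = defaultdict(list)
    for row in records:
        groups[row[attribute]].append(row["label"])
    return groups


def misclassified(groups):
    """Rows that would be wrong if every branch predicted its majority label."""
    return sum(
        len(labels) - Counter(labels).most_common(1)[0][1]
        for labels in groups.values()
    )


def best_split(records, attributes):
    """Choose the attribute whose split leaves the fewest misclassified rows."""
    return min(attributes, key=lambda a: misclassified(split_by(records, a)))


# Made-up loan records with categorical attributes only.
records = [
    {"credit": "good", "owns_home": "yes", "label": "approve"},
    {"credit": "good", "owns_home": "no", "label": "approve"},
    {"credit": "good", "owns_home": "yes", "label": "approve"},
    {"credit": "bad", "owns_home": "yes", "label": "reject"},
    {"credit": "bad", "owns_home": "no", "label": "reject"},
    {"credit": "bad", "owns_home": "yes", "label": "approve"},
]

attribute = best_split(records, ["credit", "owns_home"])
rules = {
    value: Counter(labels).most_common(1)[0][0]
    for value, labels in split_by(records, attribute).items()
}
for value, label in rules.items():
    print(f"IF {attribute} = {value} THEN {label}")
```

A full tree would repeat the same step inside each branch until the branches are pure or too small to split further.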
Technologies (cont’d) 3. Rule induction All possible patterns in the database are systematically extracted, and the accuracy and coverage of each are then calculated. Ex. IF breakfast cereals, THEN milk: accuracy 90%, coverage 15%. IF Friday AND male AND diapers, THEN beer: accuracy 60%, coverage 0.1%.
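The slide does not define accuracy and coverage. The sketch below assumes the usual reading: coverage is the share of all records in which the IF part applies, and accuracy is the share of those records in which the THEN part also holds. The baskets are made up, and only single-item rules are enumerated to keep the example short.

```python
from itertools import permutations


def accuracy_and_coverage(records, antecedent, consequent):
    """Return (accuracy, coverage) for the rule IF antecedent THEN consequent."""
    matching = [r for r in records if antecedent in r]
    if not matching:
        return 0.0, 0.0
    correct = sum(1 for r in matching if consequent in r)
    return correct / len(matching), len(matching) / len(records)


# Made-up shopping baskets.
baskets = [
    {"cereal", "milk"}, {"cereal", "milk", "bread"}, {"cereal", "milk"},
    {"cereal", "juice"}, {"milk", "bread"}, {"bread"}, {"juice"},
    {"milk"}, {"bread", "juice"}, {"eggs"},
]

# Systematically score every IF a THEN b rule over the items seen.
items = sorted(set().union(*baskets))
rules = {
    (a, b): accuracy_and_coverage(baskets, a, b)
    for a, b in permutations(items, 2)
}

accuracy, coverage = rules[("cereal", "milk")]
print(f"IF cereal THEN milk: accuracy {accuracy:.0%}, coverage {coverage:.0%}")
```

Here cereal appears in 4 of the 10 baskets (coverage 40%), and 3 of those 4 also contain milk (accuracy 75%).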
Technologies (cont’d) 4. Data visualization • Graphics tools are used to illustrate data relationships. • Presenting the data as a picture gives users a deeper, more intuitive understanding of it.
Current Limitations • Cost, time, and effort • Data mining setup can be expensive. • Many man-hours of development are needed. • Some functions involve steep learning curves for end users. • Extensive training and practice are still needed for most users.
Current Limitations (cont’d) • Low-end software • These tools have limited query capabilities and cannot perform multidimensional analyses. • Many of the current methods are not truly interactive and cannot incorporate prior knowledge. • Large databases • The large size makes it difficult to find efficient algorithms for association rules.
Conclusion • Data mining assists users in finding patterns and relationships in the data. • Data mining is a powerful tool, not magic. • Organize the large volumes of data into some form of categories. => To avoid GIGO (garbage in, garbage out), data should have minimal missing values. • Current limitations: cost, time, effort, low-end software, and large databases.