
Introduction to R and Data Mining


tuckerc

Presentation Transcript


  1. Introduction to R and Data Mining Chapters 1 and 2

  2. Recommendations by Amazon.com

  3. What is Business Intelligence/Analytics? • Different names • knowledge discovery (mining) in databases (KDD), • knowledge extraction, • data/pattern analysis/analytics, • information harvesting, • business intelligence, • business analytics, • (when you overdo it) data dredging/data fishing/snooping • But the same content • Extraction of potentially useful (yet previously unknown) patterns or knowledge from large volumes of data.

  4. Data dredging/data fishing/snooping (i.e., when you overdo/over-interpret it …) • http://www.businessweek.com/magazine/correlation-or-causation-12012011-gfx.html
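
A minimal sketch of why dredging produces impressive-looking but meaningless patterns: if enough unrelated variables are tested against a target, one of them will correlate strongly by chance. All data here is random noise generated for illustration; the series length (10), the number of candidates (1000) and the seed are arbitrary assumptions.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# A "target" series and 1000 candidate series, all pure noise.
target = [rng.gauss(0, 1) for _ in range(10)]
candidates = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(1000)]

# Dredging: test every candidate and keep only the best-looking one.
correlations = [pearson(c, target) for c in candidates]
best = max(correlations, key=abs)

print(f"strongest correlation found among noise: {best:+.2f}")
```

None of the candidates has any real relationship to the target, yet the strongest one looks convincing. Reporting only that winner, without mentioning the other 999 tests, is the over-interpretation this slide warns about.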

  5. The most important concept – Pattern/Intelligence • Pattern 1: Search Item = Bones → Searcher = dog • It’s highly likely that you are a dog. • On the Internet, no one knows who I am.

  6. Patterns • A pattern is a summarization of data • Relationship, regularity and structure hidden in the data • Examples of patterns • On Thursday nights, people who buy diapers also tend to buy beer • People with good credit ratings are less likely to have accidents • Male consumers, 37+, income bracket 50K-75K → spend between $25-$50 per catalog order • We are interested in patterns that have business value

  7. Revisiting • Would you classify each of the following as pattern analysis or data dredging? • Predicting the price of cars from dealership data regarding purchases, car features, customers, etc. • Predicting the price of cars from likes on Facebook • Car price changes based on car company bankruptcies • Car price changes based on airline fare hikes

  8. Why Mine Data? – Because of Data Explosion • The data explosion problem • Mature database technology, cheap storage, the Internet → a lot of data accumulated in databases, data warehouses, and other information repositories • The complete works of Shakespeare: 5 Megabytes • A good collection of the works of Beethoven: 20 Gigabytes • The print collections of the US Library of Congress: 10 Terabytes • All printed material, ever: 200 Petabytes • New data generated in 2007 on Earth: 2.5 Exabytes • All words ever spoken by human beings: 5 Exabytes • Data rich, yet knowledge poor

  9. Why Mine Data? – Because We Now Can • Data often already exists • Purchases at department/grocery stores (i.e., scanner data, market basket data) • Bank/Credit Card transactions • Web data, e-commerce • Cheap and powerful computers & the Internet make collecting and crunching gigantic datasets possible • Data warehousing and data mining techniques have made remarkable advancements in recent years thus enabling the extraction of information from data

  10. Why Mine Data? – Because It Creates Value • Because it creates value for consumers and businesses • Amazon.com: “Customers Who Bought This Item Also Bought …” • Because your competitors are doing this • Racing to provide the better, customized services that today’s consumers expect (e.g., Netflix crushing Blockbuster, Amazon crushing Barnes & Noble after Borders, …) • But this value is far from fully realized! • To date, only a small portion (~5-10%) of collected data is analyzed • And existing analysis leaves much room for improvement

  11. Data Mining Examples in Real World

  12. BI Example 1 – What goes with What? Market Basket Analysis i.e., Association Rules
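
A minimal sketch of how an association rule such as "diapers → beer" is usually scored, using the standard support, confidence and lift measures. The baskets below are invented for illustration and are not taken from the slides; the function names are likewise hypothetical.

```python
# Hypothetical market baskets, invented for illustration only.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), baskets)
    confidence = both / support(antecedent, baskets)
    lift = confidence / support(consequent, baskets)
    return both, confidence, lift

sup, conf, lift = rule_metrics({"diapers"}, {"beer"}, baskets)
print(f"support={sup:.3f} confidence={conf:.3f} lift={lift:.3f}")
```

Support says how often the items appear together, confidence says how often the consequent appears given the antecedent, and a lift above 1 suggests the two items co-occur more often than they would if bought independently.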

  13. BI Example 2 - Churning • Financial Express article on Verizon • What was Verizon’s churn rate prior to adopting BI? Why does that matter? • Instead of trying to retain an old customer, why not just focus on getting new ones? • If customer retention is important, why not try to retain all who are about to leave? • What did Verizon do? What was the result? • The idea of Classification

  14. BI Example 3 – Clustering Towns • Boston Globe’s home market consists of 200+ towns • Around 2003 the newspaper planned to introduce local editions • One edition per town would have been excessive • Intuitively, “similar” towns can be grouped together to share the same local edition • How to measure similarity? Population, avg income, avg house value, education level, ethnicity, … • The newspaper eventually decided on 12 editorial zones. Local editions proved to be a business success. • The idea of Clustering (we will spend a lot of time on this)
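
A minimal sketch of grouping "similar" towns with k-means on standardized features. This is an illustration of the general idea of clustering, not the method the newspaper used. The town figures are hypothetical, only two features and two clusters are used (rather than 12 zones), and the initial centroids are fixed to the first and last town purely to make the run repeatable.

```python
import math

# Hypothetical town profiles: (name, avg income in $K, avg house value in $K).
towns = [
    ("A", 55, 300), ("B", 58, 320), ("C", 60, 310),
    ("D", 110, 800), ("E", 120, 850), ("F", 115, 780),
]

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1,
    so that no single feature dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
           for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(r, means, sds)) for r in rows]

def kmeans(points, centroids, iterations=20):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its members."""
    for _ in range(iterations):
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        updated = []
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return labels

points = standardize([(income, value) for _, income, value in towns])
labels = kmeans(points, centroids=[points[0], points[-1]])

for (name, _, _), label in zip(towns, labels):
    print(name, "-> zone", label)
```

In practice the choice of features, their scaling, and the number of clusters all shape the result, which is why "how to measure similarity?" is the central question on this slide.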

  15. Example 4 – Credit scoring • Reasons a home mortgage loan is rejected: • FICO score is not good enough • Already having too much debt • House to purchase is likely over-valued • … • Why all the hassle? To root out high-default-risk applicants • What data mining technique works here? • Related example: risk management in insurance • Classification

  16. Example 5 – The Goldcorp Challenge • In 1999, Goldcorp had difficulty mining gold in its 50-year-old, 55,000-acre mine in Red Lake, Ontario • In 2000, it launched the “Goldcorp Challenge” with a total of $575,000 in prizes available to the participants with the best methods. • It shared data (geological and mining history) going back to 1948. • This stunned industry peers, because sharing proprietary geological data with the public had been unthinkable. http://www.ideaconnection.com/open-innovation-success/Open-Innovation-Goldcorp-Challenge-00031.html

  17. Example 5 – The Goldcorp Challenge (Cont.) • 70 submissions were received from geologists, mathematicians, military officers, physicists and (the winners) data miners. • CEO Rob McEwen: “… there were capabilities I had never seen before in the industry. When I saw the computer graphics I almost fell out of my chair.” • Then, 8 million ounces of gold were found • Market capitalization grew from $100M to $9B!

  18. Example 6 - The Netflix Prize • October 2, 2006 to September 21, 2009 • Objective: predict any user’s rating of any movie • $1 million prize for a 10% improvement over Netflix’s (then) current movie recommender (Cinematch) • The winner: BellKor’s Pragmatic Chaos
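
A minimal sketch of what a "10% improvement" means when prediction accuracy is measured by root mean squared error (RMSE), the metric the Netflix Prize used. The ratings and both sets of predictions below are invented for illustration; they are not Netflix or Cinematch data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))

# Hypothetical 1-5 star ratings and two sets of predictions.
actual = [4, 3, 5, 2, 1]
baseline = [3, 3, 4, 3, 2]    # stand-in for the existing recommender
candidate = [4, 3, 4, 2, 2]   # stand-in for a challenger

baseline_rmse = rmse(baseline, actual)
candidate_rmse = rmse(candidate, actual)

# Relative improvement: the fraction by which the challenger lowers RMSE.
improvement = 1 - candidate_rmse / baseline_rmse

print(f"baseline RMSE  = {baseline_rmse:.4f}")
print(f"candidate RMSE = {candidate_rmse:.4f}")
print(f"improvement    = {improvement:.1%}")
```

Because RMSE squares each error before averaging, large misses are penalized more heavily than small ones, so a challenger improves most by fixing its worst predictions.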

  19. Looking Back at These Examples (and more) • Customers • Targeting new ones, retaining existing ones • Identifying high-margin customers and high-risk ones • Products/services • What goes with what, what to recommend (and to whom) • Competition • Monitor competitors (or referees) and market directions • Reports • Summarization, visualization • Non-business applications • Politics, security & terrorism, crime (e.g., money laundering)

  20. How can Analytics Help? • Summarization • Summary statistics • You should already know this • Won’t be covered here • Association Rules • What items are commonly bought together? • Clustering • What cohesive groups of customers do we have? • Classification • How likely is a customer to respond to a marketing campaign? • Categorical or Numerical • Sequence analysis • Time-series analysis, sequential associations • Others: visualization, deviation detection, …

  21. Business Analytics Source: Advanced Analytics: Predictive, Collaborative, and Pervasive, by Gartner

  22. More on Functionalities • Data mining tasks tend to be descriptive or predictive. • Descriptive → Characterizes properties of the data • Who are my best customers? • What items are purchased together? • Which groups of customers have similar buying patterns? • Usually uses unsupervised learning • Predictive → Makes inferences from data for prediction • Classifying loan applicants into risk categories • Predicting which customers are likely to leave • Predicting the direction in which the price of a stock will change • Predicting who will be the next US president • Usually uses supervised learning • Prescriptive → Not covered here

  23. Revisiting • Classify each of the following as descriptive or predictive • Labeling each existing customer as a potential buyer or non-buyer for a new product • Determining a price to offer a customer based on the customer’s profile • Grouping all existing customers into three potential groups (and labeling them later)

  24. Prediction vs. Classification • Prediction: Determining a value (continuous) • Price of a car • Demand for the iPhone • A customer’s lifetime value to a company • Classification: Determining categories • Potential buyer vs. non-buyer • Credit defaulter vs. non-defaulter • White vs. black. • Classification is a subset of prediction (predicting categories)
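
A minimal sketch contrasting the two tasks with a single technique, k-nearest neighbors: averaging the neighbors' values gives a numeric prediction, while a majority vote over the neighbors' labels gives a classification. The customer records, the single "income" feature and the choice of k = 3 are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical past customers: (income in $K, car price paid in $K, label).
history = [
    (30, 12.0, "non-buyer"),
    (35, 14.0, "non-buyer"),
    (40, 15.0, "non-buyer"),
    (70, 28.0, "buyer"),
    (80, 32.0, "buyer"),
    (90, 35.0, "buyer"),
]

def nearest(income, k=3):
    """The k past customers whose income is closest to the query."""
    return sorted(history, key=lambda row: abs(row[0] - income))[:k]

def predict_value(income, k=3):
    """Prediction: a continuous value, here the mean price among neighbors."""
    neighbors = nearest(income, k)
    return sum(price for _, price, _ in neighbors) / len(neighbors)

def classify(income, k=3):
    """Classification: a category, here the majority label among neighbors."""
    votes = Counter(label for _, _, label in nearest(income, k))
    return votes.most_common(1)[0][0]

for income in (33, 75):
    print(income, f"{predict_value(income):.1f}", classify(income))
```

The same neighbors feed both answers. Only the type of the output differs, which is the distinction this slide draws.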

  25. Take Action • The whole point is to take action… • Knowledge discovered from mining data allows for informed decisions. • Examples • Contact targeted customers • Verizon reduced its churn rate from 2% to 1.5% per month • Prioritizing customer service • Cingular and AT&T were fined $1.5 million on Sept. 10, 2004 for differentiating their service based on customers’ credit ratings.
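
A minimal sketch of why a drop in monthly churn from 2% to 1.5% matters more than it first appears: churn compounds month over month. The calculation assumes a constant monthly churn rate and ignores newly acquired customers; the starting base of 1,000,000 customers is a hypothetical figure, not Verizon's.

```python
def annual_retention(monthly_churn, months=12):
    """Fraction of an initial customer base still present after `months`,
    assuming the same churn rate applies every month."""
    return (1 - monthly_churn) ** months

customers = 1_000_000  # hypothetical starting base

before = annual_retention(0.02)    # 2% monthly churn
after = annual_retention(0.015)    # 1.5% monthly churn

print(f"retained after a year at 2.0% churn: {before:.1%}")
print(f"retained after a year at 1.5% churn: {after:.1%}")
print(f"extra customers kept per year: {round(customers * (after - before)):,}")
```

Under these assumptions, annual retention rises from about 78% to about 83%, so half a percentage point per month translates into roughly five percentage points of the customer base per year.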

  26. BI Tool Vendors • SAS (including SAS Enterprise Miner) • RapidMiner • XLMiner • Weka • SAP Business Warehouse • R • IBM SPSS Modeler (Formerly SPSS Clementine) • Microsoft SQL Server Analysis Services • Oracle Data Mining (formerly, 'Darwin') • … • www.kdnuggets.com provides extensive information on BI. Check it out! (http://www.kdnuggets.com/polls/2013/analytics-big-data-mining-data-science-software.html)

  27. Takeaways • BI techniques • Descriptive vs. predictive • Prediction vs. classification • BI process

  28. Chapters and Sections to Read • Chapter 1 complete • Chapter 2: Sections 2.1 to 2.3
