Introduction to R and Data Mining Chapters 1 and 2
What is Business Intelligence/Analytics? • Different names • knowledge discovery (mining) in databases (KDD), • knowledge extraction, • data/pattern analysis/analytics, • information harvesting, • business intelligence, • business analytics, • (when you overdo it) data dredging/data fishing/snooping • But the same content • Extraction of potentially useful (yet previously unknown) patterns or knowledge from large volumes of data.
Data dredging/data fishing/snooping (i.e., when you overdo/over-interpret it …) • http://www.businessweek.com/magazine/correlation-or-causation-12012011-gfx.html
The most important concept – Pattern/Intelligence • Pattern 1: Search Item = Bones, Searcher = dog • It’s highly likely that you are a dog. • “On the Internet, no one knows who I am.”
Patterns • A pattern is a summarization of data • Relationship, regularity and structure hidden in the data • Examples of patterns • On Thursday nights people who buy diapers also tend to buy beer • People with good credit ratings are less likely to have accidents • Male consumers, 37+, income bracket 50K-75K spend between $25-$50 per catalog order • We are interested in patterns that have business value
Revisiting • Would you classify each of the following as pattern analysis or data dredging? • Predicting price of cars from dealership data regarding purchases, car features, customers, etc. • Predicting price of cars from likes on Facebook • Car price changes based on car company bankruptcies • Car price changes based on airline fare hikes
Why Mine Data? – Because of Data Explosion • The data explosion problem • Mature database technology, cheap storage, and the Internet mean that a lot of data has accumulated in databases, data warehouses, and other information repositories • The complete works of Shakespeare: 5 Megabytes • A good collection of the works of Beethoven: 20 Gigabytes • The print collections of the US Library of Congress: 10 Terabytes • All printed material, ever: 200 Petabytes • New data generated in 2007 on Earth: 2.5 Exabytes • All words ever spoken by human beings: 5 Exabytes • Data rich, yet knowledge poor
Why Mine Data? – Because We Now Can • Data often already exists • Purchases at department/grocery stores (i.e., scanner data, market basket data) • Bank/Credit Card transactions • Web data, e-commerce • Cheap and powerful computers & the Internet make collecting and crunching gigantic datasets possible • Data warehousing and data mining techniques have made remarkable advancements in recent years thus enabling the extraction of information from data
Why Mine Data? – Because It Creates Value • Because it creates value for consumers and businesses • Amazon.com: “Customers Who Bought This Item Also Bought …” • Because your competitors are doing this • Racing to provide better, customized services that today’s consumers expect (e.g., Netflix crushing Blockbuster, Amazon crushing Barnes & Noble after Borders, …) • But this value is far from fully realized! • To date, only a small portion (~5-10%) of collected data is analyzed • And existing analysis leaves much room for improvement
BI Example 1 – What goes with What? Market Basket Analysis i.e., Association Rules
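A minimal sketch of the three measures usually reported for an association rule such as {diapers} → {beer}: support, confidence, and lift. The transactions are made up for illustration and are not data from the course; the function names are likewise hypothetical.

```python
# Toy market-basket data (hypothetical, for illustration only).
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), baskets)
    conf = both / support(antecedent, baskets)   # P(consequent | antecedent)
    lift = conf / support(consequent, baskets)   # > 1 means bought together more often than by chance
    return both, conf, lift

sup, conf, lift = rule_metrics({"diapers"}, {"beer"}, transactions)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
# prints: support=0.40 confidence=0.67 lift=1.11
```

A lift only slightly above 1, as here, suggests a weak association; with real scanner data one would also set minimum support and confidence thresholds before reading anything into a rule.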
BI Example 2 - Churning • Financial Express article on Verizon • What was Verizon’s churn rate prior to adopting BI? Why does that matter? • Instead of trying to retain an old customer, why not just focus on getting new ones? • If customer retention is important, why not try to retain all who are about to leave? • What did Verizon do? What was the result? • The idea of Classification
BI Example 3 – Clustering Towns • Boston Globe’s home market consists of 200+ towns • Around 2003 the newspaper planned to introduce local editions • One edition for each town would have been excessive • Intuitively, “similar” towns can be grouped together to share the same local edition • How to measure similarity? Population, avg income, avg house value, education level, ethnicity, … • The newspaper eventually decided on 12 editorial zones. Local editions proved to be a business success. • The idea of Clustering (we will spend a lot of time on this)
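One way to make "similar towns" concrete is to standardize the features and group towns by Euclidean distance. The sketch below uses a plain k-means with three clusters on six invented town profiles; the town names, numbers, choice of k, and the deterministic initialization are all assumptions for illustration, not the Boston Globe's actual method or data.

```python
import math

# Hypothetical town profiles: (population, average income in $, average house value in $).
towns = {
    "A": (12_000, 95_000, 620_000),
    "B": (11_000, 99_000, 650_000),
    "C": (85_000, 52_000, 310_000),
    "D": (90_000, 49_000, 295_000),
    "E": (30_000, 70_000, 420_000),
    "F": (28_000, 73_000, 440_000),
}

def standardize(rows):
    """Rescale each column to mean 0 and standard deviation 1 so no feature dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stds)) for row in rows]

def kmeans(points, k, iterations=20):
    """Plain k-means; initial centers are evenly spaced points, chosen deterministically for this toy case."""
    centers = [points[i] for i in range(0, len(points), len(points) // k)][:k]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

names = list(towns)
labels = kmeans(standardize(list(towns.values())), k=3)
zones = {}
for name, lab in zip(names, labels):
    zones.setdefault(lab, []).append(name)
print(sorted(zones.values()))
# prints: [['A', 'B'], ['C', 'D'], ['E', 'F']]
```

Standardizing first matters: without it, house value (hundreds of thousands) would swamp population and income in the distance calculation.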
Example 4 – Credit scoring • Reasons a home mortgage loan is rejected: • FICO score is not good enough • Already having too much debt • House to purchase is likely over-valued • … • Why all the hassle? To root out high-default-risk applicants • What data mining technique works here? • Related example: risk management in insurance • Classification
Example 5 – The Goldcorp Challenge • In 1999, Goldcorp had difficulty mining gold in its 50-year-old 55,000 acre mine in Red Lake, Ontario • In 2000, it launched the “Goldcorp Challenge” with a total of $575,000 in prizes available to the participants with the best methods. • It shared data (geological and mining history) going back to 1948. • This stunned industry peers, since sharing proprietary geological data with the public had been unthinkable before. http://www.ideaconnection.com/open-innovation-success/Open-Innovation-Goldcorp-Challenge-00031.html
Example 5 – The Goldcorp Challenge (Cont.) • 70 submissions were received from geologists, mathematicians, military officers, physicists and, the winners, data miners. • CEO Rob McEwen: “… there were capabilities I had never seen before in the industry. When I saw the computer graphics I almost fell out of my chair.” • Then, 8 million ounces of gold were found • The market capitalization grew from $100M to $9B!
Example 6 - The Netflix Prize • October 2, 2006 to September 21, 2009 • Objective: predict any user’s rating of any movie • $1 million prize for a 10% improvement over Netflix’s (then) current movie recommender (Cinematch) • The winner: BellKor’s Pragmatic Chaos
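The contest scored entries by root mean squared error (RMSE) on held-out ratings, so "a 10% improvement" means a 10% lower RMSE than the incumbent. A rough sketch of that calculation follows; the ratings and both sets of predictions are invented, and `baseline` and `improved` are only stand-ins for Cinematch and a contestant's model.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical ratings on a 1-5 star scale.
actual = [4, 3, 5, 2, 1]
baseline = [3.5, 3.5, 4.0, 3.0, 2.0]   # stand-in for the incumbent recommender
improved = [3.8, 3.2, 4.5, 2.5, 1.5]   # stand-in for a contestant's model

base_err = rmse(baseline, actual)
new_err = rmse(improved, actual)
improvement = (base_err - new_err) / base_err   # relative reduction in RMSE
print(f"baseline RMSE={base_err:.3f}, new RMSE={new_err:.3f}, improvement={improvement:.1%}")
```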
Looking Back at These Examples (and more) • Customers • Targeting new ones, retaining existing ones • Identifying high-margin customers and high-risk ones • Products/services • What goes with what, what to recommend (and to whom) • Competition • Monitor competitors (or referees) and market directions • Reports • Summarization, visualization • Non-business applications • Politics, security & terrorism, crime (e.g., money laundering)
How can Analytics Help? • Summarization • Summary statistics • You should already know this • Won’t be covered here • Association Rules • What items are commonly bought together? • Clustering • What cohesive groups of customers do we have? • Classification • How likely is a customer to respond to a marketing campaign? • Categorical or Numerical • Sequence analysis • Time-series analysis, sequential associations • Others: visualization, deviation detection, …
Business Analytics Source: Advanced Analytics: Predictive, Collaborative, and Pervasive, by Gartner
More on Functionalities • Data mining tasks tend to be descriptive or predictive. • Descriptive: characterizes properties of the data • Who are my best customers? • What items are purchased together? • Which groups of customers have similar buying patterns? • Usually uses unsupervised learning • Predictive: makes inferences from data for prediction • Classifying loan applicants into risk categories • Predicting which customers are likely to leave • Predicting the direction in which the price of a stock will change • Predicting who will be the next US president • Usually uses supervised learning • Prescriptive: not covered here
Revisiting • Classify each of the following as descriptive or predictive • Labeling each existing customer as potential buyer or non-buyer for a new product • Determining a price to offer a customer based on the customer’s profile • Grouping all existing customers into three potential groups (and labeling them later)
Prediction vs. Classification • Prediction: Determining a value (continuous) • Price of a car • Demand for the iPhone • Customer’s lifetime value to a company • Classification: Determining categories • Potential buyer vs. non-buyer • Credit defaulter vs. non-defaulter • White vs. black • Classification is a subset of prediction (predicting categories)
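To make the distinction concrete, here is a small sketch with one model of each kind: a least-squares line that returns a number (a car price) and a nearest-neighbor rule that returns a category (buyer or non-buyer). All data, feature choices, and function names are made up for illustration, and these two methods are just convenient examples, not the only techniques for either task.

```python
# Numeric prediction: fit price = a + b * age by ordinary least squares.
ages = [1, 2, 3, 4, 5]                               # car age in years (made up)
prices = [20_000, 18_000, 16_500, 14_000, 12_500]    # sale price in $ (made up)

n = len(ages)
mean_x, mean_y = sum(ages) / n, sum(prices) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, prices))
     / sum((x - mean_x) ** 2 for x in ages))
a = mean_y - b * mean_x

def predict_price(age):
    """Continuous output: an estimated price in dollars."""
    return a + b * age

# Classification: copy the label of the most similar known customer (1-nearest neighbor).
customers = [  # ((age, income in $), label), all made up
    ((25, 30_000), "non-buyer"),
    ((40, 80_000), "buyer"),
    ((35, 60_000), "buyer"),
    ((22, 25_000), "non-buyer"),
]

def classify(age, income):
    """Categorical output. Income is rescaled to $1,000s so both features influence the distance."""
    def distance(profile):
        return ((profile[0] - age) ** 2 + ((profile[1] - income) / 1000) ** 2) ** 0.5
    return min(customers, key=lambda c: distance(c[0]))[1]

print(predict_price(6))       # a number
print(classify(38, 70_000))   # a category
```

Both models learn from labeled examples, which is why the slide treats classification as a special case of prediction; only the type of the output differs.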
Take Action • The whole point is to take action… • Knowledge discovered from mining data allows for informed decisions. • Examples • Contact targeted customers • Verizon reduced its churn rate from 2% to 1.5% per month • Prioritizing customer service • Cingular and AT&T were fined $1.5 million on Sept. 10, 2004 for discriminating in their services based on customers’ credit ratings.
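A half-point drop in monthly churn looks small until it is compounded over a year. The arithmetic can be sketched as below, assuming the monthly churn rate stays constant and ignoring new sign-ups; the one-million subscriber base is a hypothetical round number, not Verizon's actual customer count.

```python
def annual_retention(monthly_churn):
    """Share of customers still present after 12 months at a constant monthly churn rate."""
    return (1 - monthly_churn) ** 12

before = annual_retention(0.02)    # 2% monthly churn
after = annual_retention(0.015)    # 1.5% monthly churn
print(f"retained after a year: {before:.1%} vs {after:.1%}")

# Hypothetical base of 1,000,000 subscribers, used only to make the difference concrete.
subscribers = 1_000_000
extra_kept = round((after - before) * subscribers)
print(f"extra customers kept per million: {extra_kept:,}")
```

Under these assumptions, roughly five additional customers out of every hundred are still around at year end.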
BI Tool Vendors • SAS (including SAS Enterprise Miner) • RapidMiner • XLMiner • Weka • SAP Business Warehouse • R • IBM SPSS Modeler (Formerly SPSS Clementine) • Microsoft SQL Server Analysis Services • Oracle Data Mining (formerly, 'Darwin') • … • www.kdnuggets.com provides extensive information on BI. Check it out! (http://www.kdnuggets.com/polls/2013/analytics-big-data-mining-data-science-software.html)
Takeaways • BI techniques • Descriptive vs. predictive • Prediction vs. classification • BI process
Chapters and Sections to Read • Chapter 1 complete • Chapter 2: Section 2.1 to 2.3