360 likes | 374 Views
LIACS Data Mining course. an introduction. Course Information. Course website: http://datamining.liacs.nl/DaMi/ (will be updated periodically) videos (EN and NL). Course Schedule. Start Tuesday Sept 3 Last lecture Dec 3 Lecture room varies
E N D
LIACS Data Mining course an introduction
Course Information • Course website: http://datamining.liacs.nl/DaMi/ (will be updated periodically) • videos (EN and NL)
Course Schedule • Start Tuesday Sept 3 • Last lecture Dec 3 • Lecture room varies • Pieter de la Court, room 1A20 (with the exception of Oct 1: room SC01) • Van Steenis, room F104 • Sept 17 (in two weeks), no lecture! • Oct 22, no lecture! • Two assignments, to be determined • Practical exercises • Oct 1, second hour • Oct 15, second hour • Exam: Jan 16, 14:00 – 17:00
Course Textbook Data Mining Practical Machine Learning Tools and Techniques third edition, Morgan Kaufmann, ISBN 978-0-12-374856-0 by Ian Witten and Eibe Frank (EUR 45,95 at Amazon.de)
Course participants? • Bachelor Informatica 45 • … & Economie 18 • … & Biologie 5 • Minor Data Science 14 • other Science Faculty 4 • other programmes 2 • PhD students 0 • others? 7 • 95
Introduction Data Mining an overview and some examples
Data Mining definitions Data Mining: the concept of extracting previously unknown and potentially useful, interesting knowledge from large sets of data secondary statistics: analyzing data that wasn’t originally collected for analysis
Data Mining, the big idea • Organizations collect large amounts of data • Often for administrative purposes • Large body of experience • Learning from experience • Goals • Prediction/forecasting • Diagnostics • Optimization • …
2 Streams • Knowledge Discovery • understanding a domain • interpretable models • examples: medicine, production, maintenance • examine details of model • Prediction • don’t care how you do it, just do it well • interpretable model? • black box • examples: marketing, forecasting (financial, weather) • apply model to new data
Data Mining customer model example: Direct Mail Optimize the response to a mailing, by targeting only those that are likely to respond: • more response • fewer letters Customer information response 3% test mailing Customer information response 30% final mailing remainder
example: Bioinformatics • Find genes involved in disease (Parkinson’s, Celiac, Neuroblastoma) • Measurements from patients (1) and controls (0) • Gene expression: measurements of 20k genes • dataset 20,001 x 100 • Challenges • many variables • few examples (patients), testing is expensive • interactions between genes
Taxonomy of the field Data Mining supervised unsupervised regression classification
Supervised(Classification and Regression) • Learning a ‘function’ from input to output • How does output depend on input? • Can we predict the output, given the input? target predictive model
Taxonomy of the field Data Mining supervised unsupervised regression classification target is numeric target is nominal
Taxonomy of the field Data Mining supervised unsupervised outlier detection pattern mining clustering community detection frequent patterns
Data Mining paradigms • Classification • (binary) class variable • predict class of future cases • most popular paradigm • Regression • numeric target variable • Clustering • divide dataset into groups of similar cases • Frequent Patterns/Association • find dependencies between variables • basket analysis, …
0.64 Yes Age < 35 0.4 Rent 0.25 No Age≥ 35 0.51 Yes Price < 200K 0.1 Buy 0.01 No Price≥ 200K 0.07 No Other Classification (decision tree) Predict the class (often 0/1) of an object on the basis of examples of other objects (with a class given). 0.2
Yes Age < 35 Rent No Age≥ 35 Yes Price < 200K Buy No Price≥ 200K No Other Applying a classifier (decision tree) New customer: (House = Rent, Age = 32, …) prediction = Yes
Yes Age < 35 Rent No Age≥ 35 Yes Price < 200K Buy No Price≥ 200K No Other Classification • Tree makes attribute dependencies explicit • Class depends on House • The influence of other attributes is less • Dependencies are (often) fuzzy • multiple attributes are needed • Perfect predictions are rare
y + + + + + + + x < t - + - + y < t’ - + - x t + - - - y t’ + - 0 x Graphical interpretation • dataset with two attributes + 1 class (+/-) • graphical interpretation of decision tree
y + + + + + + + - + - + - + - + - - - + - 0 x Graphical interpretation • dataset with two attributes + 1 class (+/-) • other classifiers Support Vector Machine Neural Network
Applications of DM • Marketing • outgoing • incoming • Bioinformatics & Medicine • Fraud detection • Risk management • Insurance • Enterprise resource planning
Training Data Speed Skating • Speed skating team LottoNL-Jumbo • Detailed historic data • training details • duration • intensity • competition results • Finding patterns of effective training • Visualise data
Kjeld Nuis • 178 races • On average 2.89% above track record • Specialises on 1000 m (2.1%) • 2015-2016 • Dutch champion 1000 m, 1500 m • WC Distances: bronze 1000 m, silver 1500 m • WC Sprint: ‘silver’ • ISU World Cup: gold 1000 m, silver 1500 m • 2017-2018 • Olympic Champion 1000 mand 1500 m
Total sum of load over last 5 days, morning sessions undesired result due to over-training advised upper limit
InfraWatch: monitoring of infrastructure Continuous monitoring of a large bridge ‘Hollandse Brug’ • 145 sensors • time-dependent, at frequencies up to 100 Hz • multi-modal (sensor, video, different freq.) • managing large data quantities, >5 Gb per day
InfraWatch: monitoring of infrastructure • 34 geo-phones (vibration sensors) • 44 embedded strain-gauges, 47 gauges outside • 20 thermometers • video camera • weather station
Maintenance planning at KLM • Routine checks of aircraft • Maintenance requires up to 10k different parts • Ordering parts incurs delay (costs)… • … but so does keeping stock • In theory 10k individual predictions • Input • maintenance history • flight history, Sahara/North Pole • Only few parts predictable
Cashflow Online • Online personal finance overview • All bank transactions are loaded into the application • transactions are classified into different categories • Data Mining predicts category
67 Categories Gas Water Licht Onderhoud huis en tuin Telefoon + Internet + TV Contributie (sport-)verenigingen Levensverzekering / Lijfrente Rente ontvangen Boodschappen Hypotheekrente Naar spaarrekening Geldopname/chipknip Verzekeringen overig Loterijen Cadeau's Interne boeking Vakantie & Recreatie Uitgaan, hobby's en sport Creditcard Ziektekostenverzekering Brandstof Woonhuis / Opstalverzekering Huishouden overig School- en Studiekosten Inkomsten overig Kleding & Schoenen Lenen Openbaar vervoer/Taxi …
Fragmented results: Boodschappen (groceries) Contributie
Decision Tree over all categories true false
Data Mining at LIACS • Applications • bioinformatics (LUMC) • Sports Analytics (LottoNL-Jumbo, PSV) • Hollandse Brug (Strukton, TU Delft, RWS) • fraud detection at Achmea health insurance and NZa • ChartEx, medieval documents (English, Latin) • Complex data • graphical data (molecules) • relational data (criminal careers) • stream data (sensor data, click streams) • …