
Probability & Statistical Inference Lecture 1

Probability & Statistical Inference Lecture 1. MSc in Computing (Data Analytics). Lecture Outline. Introduction General Info Questionnaire Introduction to Statistics Statistics at work The Analytics Process Descriptive Statistics & Distributions Graphs and Visualisation. Introduction.


Presentation Transcript


  1. Probability & Statistical Inference Lecture 1 MSc in Computing (Data Analytics)

  2. Lecture Outline • Introduction • General Info • Questionnaire • Introduction to Statistics • Statistics at work • The Analytics Process • Descriptive Statistics & Distributions • Graphs and Visualisation

  3. Introduction • Name: Aoife D’Arcy • Email: aoife@theanalyticsstore.com • Bio: Managing Director and Chief Consultant at the Analytics Store, with degrees in statistics, computer science, and financial & industrial mathematics. Over 12 years of experience in analytics consultancy with major national and international companies in banking, finance, insurance, manufacturing and gaming, with particular expertise in risk analytics, fraud analytics, and customer insight analytics. • Lecture Notes: Will be available online at www.comp.dit.ie/bmacnamee and later on webcourses

  4. Course Outline

  5. Exam & Assignment • Exam: The end of term exam accounts for 60% of the overall mark • Assignment: The assignment is worth 40% of the overall mark • The assignment will be handed out in week 5 • Week 9’s class will be dedicated to working on the assignment

  6. Software • SAS Enterprise Guide will be used during the course.

  7. Recommended Reading • Applied Statistics and Probability for Engineers, Douglas C. Montgomery, John Wiley & Sons • Modelling Binary Data, David Collett, Chapman & Hall • Probability and Statistics for Engineers and Scientists, R.E. Walpole, R.H. Myers, S.L. Myers, K. Ye, Pearson Education • Probability and Random Processes, G. Grimmett & D. Stirzaker, Oxford University Press • Statistical Inference, George Casella, Brooks/Cole

  8. Questionnaire

  9. Section 1: Statistics are everywhere

  10. We are bombarded with Statistics • http://www.irishtimes.com/newspaper/frontpage/2012/0918/1224324122326.html • http://www.irishtimes.com/newspaper/world/2012/0914/1224324008884.html • http://www.independent.ie/business/world/survey-names-oslo-the-worlds-priciest-city-ireland-ranks-27th-3229426.html

  11. The internet is full of interesting statistics http://www.usatoday.com/news/politics/twitter-election-meter

  12. Statistics can be misleading • An ad claimed: “9 Out of 10 Dentists prefer Colgate” • What is wrong with this statement? • Consider these complaints about airlines published in US News and World Report on February 5, 2001 • Can we conclude that United Airlines has the worst customer service?

  13. Statistics in Everyday Life • With the increase in the amount of data available and advancements in the power of computers, statistics are being used more and more frequently • Question: Is it good that statistics are used so much, and what happens when statistics are misused?

  14. Statistics can be misleading

  15. Misinterpreted Statistics can be Devastating • In 1999 Sally Clark was wrongly convicted of the murder of two of her sons. The case was widely criticised because of the way statistical evidence was misrepresented in the original trial, particularly by paediatrician Professor Sir Roy Meadow. • He claimed that, for an affluent non-smoking family like the Clarks, the probability of a single cot death was 1 in 8,543, so the probability of two cot deaths in the same family was around "1 in 73 million" (8543 × 8543). • What is wrong with this assumption?
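The arithmetic behind the "1 in 73 million" figure can be checked with a short Python sketch. Squaring the single-death probability is only valid if the two deaths are independent events. The `dependence_factor` below is a purely hypothetical number chosen for illustration, not an estimate from the case.

```python
from fractions import Fraction

# Probability of a single cot death, as quoted on the slide.
p_single = Fraction(1, 8543)

# Multiplying the probability by itself assumes the two deaths are independent.
p_two_if_independent = p_single * p_single
print(p_two_if_independent.denominator)  # 72982849, i.e. about 1 in 73 million

# HYPOTHETICAL illustration: suppose shared genetic or environmental factors
# made a second death 10 times more likely once a first had occurred.
dependence_factor = 10  # assumed value, for illustration only
p_two_if_dependent = p_single * (p_single * dependence_factor)
print(round(1 / p_two_if_dependent))  # about 1 in 7.3 million
```

Even a modest dependence between the two events changes the headline figure by an order of magnitude, which is why the independence assumption matters.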

  16. Video • http://www.youtube.com/watch?v=4TKbIidbyhk&feature=fvwrel

  17. Challenges • As an Analytics practitioner you will face a number of challenges: • Create insight from all available data (and there is lots of it) • Interpret statistics correctly • Communicate statistically driven insight in a way that is clearly understood

  18. Objective of this course • Give you a set of statistical skills to allow you, as an analytics practitioner, to turn data into insight!

  19. The Analytics Process & Statistics

  20. Section Overview • Statistics and Analytics • Introduction to CRISP

  21. Data Analytics Is Multidisciplinary Statistics Pattern Recognition Neurocomputing Data Warehousing Machine Learning AI Predictive Analytics Databases KDD

  22. Analytics Process

  23. Analytics Is A Lot Of Things (chart: competitive advantage against degree of intelligence) • Access & reporting: Standard reports (What happened?); Ad hoc reports (How many, how often, where?); Query/drill down (Where exactly is the problem?); Alerts (What actions are needed?) • Predictive Analytics: Statistical analysis (Why is this happening?); Forecasting/extrapolation (What if these trends continue?); Predictive modelling (What will happen next?); Optimization (What’s the best that can happen?)

  24. For this course we will concentrate on Statistical Analysis (chart: competitive advantage against degree of intelligence) • Access & reporting: Standard reports (What happened?); Ad hoc reports (How many, how often, where?); Query/drill down (Where exactly is the problem?); Alerts (What actions are needed?) • Predictive Analytics: Statistical analysis (Why is this happening?); Forecasting/extrapolation (What if these trends continue?); Predictive modelling (What will happen next?); Optimization (What’s the best that can happen?)

  25. CRISP-DM Evolution • Over 200 members of the CRISP-DM SIG worldwide • DM Vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, etc • System Suppliers/Consultants: Cap Gemini, ICL Retail, Deloitte & Touche, etc • End Users: BT, ABB, Lloyds Bank, AirTouch, Experian, etc • Crisp-DM 2.0 is due… • Complete information on CRISP-DM is available at: http://www.crisp-dm.org/

  26. CRISP-DM • Features of CRISP-DM: • Non-proprietary • Application/Industry neutral • Tool neutral • Focus on business issues • As well as technical analysis • Framework for guidance • Experience base • Templates for Analysis

  27. CRISP-DM cycle (diagram, with Data at the centre) • Business Understanding • Data Understanding • Data Preparation • Modelling • Evaluation • Deployment

  28. Phases & Generic Tasks • Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment • Business Understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. • Generic tasks: Determine Business Objectives, Assess Situation, Determine Data Mining Goals, Produce Project Plan

  29. Phases & Generic Tasks • Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment • Data Understanding: The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information. • Generic tasks: Collect Initial Data, Describe Data, Explore Data, Verify Data Quality

  30. Phases & Generic Tasks • Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment • Data Preparation: The data preparation phase covers all activities to construct the data that will be fed into the modelling tools from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modelling tools. • Generic tasks: Select Data, Clean Data, Construct Data, Integrate Data, Format Data

  31. Phases & Generic Tasks • Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment • Modelling: In this phase, various modelling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary. • Generic tasks: Select Modeling Technique, Generate Test Design, Build Model, Assess Model

  32. Phases & Generic Tasks • Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment • Evaluation: Before proceeding to final deployment of a model, it is important to thoroughly evaluate it and review the steps executed to construct it to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached. • Generic tasks: Evaluate Results, Review Process, Determine Next Steps

  33. Phases & Generic Tasks • Phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment • Deployment: Creation of a model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. • Generic tasks: Plan Deployment, Plan Monitoring & Maintenance, Produce Final Report, Review Project

  34. CRISP-DM • Business Understanding • Data Understanding • Data Preparation • Modelling • Evaluation • Deployment

  35. CRISP-DM: Areas covered in this course • Business Understanding • Data Understanding • Data Preparation • Modelling • Evaluation • Deployment

  36. Section 2: Descriptive Statistics & Distributions

  37. Topics • Introduction to Statistics • The Basics • Measures of location: Mean, Median & Mode. • Measures of location & Skew. • Measures of dispersion: range, standard deviation (variance) & interquartile range.

  38. Introduction to Statistics • According to The Random House College Dictionary, statistics is “the science that deals with the collection, classification, analysis and interpretation of numerical facts or data.” In short, statistics is the science of data. • There are two main branches of Statistics: • The branch of statistics devoted to the organisation, summarization and the description of data sets is called Descriptive Statistics. • The branch of statistics concerned with using sample data to make an inference about a large set of data is called Inferential Statistics.

  39. Process of Data Analysis • A statistical population is a data set that is our target of interest. • A sample is a subset of data selected from the target population. • If your sample is not representative of the population, it is referred to as being biased. • Diagram labels: Describe, Make Inference
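The difference between a representative and a biased sample can be sketched in Python. The population here is made up (uniformly random ages), and the course itself uses SAS Enterprise Guide, so treat this only as an illustration of the idea under those assumptions.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# HYPOTHETICAL population: 10,000 made-up customer ages between 18 and 90.
population = [random.randint(18, 90) for _ in range(10_000)]

# Representative sample: every member has the same chance of selection.
random_sample = random.sample(population, 500)

# Biased sample: only customers under 30 are surveyed.
biased_sample = [age for age in population if age < 30][:500]

print(statistics.mean(population))
print(statistics.mean(random_sample))  # close to the population mean
print(statistics.mean(biased_sample))  # well below the population mean
```

Any inference about the whole population drawn from the second sample would be systematically too low, however large the sample is.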

  40. Types of Data: Numeric Data • Numeric data can be of two types: • Continuous Data: Data is continuous if it has an interval of real numbers for its range • The number of centimetres of rain that fell in March • Discrete Data: Data is defined as discrete if it has a finite (or countably infinite) range • The number of correct answers in a 10 question quiz

  41. Types of Data: Categorical Data • Data that is broken into discrete categories is referred to as categorical data • Categorical data has two main types: • Nominal: A nominal variable has a discrete number of categories or levels with no logical order • Gender: Male, Female • Working Status: Employed, Unemployed, Home-maker, Student, Retired • Ordinal: An ordinal variable has a discrete number of categories or levels with a logical order • Income Level: Low, Medium, High • Places in a race: 1st, 2nd, 3rd, 4th, 5th, 6th

  42. Class Task • Task: Classify the type of data in each of the following examples: • The profit margin made from customers of an online clothing company • The type of interest rate you can be charged on a mortgage, i.e. Fixed rate, Adjustable rate • Number of dependents associated with a loan applicant

  43. Let’s Start at the Very Beginning • When learning to read and write we start with A-B-C, when starting to count we start with 1-2-3 and of course the von Trapp family singers started with Do-Re-Mi! • When learning statistics you start with the arithmetic mean or a simple average

  44. The Arithmetic Mean • The table below shows the total medals won and gold medals won by each country in the last 5 Olympic games • * Germany combines East and West Germany prior to reunification • ** Russia or The Soviet Union • Data source: http://www.databaseolympics.com/index.htm

  45. Arithmetic Mean – The Formula • The formula for calculating the sample arithmetic mean of n data points $x_1, x_2, \ldots, x_n$ is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • $\bar{x}$ is referred to as x-bar
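As a minimal sketch, the sample mean is the sum of the observations divided by their count. The helper name `arithmetic_mean` and the data values are hypothetical, chosen only for illustration.

```python
import statistics


def arithmetic_mean(values):
    """Return the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / len(values)


# HYPOTHETICAL observations, not the figures from the slide's table.
observations = [110, 97, 103, 88, 94]

print(arithmetic_mean(observations))    # 98.4
print(statistics.mean(observations))    # same value from the standard library
```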

  46. Attributes of the Arithmetic Mean • It is straightforward to calculate • It is easy to interpret • It gives us a good estimate of where a set of numbers is centred • This is referred to as the central tendency of a sample • It is sensitive to outliers
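The sensitivity to outliers can be seen in a small Python sketch. The salary figures are invented for illustration. Adding one extreme value moves the mean a long way while the median barely changes.

```python
import statistics

# HYPOTHETICAL salaries in thousands of euro.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [500]  # one extreme value added

print(statistics.mean(salaries), statistics.median(salaries))          # 35 35
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 112.5 36.5
```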

  47. Other Measures of Central Tendency • Median: The middle value of an ordered set of values, i.e. 50% higher and 50% lower • Mode: The most commonly occurring value in a distribution
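Both measures can be sketched with Python's standard `statistics` module. The values below are hypothetical and are not the data behind the slides.

```python
import statistics

# HYPOTHETICAL observations, for illustration only.
values = [7, 3, 9, 3, 5, 8]

ordered = sorted(values)            # [3, 3, 5, 7, 8, 9]
median = statistics.median(values)  # even count: average of the middle pair 5 and 7
mode = statistics.mode(values)      # 3 occurs twice, more often than any other value

print(ordered, median, mode)
```

With an even number of observations the median is the average of the two middle values, which is why it need not be one of the observed values.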

  48. Calculating the Median • Sort the data • Median = 97.5

  49. Calculating the Mode • Count frequencies • Mode = 94

  50. When to Use Each Central Tendency Value? • Question: When and why would you use the median over the mean?
