
Introduction to Statistical Learning and Practical Applications


rmichelle

Presentation Transcript


  1. 1 - Machine Learning, 19 August 2015 • Introductions and syllabus • Introduction to statistical learning problems • Lab: RStudio, R basics, data frames & basic plots

  2. Brief Introduction • Lecturer: Hapnes Toba (8th floor, GWM) • Contact: hapnestoba@gmail.com / 0818.0905.7037 • Research: • Information retrieval • Text analysis • Information extraction • Machine learning • Data mining

  3. Class Schedule • Theory: Wednesday, 7:30-09:45 (15-minute break) • Lab: Wednesday, 10:00-12:00 • Room: Lab. Adv. 3 • Agreements: • Be on time (15-minute tolerance) • Gadgets off: in an emergency, please step out of the classroom first • Please make an appointment if you want to discuss something outside class sessions • Academic honesty: please avoid copy-pasting, cheating, and the like

  4. Assessment and Syllabus (handed out) • Theory • Lab: R statistical language, case-based data analysis (groups of 2-3), bonus points: Java / C# (R interfacing)

  5. Miscellaneous • Each group (2-3 people) must create a dedicated cloud folder (Google Drive, Microsoft OneDrive, GitHub or another service) for the assignments given. For every submission, the group must share the link to that folder with the lecturer. • Assignments and course materials are available at http://sitoba.itmaranatha.org • Regarding assignment deadlines, please follow the announced schedule. Late submissions are penalised by 25% of the grade per day late (4 days late means a grade of 0). • Copying answers between groups, in part or in whole, results in a grade of 0 for that assignment for every group involved. • Quizzes may cover theory or programming, and will be announced in class.

  6. Textbook and References • An Introduction to Statistical Learning: With Applications in R (ISLR). (2014). James, G., Witten, D., Hastie, T., & Tibshirani, R. Springer. ISBN: 978-1-4614-7137-0. Website: www.StatLearning.com

  7. Statistics in the news

  8. Statistics in the news

  9. Statistics in the news

  10. Statistical Learning Problems: Identify the risk factors for prostate cancer

  11. Classify a recorded phoneme based on a log-periodogram

  12. Predict whether someone will have a heart attack on the basis of demographic, diet and clinical measurements

  13. Customize an email spam detection system

  14. Identify the numbers in a handwritten zip code

  15. Classify a tissue sample into one of several cancer classes, based on a gene expression profile.

  16. Establish the relationship between salary and demographic variables in population survey data

  17. Classify the pixels in a LANDSAT image, by usage

  18. The Supervised Learning Problem • Outcome measurement Y (also called dependent variable, response, target). • Vector of p predictor measurements X (also called inputs, regressors, covariates, features, independent variables). • In the regression problem, Y is quantitative (e.g. price, blood pressure). • In the classification problem, Y takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample). • We have training data (x1, y1), …, (xN, yN). These are observations (examples, instances) of these measurements.
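The regression setting on this slide can be sketched in a few lines of R. This is only an illustration under stated assumptions: the training data are simulated, and the names `train`, `fit`, `test` and `pred` are made up for the example rather than taken from the course material.

```r
# Simulated training data (illustrative only): N observations of one
# predictor x and a quantitative outcome y.
set.seed(1)
n <- 100
train <- data.frame(x = runif(n, 0, 10))
train$y <- 2 + 0.5 * train$x + rnorm(n, sd = 1)

# Learn the relationship between x and y from the training data.
fit <- lm(y ~ x, data = train)

# Predict the outcome for unseen test cases.
test <- data.frame(x = c(2.5, 7.5))
pred <- predict(fit, newdata = test)
print(pred)
```

In a classification problem, y would instead be a categorical variable, and a classifier would replace `lm()`.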

  19. Objectives • On the basis of the training data we would like to: • Accurately predict unseen test cases. • Understand which inputs affect the outcome, and how. • Assess the quality of our predictions and inferences.

  20. Philosophy • It is important to understand the ideas behind the various techniques, in order to know how and when to use them. • One has to understand the simpler methods first, in order to grasp the more sophisticated ones. • It is important to accurately assess the performance of a method, to know how well or how badly it is working [simpler methods often perform as well as fancier ones!] • This is an exciting research area, having important applications in science, industry and finance. • Statistical learning is a fundamental ingredient in the training of a modern data scientist.

  21. Unsupervised learning • No outcome variable, just a set of predictors (features) measured on a set of samples. • The objective is fuzzier: • find groups of samples that behave similarly, • find features that behave similarly, • find linear combinations of features with the most variation. • It is difficult to know how well you are doing. • Different from supervised learning, but can be useful as a pre-processing step for supervised learning.
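Two of the objectives above can be tried directly in base R. The sketch below is not part of the original slides: it uses the built-in `iris` data, assumes three groups for k-means, and picks `kmeans()` and `prcomp()` as one possible technique for each objective.

```r
# Built-in iris data: keep only the four numeric measurements,
# so there is no outcome variable.
features <- iris[, 1:4]

# Find groups of samples that behave similarly
# (k-means clustering; k = 3 is an assumption made for this example).
set.seed(1)
groups <- kmeans(features, centers = 3, nstart = 10)
print(table(groups$cluster))

# Find linear combinations of features with the most variation
# (principal component analysis on standardised features).
pca <- prcomp(features, scale. = TRUE)
print(summary(pca))
```

Because there is no outcome to compare against, the cluster sizes and the proportion of variance explained are descriptive summaries, not accuracy measures.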

  22. Statistical Learning versus Machine Learning • Machine learning arose as a subfield of Artificial Intelligence. • Statistical learning arose as a subfield of Statistics. • There is much overlap - both fields focus on supervised and unsupervised problems: • Machine learning has a greater emphasis on large scale applications and prediction accuracy. • Statistical learning emphasizes models and their interpretability, and precision and uncertainty. • But the distinction has become more and more blurred, and there is a great deal of "cross-fertilization". • Machine learning has the upper hand in Marketing!

  23. ISLR Premises • Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences. • Statistical learning should not be viewed as a series of black boxes. • While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box! • We presume that the reader is interested in applying statistical learning methods to real-world problems

  24. Notation and Simple Linear Algebra

  25. ISLR and MASS Library / Package

  26. R Introduction • R is the GNU implementation of S. S is a language for statisticians developed at Bell Laboratories by John Chambers et al. • R was created by Ross Ihaka and Robert Gentleman; it is now developed by the R Core Team and supported by the R Foundation. • R is a language and environment for statistical computing and graphics • R is a de facto standard for developing statistical software • R implements a variety of statistical and graphical techniques (linear and nonlinear modeling, statistical tests, time series analysis, classification, clustering, ...)

  27. R Provides • Effective data handling and storage • Operators for calculations on arrays (matrices) • A large, coherent, integrated collection of intermediate tools for data analysis • Graphical facilities for data analysis and display • Simple and effective programming language (conditionals, loops, user-defined recursive functions) • Extension mechanism with a large collection of packages

  28. Why R • R is open source and free to use (http://cran.r-project.org/) • R has a large and active community • R provides state-of-the-art algorithms (> 3000 extension packages on CRAN, 2011) • R creates beautiful visualizations (as seen in the New York Times and The Economist) • R is used widely in industry (Revolution offers commercial solutions) • R can be easily parallelized • R is getting ready for big data (Revolution Analytics)

  29. RStudio (http://www.rstudio.com/) • Panes shown in the screenshot: Environment / Workspace • Console • Working dir / plots / packages / help / viewer

  30. R Objects • R handles everything as objects. You can store any numbers or characters in an R object that you name. For example, > a <- 5 • We now have an object called a which contains the number 5. We can display the content of an object simply by typing its name: > a • If we want to use characters, they should be specified using either single or double quotes. For example, > myname <- "Tom.Smith" • Note that R is case sensitive, and you cannot use spaces or special characters in object names. When necessary, use . (period) to connect words. • R stores all objects you create in the workspace. You can remove R objects by using rm(). > rm(myname) • Now, if we call myname, we will get an error message. > myname
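The steps on this slide can be run as a short script. This is a minimal sketch: `exists()` is used here, as an addition to the slide, to check whether an object is in the workspace without triggering the error.

```r
a <- 5                    # store the number 5 in an object called a
print(a)                  # display its content

myname <- "Tom.Smith"     # character values go in straight quotes
print(exists("myname"))   # check that the object is in the workspace

rm(myname)                # remove the object
print(exists("myname"))   # check again; typing myname now would raise an error
```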

  31. R Vectors • You can assign a set of numbers or characters (i.e. a vector) to an R object, using the concatenate command, c(). > b <- c(-1, 0, 1.5, 30, -12) > my.team <- c("Laura", "Bob", "Megan") • As before, we can get the content of the object by typing its name b. We can also extract individual elements of a vector, using square brackets to indicate the index that we want. So, x[i] returns the i-th element of vector x: > b[3] > my.team[2] • Note that numbers and characters cannot be mixed in the same vector; if they are, R converts all elements to characters. • If you want to create a numeric vector with a sequence of numbers, use : (colon). > c <- 1:5
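A short sketch of indexing, coercion and sequences. The names `mixed` and `s` are illustrative; `s` is used instead of `c` only to avoid confusion with the c() function.

```r
b <- c(-1, 0, 1.5, 30, -12)
print(b[3])                # third element of b

# Mixing numbers and characters: every element becomes a character string.
mixed <- c(1, "two", 3)
print(class(mixed))

# A sequence of integers from 1 to 5.
s <- 1:5
print(s)
```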

  32. Data Frames • R treats data sets as data frames. • A data frame is like a simple spreadsheet in Excel. • Each column consists of the same object type (numeric or character), with the same number of rows.
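A small data frame can also be built directly in the console with `data.frame()`. The column names and values below are hypothetical stand-ins, not the table used in class.

```r
# Hypothetical example data: one row per student.
mydata <- data.frame(
  sex    = c("M", "F", "F", "M", "F"),
  score  = c(70, 85, 90, 65, 78),
  height = c(172.0, 160.5, 158.2, 180.1, 165.4),
  stringsAsFactors = FALSE   # keep text columns as character
)

# Each column has a single type and the same number of rows.
str(mydata)
```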

  33. Importing Data (1) • Before importing data into the workspace, let's remove all objects in the workspace. • To list all objects, type ls(). • To remove all objects, type rm(list = ls()). • Even after you have removed all objects, the R console still remembers all commands you typed during the session until you shut down R. Press the up arrow key to see the previous commands.

  34. Importing Data (2) • First, let's create a data file. You can type in data and create data frames in the R console, but it is much easier to enter data in Excel, save it as a comma-separated file (.csv), and then load it in the R console. • Base R reads csv files as well as simple text files (.txt), but it does not read Excel files (.xls, .xlsx) directly. • Let's create the following table in Excel (including column names) and save it as a csv file (you can name it whatever you want).

  35. Create Data *.csv

  36. Import Data • To import the csv file you have just created into the workspace, type: > mydata <- read.csv("directory/to/your/file", header=T) • Or, if you want to select your file by using Windows Explorer (on Windows) or Finder (on Mac), > mydata <- read.csv(file.choose(), header=T) • Now the contents of your file are stored in an object named mydata. The header argument (T/F, or TRUE/FALSE) indicates whether the first line of the file contains the column names. The default in read.csv() is header=T, so it can be left out in that case.
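To try the import step without Excel, the round trip can be sketched with a temporary file. The file path comes from `tempfile()` and the two-row table is hypothetical.

```r
# Hypothetical table, written to a temporary csv file.
example <- data.frame(sex = c("M", "F"), score = c(70, 85))
path <- tempfile(fileext = ".csv")
write.csv(example, path, row.names = FALSE)

# Read it back; the first line of the file holds the column names.
mydata <- read.csv(path, header = TRUE)
print(mydata)
```

Spelling out `TRUE` rather than `T` is a little safer, because `T` is an ordinary variable that can be overwritten.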

  37. Data Frame Basics (1) • We can select a certain part of a data frame. For example, type: > mydata$score > mydata[1,2] > mydata[1:3,2] > mydata[,3] • $ (dollar) specifies a component (column name) of the data frame. • [1,2] specifies the element in the 1st row and 2nd column. • Similarly, [1:3,2] specifies the elements from the 1st to the 3rd row in the 2nd column, and [,3] specifies all rows in the 3rd column.

  38. Data Frame Basics (2) • You can extract subsets of data frames that meet conditions using logical operators. > subdata <- mydata[mydata$sex == "M", ] > subdata <- mydata[mydata$height > 170, ] • Here, we gave a new object name to the subset of the data, so that it can be used as an object. • To specify the condition, we use logical operators, such as == (equal to), < (less than), <= (less than or equal to), > (greater than), >= (greater than or equal to), != (not equal to). • You can delete rows and columns from a data frame. You can overwrite the same data frame by assigning the same name, but it is wiser to create a new data frame when you modify elements. > subdata <- mydata[-2,] # delete the 2nd row
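The subsetting commands can be tried on a small data frame. The values are hypothetical, and combining two conditions with `&` is an addition not shown on the slide.

```r
# Hypothetical data for illustration.
mydata <- data.frame(
  sex    = c("M", "F", "F", "M", "F"),
  height = c(172.0, 160.5, 175.3, 168.0, 165.4),
  stringsAsFactors = FALSE
)

males      <- mydata[mydata$sex == "M", ]     # rows where sex is "M"
tall       <- mydata[mydata$height > 170, ]   # rows with height above 170
tall_males <- mydata[mydata$sex == "M" & mydata$height > 170, ]  # both conditions
no_second  <- mydata[-2, ]                    # every row except the 2nd

print(tall_males)
```

The comma after the condition matters: the condition selects rows, and the empty slot after the comma keeps all columns.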

  39. Data Frame Basics (3) • You can change elements in data frames. For example, > newdata <- mydata > newdata[5,3] <- 153.1 • will overwrite the element in the 5th row, 3rd column of data frame newdata. • In the same way, you can add a column to an existing data frame, using $ (dollar). > newdata <- mydata > newdata$age <- c(19, 20, 20, 18, 25)
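A sketch of modifying a copy while leaving the original untouched. The data are hypothetical, and appending a row with `rbind()` is shown as one possible approach that the slide does not cover.

```r
# Hypothetical two-row data frame.
mydata <- data.frame(sex = c("M", "F"), height = c(172.0, 160.5),
                     stringsAsFactors = FALSE)

newdata <- mydata             # work on a copy
newdata[2, 2] <- 161.0        # overwrite the element in row 2, column 2
newdata$age <- c(19, 20)      # add a column with $

# Append a row: the new row must have the same column names.
newrow <- data.frame(sex = "F", height = 158.2, age = 18,
                     stringsAsFactors = FALSE)
newdata <- rbind(newdata, newrow)

print(newdata)
print(mydata)                 # the original is unchanged
```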

  40. First look at data • This data set contains morphological measurements on 200 purple rock crabs (Campbell & Mahon 1974, Aus J Zool 22; 417-). The data contain the following columns: • colour: body colour types ("blue" or "orange") • sex: male ("M") or female ("F") • FL: frontal lobe length (mm) • RW: rear width (mm) • CL: carapace length (mm) • CW: carapace width (mm) • BD: body depth (mm)

  41. Try and record your results • mydata <- read.csv(file.choose(), header=T) • head(mydata) • names(mydata) • nrow(mydata) • ncol(mydata) • str(mydata) • summary(mydata) • mean(mydata$FL) • var(mydata$FL) • sd(mydata$FL) • length(mydata$FL) • quantile(mydata$FL, c(0.05, 0.95)) • Also try: sum() • max() • min() • median() • range() • If you don't know how to use a function, for example mean(), you can call help by typing help(mean), or simply ?mean.
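The summary functions can be tried first on a short vector. The five values in `fl` are made up and merely stand in for the FL column.

```r
# Hypothetical measurements standing in for mydata$FL.
fl <- c(8.1, 9.8, 10.3, 12.6, 15.0)

print(mean(fl))                      # arithmetic mean
print(var(fl))                       # sample variance
print(sd(fl))                        # standard deviation: square root of the variance
print(length(fl))                    # number of observations
print(quantile(fl, c(0.05, 0.95)))   # 5th and 95th percentiles
print(range(fl))                     # minimum and maximum
```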

  42. Histogram • Visualising data is the best way to start investigating them. • First, let’s see the distribution of Frontal Lobe Lengths by creating a histogram. > hist(mydata$FL) • R automatically chooses the number of intervals (bins), but you can change it. > hist(mydata$FL, breaks=50) • You can plot a smooth distribution of the variable (Kernel density estimates) using a function density(). > plot(density(mydata$FL))
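When `breaks` is a single number, R treats it as a suggestion and rounds the breakpoints to "pretty" values, so the actual number of bins may differ from the request. The sketch below inspects this on simulated values standing in for FL, with plotting switched off so that only the computed bins are returned.

```r
# Simulated values standing in for mydata$FL (illustrative only).
set.seed(1)
x <- rnorm(200, mean = 15, sd = 3)

# Compute the histogram without drawing it.
h <- hist(x, breaks = 20, plot = FALSE)
print(length(h$counts))   # number of bins actually used
print(h$breaks)           # breakpoints chosen by R

# Kernel density estimate, evaluated on a grid of points.
d <- density(x)
print(d$bw)               # bandwidth selected automatically
```

Calling `plot(h)` or `plot(d)` afterwards draws the corresponding figure.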

  43. Box plot • Now let's see the Rear Widths of male and female crabs using a box plot. > boxplot(RW ~ sex, data = mydata) • The box is constructed using the interquartile range (25-75%), and the thick line in the middle is the median value. • By default, each whisker extends to the most extreme value lying within 1.5 times the interquartile range from the box, and open circles indicate extreme values beyond the whiskers (no extreme values in this data set).
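The numbers behind a box plot can be inspected with `boxplot.stats()`, which boxplot() uses internally. The vector `rw` is hypothetical and includes one deliberately extreme value.

```r
# Hypothetical rear widths (mm); the last value is deliberately extreme.
rw <- c(6.5, 7.0, 7.4, 8.1, 8.8, 9.3, 10.2, 19.5)

stats <- boxplot.stats(rw)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
print(stats$stats)

# Values beyond the whiskers, drawn as open circles in the plot.
print(stats$out)
```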

  44. Scatter Plot • A scatter plot is a convenient way to see the relationship between two variables. • For example, to plot the Carapace Width on the x-axis and the corresponding Body Depth on the y-axis, type > plot(BD ~ CW, data = mydata)

  45. Changing appearances • You can change appearances of the graph by adding arguments in a function plot(). • Here are some useful arguments. • xlim, ylim: range of the x-axis and the y-axis • xlab, ylab: labels for the x-axis and the y-axis • main: main title of the graph • pch: symbol type. See ?points. • cex: symbol size • col: symbol colour by either colour name or index. See colors().

  46. Plot Function > plot(BD ~ CW, data= mydata, main= "Purple rock crabs", xlim=c(20,40), xlab = "Carapace width (mm)", ylab= "Length (mm)", pch=19, col= "blue", cex=0.7) • You can add points and a legend to the existing plot. > points(RW ~ CW, data=mydata, pch=24, col="red", cex=0.7) > legend(20, 20, legend=c("Body depth", "Rear width"), pch=c(19, 24), col=c("blue","red")) • The first two arguments of legend() are the x and y positions of the legend.
