
Big Data Technologies Lecture 1: Big Data & Big Data Analysis

Big Data Technologies Lecture 1: Big Data & Big Data Analysis. Assoc. Prof. Marc FRÎNCU, PhD. Habil. marc.frincu@e-uvt.ro. Lecture Structure. 1 lecture hour + 2 lab hours / week. Aim of this lecture? Importance of Big Data analysis. Impact of Big Data in science and technology.


Presentation Transcript


  1. Big Data Technologies Lecture 1: Big Data & Big Data Analysis Assoc. Prof. Marc FRÎNCU, PhD. Habil. marc.frincu@e-uvt.ro

  2. Lecture Structure • 1 lecture hour + 2 lab hours / week • Aim of this lecture? • Importance of Big Data analysis • Impact of Big Data in science and technology • Parallel and distributed architectures • Parallelization of algorithms • Importance of hardware and data structure in the design of Big Data processing algorithms • Analysis of independent, dependent and streaming data (homogeneous and heterogeneous) • In practice (lab) • Using Google Cloud to run parallel and distributed applications • Parallelizing basic sequential algorithms • Design, test, evaluation

  3. Minimal Requirements • Passing grade (5) • 1 parallel algorithm implemented (in a single technology) and evaluated • One scientific presentation (10-minute presentation + 2 questions) about a scientific paper (published or tech report) with a focus on Big Data, Cloud Computing, Bioinformatics or Security in Big Data • Maximum grade (10) • All algorithms given during lab hours should be implemented, and a final technical report should be presented • One scientific presentation (10-minute presentation + 2 questions) about a top scientific paper (IPDPS, Supercomputing, Europar, CCGrid, ICDCS, IEEE Trans. PDC, IEEE Trans. Computing, FGCS, TPDS) with a focus on Big Data, Cloud Computing, Bioinformatics or Security in Big Data

  4. A world increasingly connected Je Suis Charlie: 6500 retweets per minute

  5. A world increasingly interconnected Cyberphysical systems: IT + communication + intelligence

  6. Knowledge = Power = data Data: decision  control autonomy intelligence

  7. What is Big Data? • Oxford English Dictionary (OED) • data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges • Wikipedia • an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications • datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze • “The ability of society to harness information in novel ways to produce useful insights or goods and services of significant value” and “…things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value” • The broad range of new and massive data types that have appeared over the last decade or so • The new tools helping us find relevant data and analyze its implications • The convergence of enterprise and consumer IT • The shift (for enterprises) from processing internal data to mining external data • The shift (for individuals) from consuming data to creating data • The merger of Madame Olympe Maxime and Lieutenant Commander Data • The belief that the more data you have the more insights and answers will rise automatically from the pool of ones and zeros • A new attitude by businesses, non-profits, government agencies, and individuals that combining data from multiple sources could lead to better decisions. https://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/#66e783be13ae

  8. What is Big Data? • Volume • Velocity • Variety • ...

  9. What is Big Data? Furthermore, Big Data means: • Using multiple data sources • Data ambiguities and human/machine errors Big Data != Better Data Unprocessed data has no meaning! Data analysis increases their value → information!

  10. Big Data in numbers

  11. Big Data in numbers

  12. Big Data in numbers

  13. Big Data in the current global context

  14. Why now? • “We could have gotten started a lot earlier. We simply weren’t stepping back and looking at how to use the data” – Brad Smith, Intuit • Data are too precious to be erased!

  15. What do we do with the data? Ethics! • Private data • Sensitive data

  16. Information extraction • Exploratory • Theory based on observing phenomena • Constructive • Theory based on axioms and theorems

  17. The 4th paradigm • Big Data + analysis • Prediction of the future • Analysis • Follows an exploratory path and studies data • Infers knowledge based on statistics or machine learning techniques • Constructs models and validates them based on data

  18. Data analysis • The process of studying various types of data to identify previously unknown correlations and to extract other useful information • Based on data mining Data flow

  19. Types of data analytics • Descriptive • What has happened? • Diagnostic • Why did it happen? • Predictive • What will happen? • Prescriptive • What should we do with the data? Level of understanding data & its value

  20. A few examples • Medical monitoring of children to alert doctors and parents when an intervention is needed • Predicting the status of industrial machinery • Preventing traffic jams, saving fuel, cutting down pollution

  21. Data value

  22. Data analysis flow • Data acquisition • Data cleaning, annotation and extraction • Missing values, outliers, duplicates • Between 50-70% of the effort is put here! • Heterogeneous data integration and representation in a common format • Data analysis • Automated and visual interpretation of results • People often see patterns that algorithms fail to identify! • Decision making
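
The cleaning step on slide 22 (missing values, outliers, duplicates) can be sketched in plain Python. This is only an illustration under stated assumptions, not the lecture's method: the (sensor_id, value) record layout, the function name `clean_readings`, and the median-absolute-deviation outlier rule with a threshold of 3.5 are all hypothetical choices made for this example.

```python
from statistics import median

def clean_readings(readings, threshold=3.5):
    """Drop missing values, exact duplicates, and MAD-based outliers.

    `readings` is a list of (sensor_id, value) tuples; value may be None.
    The record layout and the threshold are assumptions for illustration.
    """
    # 1. Missing values: discard records without a measurement.
    present = [r for r in readings if r[1] is not None]

    # 2. Duplicates: keep only the first occurrence of each identical record.
    seen = set()
    unique = []
    for r in present:
        if r not in seen:
            seen.add(r)
            unique.append(r)

    # 3. Outliers: modified z-score built on the median absolute deviation.
    values = [v for _, v in unique]
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return unique  # no spread to judge outliers against
    return [r for r in unique if 0.6745 * abs(r[1] - med) / mad <= threshold]

raw = [("a", 10.0), ("b", None), ("a", 10.0), ("c", 11.0),
       ("d", 9.0), ("e", 10.5), ("f", 500.0)]
cleaned = clean_readings(raw)
```

On this made-up input the record with a missing value, the repeated ("a", 10.0) record, and the 500.0 reading are dropped. A median-based rule is used here because a single extreme value shifts the mean and standard deviation far more than it shifts the median.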

  23. Big Data roles • Data scientist • Data science = systematic method dedicated to uncovering knowledge through data analysis • In business • Process optimizations for increased efficiency • In science • Observed/experimental data analysis with the aim of drawing a conclusion • Requirements • Statistics • Java, Python, R, .... • Domain knowledge • Data engineer • Data engineering = field that designs, implements and offers systems for Big Data analysis • Builds scalable and modular platforms for data scientists • Installs Big Data solutions • Requirements • Databases, software engineering, parallel and cloud processing, real-time processing • C++, Java, Python • Understanding performance factors and the limitations of the systems/algorithms

  24. Areas of interest

  25. Data vs. processing speed • Data • Annotated: L • Unannotated: U • Learning algorithm: Φ • f = Φ(L + U) • Minimize error function • Avoid overtraining • Results: • Scalability: • Supervised learning: f = Φ(L) • Training data is large but insufficient!!! • Semi-supervised learning: f = Φ(L* + U) • L* most relevant training data • L* + U is large • Unsupervised learning: f = Φ(U) • Nearest Neighbor, convolutional neural networks, unrestricted Boltzmann machines, Deep Learning
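
To make the notation on slide 25 concrete, here is a minimal sketch of the semi-supervised case f = Φ(L* + U). The slide does not name an algorithm, so this example swaps in self-training with a 1-nearest-neighbour rule; the one-dimensional data, the function names, and the `max_dist` cut-off are assumptions for illustration only.

```python
def nearest_label(labeled, x):
    """Supervised rule f = Phi(L): label of the closest labeled point."""
    return min(labeled, key=lambda p: abs(p[0] - x))[1]

def self_train(labeled, unlabeled, max_dist=1.0):
    """Semi-supervised sketch: grow L with points from U.

    An unlabeled point receives the label of its nearest labeled
    neighbour (a pseudo-label) when that neighbour lies within max_dist.
    The loop repeats until no further point can be added.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    changed = True
    while changed and pool:
        changed = False
        for x in list(pool):
            nearest = min(labeled, key=lambda p: abs(p[0] - x))
            if abs(nearest[0] - x) <= max_dist:
                labeled.append((x, nearest[1]))  # pseudo-labeled point
                pool.remove(x)
                changed = True
    return labeled

L = [(0.0, "low"), (10.0, "high")]   # annotated data (toy values)
U = [1.0, 2.0, 3.0, 4.0, 5.0]        # unannotated data (toy values)

supervised_guess = nearest_label(L, 5.2)   # uses L only
model = self_train(L, U)
semi_guess = nearest_label(model, 5.2)     # uses L plus pseudo-labeled U
```

With L alone the query 5.2 is closer to the "high" example, but the unannotated points show that the "low" cluster extends up to 5.0, so the self-trained model labels it "low". This is the sense in which U can compensate for insufficient annotated data, although pseudo-labels can also propagate mistakes.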

  26. Practical example: Classification in DNA microarray studies • Classifying and predicting the diagnosis based on the gene profile • Measuring gene expression on a sample of 4,026 genes from 59 patients (39 used for training) exhibiting lymphoma and divided into 3 classes based on the type of lymphoma • Problem • Few classes, hard-to-classify data (volume) • Algorithm • Find the centroid (mean expression of each gene) of each lymphoma • Find the genes that belong to it http://statweb.stanford.edu/~tibs/ftp/ncshrink2.pdf
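
The centroid step on slide 26 can be sketched as a plain nearest-centroid classifier. This is a simplified illustration, not the method of the linked paper, which additionally shrinks the centroids to select genes; that step is omitted here. The function names and the tiny two-gene data set are made up for the example.

```python
def class_centroids(samples, labels):
    """Return {label: centroid}; a centroid is the per-gene mean expression."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: dist2(centroids[y]))

# Toy expression profiles: 4 patients, 2 genes, 2 classes (made-up values).
samples = [[1.0, 0.0], [3.0, 0.0], [0.0, 5.0], [0.0, 7.0]]
labels = ["A", "A", "B", "B"]

centroids = class_centroids(samples, labels)
first = predict(centroids, [2.5, 1.0])
second = predict(centroids, [0.5, 5.0])
```

Here the centroids are [2.0, 0.0] for class A and [0.0, 6.0] for class B, so the first query is assigned to A and the second to B. In the study's setting each sample would hold 4,026 gene values rather than two.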

  27. Useful links • http://www.comp.nus.edu.sg/~tankl/cs5344/slides/2016/intro.pdf • http://infolab.stanford.edu/~echang/BigDat2015/BigDat2015-Lecture1-Edward-Chang.pdf • https://wr.informatik.uni-hamburg.de/_media/teaching/wintersemester_2015_2016/bd-1516-einfuehrung.pdf • https://www.ee.columbia.edu/~cylin/course/bigdata/EECS6893-BigDataAnalytics-Lecture1.pdf

  28. Next lecture • Parallel and distributed architectures • Parallel systems • Shared memory • Distributed memory • Distributed systems • Cloud computing
