Adopting Big-Data Computing Across the Undergraduate Curriculum



  1. Adopting Big-Data Computing Across the Undergraduate Curriculum
     Bina Ramamurthy (Bina)
     bina@buffalo.edu
     http://www.cse.buffalo.edu/faculty/bina
     This talk is partially funded by NSF grant NSF-TUES-0920335 and by an AWS in Education Coursework Grant award.
     Symposium on Big Data Science and Engineering

  2. Outline of the talk
     • Golden era in computing
     • Big Data computing curriculum
     • Data-intensive/Big Data Computing Certificate program at University at Buffalo
     • Outcome evaluation
     • Important findings
     • Recommendations for adoption into the undergraduate curriculum
     • Demos
     • Useful links and project web page
     • Questions and answers

  3. A Golden Era in Computing

  4. Top Ten Largest Databases
     Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world/

  5. Top Ten Largest Databases in 2007 vs. Facebook's cluster in 2010
     [Chart: Facebook's cluster in 2010, 21 petabytes]
     Ref: http://www.comparebusinessproducts.com/fyi/10-largest-databases-in-the-world

  6. Data Deluge: smallest to largest
     • Bioinformatics data: from about 3.3 billion base pairs in a human genome to a huge number of protein sequences and the analysis of their behaviors
     • The internet: web logs, Facebook, Twitter, maps, blogs, etc.: analytics …
     • Financial applications that analyze volumes of data for trends and other deeper knowledge
     • Health care: huge amounts of patient data, drug and treatment data
     • The universe: the Hubble Ultra Deep Field shows thousands of galaxies, each with billions of stars

  7. A Different Type of Storage
     • The internet introduced a new challenge in the form of web logs and web crawler data: large scale, “peta scale”
     • But observe that this type of data has a uniquely different characteristic from your transactional data, such as “customer order” data or “bank account” data:
        • The data type is “write once read many (WORM)”
        • Privacy-protected healthcare and patient information
        • Historical financial data
        • Other historical data
     • Relational file systems and tables are insufficient
     • Large <key, value> stores (files) and a storage management system
     • Built-in features for fault tolerance, load balancing, data transfer and aggregation, …
     • Clusters of distributed nodes for storage and computing
     • Computing is inherently parallel
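
The write-once-read-many characteristic on this slide can be illustrated with a very small sketch. The class name `WriteOnceStore` and its methods are hypothetical, invented here for illustration; this is an in-memory toy under those assumptions, not how GFS or HDFS is implemented.

```python
class WriteOnceStore:
    """Hypothetical in-memory <key, value> store with write-once semantics."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # A key may be written exactly once; updates are rejected.
        if key in self._data:
            raise KeyError(f"{key!r} was already written and cannot be updated")
        self._data[key] = value

    def get(self, key: str) -> bytes:
        # Reads are unrestricted and may be repeated any number of times.
        return self._data[key]


if __name__ == "__main__":
    store = WriteOnceStore()
    store.put("weblog-0001", b"GET /index.html")
    print(store.get("weblog-0001"))
```

Because values never change after the first write, copies of the data can be replicated across nodes without any update coordination, which is one reason this data type suits clusters of distributed nodes.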

  8. Big-data Concepts
     • Originated with the Google File System (GFS), a special <key, value> store
     • The Hadoop Distributed File System (HDFS) is the open-source version of this (currently an Apache project)
     • Parallel processing of the data using the MapReduce (MR) programming model
     • Challenges
        • Formulation of MR algorithms
        • Proper use of the features of the infrastructure (e.g., sort)
        • Best practices in using MR and HDFS
     • An extensive ecosystem consisting of other components such as column-based stores (HBase, Bigtable), big data warehousing (Hive), workflow languages, etc.
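
The MapReduce programming model named on this slide can be sketched in a few lines of plain Python. This is a single-process illustration only: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are assumptions made for this sketch and are not Hadoop APIs; on a real cluster the framework performs the shuffle and runs many map and reduce tasks in parallel.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a <word, 1> pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all values that share the same key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data on clusters"]))
```

Formulating an MR algorithm mostly means deciding what the keys and values are; here the key is the word and the value is a partial count.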

  9. Data & Analytics
     • We have witnessed an explosion in algorithmic solutions.
     • “In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” Grace Hopper
     • What you cannot achieve by an algorithm can be achieved by more data.
     • Big data, if analyzed right, gives you better answers: compare the Centers for Disease Control prediction of flu with the prediction of flu through “search” data, 2 full weeks before the onset of flu season! (see the reference)

  10. Cloud Computing
      • The cloud is a facilitator for Big Data computing and is indispensable in this context
      • The cloud provides processors, software, operating systems, storage, monitoring, load balancing, clusters and other requirements as a service
      • The cloud offers accessibility to Big Data computing
      • Cloud computing models:
         • Platform (PaaS): Microsoft Azure, Google App Engine (GAE)
         • Software (SaaS)
         • Infrastructure (IaaS): Amazon Web Services (AWS)
      • Services-based application programming interface (API)

  11. Big-data Courses
      • We introduced the concepts in a two-course sequence:
      • Course 1 (has become a core course): CSE 486
         • Foundational concepts of MapReduce and the Hadoop Distributed File System are introduced as part of the Distributed Systems course
         • The last project/lab in the distributed systems course is based on MapReduce (MR) concepts and is implemented on HDFS
      • Course 2 (has become an elective course): CSE 487
         • The second course focuses completely on Big-data issues, mostly MR algorithm formulation and best practices
         • Analytics on large clusters and on the cloud
         • The textbook we use for this course deals with algorithms and data structures for MR and the best ways to leverage the parallelism in the MR family of operations (map, reduce, combine, partition, etc.)
         • Other Big Data stores such as HBase and Hive, and workflows, are also introduced
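
Two of the MR operations listed for Course 2, combine and partition, can be sketched as follows. This is a minimal single-process illustration; the function names and the choice of `zlib.crc32` as the partitioning hash are assumptions made for this sketch, not the course's or Hadoop's implementation.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: pre-aggregate one mapper's <word, count> pairs locally."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def partition(key: str, num_reducers: int) -> int:
    """Partitioner: pick a reducer index deterministically from the key."""
    # crc32 is stable across runs, unlike Python's built-in str hash.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run(splits, num_reducers=2):
    """Map each input split, combine locally, then route pairs to reducers."""
    buckets = [Counter() for _ in range(num_reducers)]
    for split in splits:
        mapped = [(w, 1) for line in split for w in line.lower().split()]
        for word, count in combine(mapped):
            buckets[partition(word, num_reducers)][word] += count
    return buckets


if __name__ == "__main__":
    for index, bucket in enumerate(run([["a b a"], ["b c"]])):
        print(index, dict(bucket))
```

The combiner reduces the number of pairs sent across the network, and the partitioner guarantees that every pair for a given key reaches the same reducer.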

  12. Big-data Certificate Program
      • Official name is Data-intensive Computing Certificate
      • Initiated with support from the NSF TUES program
      • Approved by the SUNY system in Fall 2011
      • Offered by the University
      • For enrolled undergraduates of any major!
      • Details of the program
         • CS1, CS2
         • Distributed systems (CSE486): pre-req CS2
         • Data-intensive computing systems (CSE487): pre-req CS2
         • An elective in the discipline of choice (e.g., BIO4XY or MGS4XY)
         • A capstone project applying data-intensive computing (e.g., BIO499 or MGS499)

  13. Evaluation
      • Did the students learn?
      • Enrollment in the courses?
      • Enrollment in the certificate program?
      • Availability of resources
      • Pedagogical challenges (student background, textbooks, professional preparation of teachers for big data)
      • Employability of the students?

  14. Findings
      • There is high demand from students for “data-intensive” and big data computation related courses
      • The certificate program is hard for non-CSE majors
      • High demand for big-data skills from employers
      • Educators and administrators need to be educated about big data (remember the times we educated people on object orientation, Java, web enabling, etc.)
      • It is imperative that we improve the preparedness of our workforce at all levels in Big Data skills for global competitiveness
      • Just one course or a single certificate is NOT enough: we need continuous and repeated exposure to Big Data in various contexts
      • It is often very hard to create and sustain a new curriculum
      • How can we address these challenges?

  15. Recommendations
      • Introduce big-data concepts as an integral part of the UG curriculum
         • For example, for CS: a simple word count on big data in CS1, the map-reduce algorithm in CS2, cloud storage and Bigtable in Database Systems, Hadoop in Distributed Systems, and the entire big-data analytics in other elective courses such as Machine Learning and Data Mining
      • Use compelling examples built on real-world datasets
      • Train the educators: big-data professional development for the academic core is critical
      • Expose the administrators to the use of Big Data applications/tools in all possible areas: institutional analysis; data collected at various educational institutions is a gold mine for macro-level analytics: “What is the trend?” “Are they learning?”
      • Train the counselors who advise high school students, and college entry-level counselors
      • Include the community colleges and four-year colleges
      • Need investment from major industries (mentoring, educator days, etc.)

  16. Demos
      • Simple word count using the MR model on HDFS on the local machine
         • Foundation for many algorithms such as word cloud
         • Simple and easy to understand
         • Project Gutenberg
      • Simple co-occurrence analysis of Twitter data
         • Twitter has donated the entire collection of tweets to the Library of Congress
      • Amazon MR workflow and working with AWS facilities
      • Finally, a sample run of a 10-million-node tree on a compute cluster at the Center for Computational Research (CCR) at Buffalo
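
The co-occurrence demo is not reproduced in this transcript, so the following is only a hedged sketch of one common way to formulate it: each tweet emits a <(word_a, word_b), 1> pair for every distinct pair of words it contains, and the counts are summed per pair. The function names and the sample tweets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_pairs(tweet: str):
    """Map step: emit <(word_a, word_b), 1> for each distinct word pair."""
    # Sorting makes (a, b) and (b, a) the same key.
    words = sorted(set(tweet.lower().split()))
    for pair in combinations(words, 2):
        yield pair, 1


def cooccurrence(tweets):
    """Reduce step: sum the emitted counts for every word pair."""
    counts = Counter()
    for tweet in tweets:
        for pair, one in map_pairs(tweet):
            counts[pair] += one
    return counts


if __name__ == "__main__":
    sample = ["big data rocks", "big data", "data rocks"]
    for pair, count in cooccurrence(sample).most_common():
        print(pair, count)
```

The number of emitted pairs grows quadratically with the number of distinct words per tweet, which is manageable for short texts such as tweets.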

  17. Summary
      • We explored the need for data-intensive or big-data computing
      • We illustrated Big Data concepts and demonstrated the cloud capabilities through simple applications
      • Data-intensive computing on the cloud is an essential and indispensable skill for the workforce of today and tomorrow
      • University at Buffalo has implemented a SUNY-wide Certificate Program in Data-intensive Computing
      • One actionable thing we could do is form a group of people passionate about Big Data and work at introducing it in their courses/projects

  18. References & useful links
      • Flu prediction reference: J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski and L. Brilliant. Detecting influenza epidemics using search engine query data. Nature 457, 1012-1014 (19 February 2009). http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html
      • Twitter and Library of Congress: A. Watters. How Library of Congress is Building the Twitter Archive. http://radar.oreilly.com/2011/06/library-of-congress-twitter-archive.html, last viewed July 2012.
      • Project web page for all the project material, including course description, course material, project description, several presentations, useful links, and references: http://www.cse.buffalo.edu/faculty/bina/DataIntensive