Security Data Science (SDS)

ENEE 759D | ENEE 459D | CMSC 858Z. Security Data Science (SDS). Prof. Tudor Dumitraș. Assistant Professor, ECE, University of Maryland, College Park. http://ter.ps/759d https://www.facebook.com/SDSAtUMD. Introducing Your Instructor. Tudor Dumitraș Office: AVW 3425

Presentation Transcript


  1. ENEE 759D | ENEE 459D | CMSC 858Z Security Data Science (SDS) Prof. Tudor Dumitraș Assistant Professor, ECE, University of Maryland, College Park http://ter.ps/759d https://www.facebook.com/SDSAtUMD

  2. Introducing Your Instructor Tudor Dumitraș Office: AVW 3425 Email: tdumitra@umiacs.umd.edu Course Website: http://ter.ps/759d Office Hours: Mon 2-3 pm

  3. My Background • Ph.D. at Carnegie Mellon University • Research in distributed systems and fault-tolerant middleware • Worked at Symantec Research Labs • Built WINE platform for Big Data experiments in security • WINE currently used by academic researchers and Symantec engineers • Joined UMD faculty • Research and teaching on applied security and systems • Focus on solving security problems with data analysis techniques

  4. SDS In A Nutshell • Course objectives • Ability to understand and interpret scholarly publications, to explain their key ideas, and to provide constructive feedback • Ability to apply some of these ideas in practice • Topics • Grading • 50% paper reviews and class participation • 50% projects

  5. We Are Swimming in Data • Data created/reproduced in 2010: 1,200 exabytes • Data collected to find the Higgs boson: 1 gigabyte / s • Yahoo: 200 petabytes across 20 clusters • Security: • Global spam in 2011: 62 billion / day • Malware variants created in 2011: 403 million

  6. Why So Much Data? • We can store it • 6¢ / GB • 29¢ / GB (SAS HDD) • We can generate it • Most data is machine-generated • Most malware samples are variants of other malware, generated automatically (repacking, obfuscation) What to do with all this data?
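
To put the "we can store it" point in numbers, here is a back-of-the-envelope sketch that prices out Yahoo's 200 petabytes at the two per-gigabyte figures on the slide. It assumes decimal units (1 PB = 1,000,000 GB) and a flat price per gigabyte with no replication or operating costs; the function name `storage_cost_usd` is made up for this illustration.

```python
# Back-of-the-envelope storage cost, using figures quoted on the slides.
# Assumptions: decimal units (1 PB = 1,000,000 GB), flat per-GB price,
# no replication, power, or operations cost.

GB_PER_PB = 1_000_000


def storage_cost_usd(petabytes: float, cents_per_gb: float) -> float:
    """Return the raw disk cost in US dollars for the given capacity."""
    return petabytes * GB_PER_PB * cents_per_gb / 100


if __name__ == "__main__":
    for cents in (6, 29):
        cost = storage_cost_usd(200, cents)
        print(f"200 PB at {cents} cents/GB: ${cost:,.0f}")
```

Under these assumptions the raw disks cost roughly $12 million at 6¢ / GB and $58 million at 29¢ / GB.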

  7. Three Stories about Data

  8. One: What Questions to Ask on a First Date? The Power of Big Data

  9. If You Want to Know … Do my date and I have long-term potential?

  10. If You Want to Know … Do my date and I have long-term potential? … ask: 275,000 user-submitted questions 34,260 real-world couples • Do you like horror movies? • Have you ever traveled around another country alone? • Wouldn't it be fun to chuck it all and go live on a sailboat? 3.7× Top 3 user-rated questions, about: • God • Sex • Smoking Likelihood of coincidence Psychology Data

  11. Online Dating and Big Data • eHarmony • Analyzes hundreds of behavioral variables, most collected automatically • CTO: former search engineer at Yahoo! • OkCupid: “We do math to get you dates” • Founded by Harvard math & CS majors • PlentyOfFish: “Building this matching system was harder than [being] cited in the paper that won the Fields Medal” Source: CNN Money

  12. Early 1900s: Most Factories Had Private Generators Source: Nicholas Carr Electricity was critical for business, but not widely available

  13. Data analytics provide remarkable insight • Applications in many disciplines • Is he an engineer? Does she date engineers? Source: OkCupid

  14. What Is Data Science? • Also known as … • Big Data analytics • Machine intelligence • Data-intensive computing • Data wrangling • Data munging • Data jujitsu Source: Drew Conway

  15. Two: Improving Machine Translation. The Unreasonable Effectiveness of Data

  16. 2005 NIST Machine Translation Competition English-Arabic competition • Google’s first entry • None of the engineers spoke Arabic • Simple statistical approach • Trained using United Nations documents • 200 million translated words • 1 trillion monolingual words

  17. For many hard problems there appears to be a threshold of sufficient data A. Halevy, et al., CACM 2009.

  18. What is Security Data Science? • Also known as … • Security analytics • Surveillance analytics • Applying data science methods to security problems

  19. Security Principles in 60 Seconds [J. Saltzer & M. Schroeder, SOSP 1973] • Economy of mechanism: Keep the protection mechanism as simple and small as possible • Fail-safe defaults: Base access decisions on permission rather than exclusion • Complete mediation: Check every access to every object • Open design: Do not keep the design secret • Separation of privilege: Require two keys to unlock, not one • Least privilege: Grant every program/user the least set of privileges necessary to complete the job • Least common mechanism: Minimize the amount of mechanism common to more than one user and depended on by all users • Psychological acceptability: Design interfaces for ease of use
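
Three of these principles translate directly into code. The sketch below is a toy access check, not a real authorization system: the `ACL` table, the users, and the object name are all hypothetical. It shows least privilege (each user gets only the rights listed), fail-safe defaults (anything not listed is denied), and complete mediation (every access goes through one check).

```python
# Toy access check illustrating three of the principles listed above.
# All names (ACL, is_allowed, access, the users and objects) are hypothetical.

# Least privilege: each user is granted only the specific rights it needs.
ACL = {
    ("alice", "report.txt"): {"read", "write"},
    ("bob", "report.txt"): {"read"},
}


def is_allowed(user: str, obj: str, action: str) -> bool:
    """Fail-safe default: anything not explicitly permitted is denied."""
    return action in ACL.get((user, obj), set())


def access(user: str, obj: str, action: str) -> str:
    """Complete mediation: every access goes through the same check."""
    if not is_allowed(user, obj, action):
        raise PermissionError(f"{user} may not {action} {obj}")
    return f"{user} performed {action} on {obj}"
```

Note that the check is written as "permit if listed" rather than "deny if blacklisted": an unknown user or a forgotten entry results in a denial, which is the safe failure mode.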

  20. Security in Practice (Source: C. Nachenberg, Symantec) • 1986: Simple computer viruses • Defense: anti-virus • 1990: Polymorphic viruses (decryption logic + encrypted malicious code) • Defense: “universal” decoder, emulation • 1995: Macro viruses • Defense: AV vendor cooperation, digital signatures for macros • 1999: Worms • Defense: Vulnerability-specific signatures • 2004: Web-based malware • Defense: behavior blocking • 2006: Auto-generated malware • Defense: reputation-based security • 2010 (but probably earlier): Targeted attacks (physical infrastructure, 0-day, etc.) • Defense: ??

  21. Three: Understanding Zero-Day Attacks. The Need for Security Data Science

  22. Zero-Day Attacks: Recent Examples Zero-day attack = cyber attack exploiting a software vulnerability before the public disclosure of the vulnerability • 2011: Attack against RSA • 2010: Stuxnet • 2009: Operation Aurora against Google

  23. Price of Zero-Day Exploits on the Black Market The Economist, March 2013

  24. The Elderwood Project Group with “seemingly unlimited” supply of zero-day exploits (Source: Symantec)

  25. Zero-Day Attacks: Open Questions Decade-long open questions • How common are zero-day attacks? • How long can they remain undiscovered? • What happens after disclosure? • Prior work [Arbaugh 2000, Frei 2008, McQueen 2009, Shahzad 2012] [Figure: vulnerability timeline marking the zero-day attack; event labels: creation, exploit used in attacks, vulnerability disclosed (“day zero”), security patch released, all hosts patched]
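
The zero-day definition from the earlier slide (an exploit used before public disclosure) can be written as a simple date comparison. This is a minimal sketch, assuming we somehow know the date an exploit was first seen in the wild and the disclosure date; the function names and the dates in the usage note are invented for illustration and are not measurements.

```python
from datetime import date


def is_zero_day(first_exploit_seen: date, disclosed: date) -> bool:
    """An attack is zero-day if the exploit was used before disclosure."""
    return first_exploit_seen < disclosed


def zero_day_window_days(first_exploit_seen: date, disclosed: date) -> int:
    """Days the exploit was in use before disclosure; 0 if not a zero-day."""
    return max((disclosed - first_exploit_seen).days, 0)


if __name__ == "__main__":
    # Hypothetical dates, for illustration only.
    seen, disclosed = date(2010, 1, 1), date(2010, 3, 1)
    print(is_zero_day(seen, disclosed), zero_day_window_days(seen, disclosed))
```

The comparison itself is trivial. The hard part, and the reason these questions stayed open, is obtaining a trustworthy `first_exploit_seen` date for events that are rare and deliberately hidden.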

  26. Zero-Day Attacks: Open Questions (cont’d) Decade-long questions: Why still open? • Rare events, hard to observe in small data sets • Need data analysis at scale [Figure: vulnerability timeline in weeks; before disclosure: targeted attacks (rare events); after disclosure: large-scale attacks; event labels: creation, exploit used in attacks, vulnerability disclosed (“day zero”), security patch released, all hosts patched]

  27. Research in Security Data Science • Challenge 1: Find the needle in the haystack • Example: Identify and measure zero-day attacks • Challenge 2: Ensure generally applicable and repeatable results • The threat landscape changes frequently • Challenge 3: Deal with new and advanced threats • Skilled and persistent hackers can bypass firewalls, anti-virus, password-protected systems, two-factor authentication, physical isolation […] [Figure: y-axis: Variants; x-axis: time in weeks around T0; annotations: 403 million new malware variants created in 2011; rare events: targeted attacks before disclosure; “Your thesis topic goes here”]

  28. What is Security Data Science? (re-visited) • Systems knowledge: develop technologies needed to store and process massive data sets • Statistics & machine learning knowledge: analyze the data and extract information • Security knowledge: ask the right questions about cyber attacks • Data scientists are in high demand in the cybersecurity industry • “Booz Allen may be recruiting more [data scientists] than Google or Facebook” (The Economist, June 2013)

  29. Course Content • Introduction to Security Data Science • Hands-on emphasis – this is largely an unexplored research area • Team-based projects • Reviews of scholarly publications • No textbook • Specific things you can expect to learn • Selected topics in security • System skills: Experiment design, data analysis, scalability • Team skills: Cooperating to achieve your team goals • Speaking/writing skills: Presenting paper/project findings, providing constructive feedback

  30. This is an Advanced Course • You are responsible for holding up your end of the educational bargain • I expect you to attend classes and to complete reading assignments • I expect you to learn how to analyze data and to try things out for yourself • I expect you to know how to find research literature on security topics • The required readings provide starting points • I expect you to manage your time • In general there will be one written assignment due before each lecture • Learning material in this course requires participation • This is not a sit-back-and-listen kind of course; class participation is required for understanding the material and makes up a part of your grade! • Different grading criteria for graduate and undergraduate students

  31. Reading Assignments • Readings: 1-2 papers before each lecture • Not light reading – some papers require several readings to understand • For next time: C. Kanich et al., 'Spamalytics: An Empirical Analysis of Spam Marketing Conversion,' ACM CCS, 2008. • Check course web page (still in flux) for next readings and links to papers • Homework: review the papers you read using a defined template • Submit homework by email to tdumitra@umiacs.umd.edu • We might switch to a Web-based submission system in the future • Due at 6 pm the evening before class • BibTeX template: Summary, Contributions, Weaknesses, Opinion (optional) • I will provide feedback on some of your written critiques; no email means your writeup is satisfactory • In-class discussion: stand up and talk about the papers • Volunteers are preferred • Students randomly selected if no volunteers

  32. Discuss … Do my date and I have long-term potential? … ask: 275,000 user-submitted questions 34,260 real-world couples • Do you like horror movies? • Have you ever traveled around another country alone? • Wouldn't it be fun to chuck it all and go live on a sailboat? 3.7× Top 3 user-rated questions, about: • God • Sex • Smoking Likelihood of coincidence Psychology Data

  33. Course Projects • Pilot project: two-week individual projects • Propose a security problem and a data set that you could analyze to solve it • Some ideas are available on the web page • Conduct preliminary data analysis and write a report • Propose projects by September 9th (soft deadline) • Submit report by September 18th • Group project: ten-week group project • Deeper investigation of promising approaches • Submit written report and present findings during last week of class • 2 checkpoints along the way (schedule on the course web page) • Form teams and propose projects by September 30th • Peer reviews: review at least 2 project reports from other students • Use skills learned from paper reviews • Post project proposals, reports and reviews on Piazza

  34. Pre-Requisite Knowledge • Good programming skills • Knowledge of languages commonly used in data analysis, like Matlab or R, is a plus • To brush up: ‘Data Analysis and Visualization with MATLAB for Beginners’ seminar, on September 12 at 5pm, Room 1110 Kim Engineering Building • Ability to come up to speed on advanced security topics • Covered in the paper readings • Basic knowledge of security (CMSC 414, ENEE 459C or equivalent) is a plus • Ability to come up to speed on data analytics • Lectures provide light-duty tutorials, but you will need to pick up the details as you go along

  35. Policies • “Showing up is 80% of life” – Woody Allen • Participation in in-class discussions is required for full credit • You can get an “A” with a few missed assignments, but reserve these for emergencies (conference trips, waking up sick, etc.) • Notify the instructor if you need to miss a class, and submit your homework on time • UMD’s Code of Academic Integrity applies, modified as follows: • Complete your homework entirely on your own. After you hand in your homework, you are welcome (and encouraged) to discuss it with others • Discuss the problems and concepts involved in the project, but produce your own project implementation, report and presentation • Group projects are the result of team work • See class web site for the official version

  36. Classroom Protocol • Please arrive on time; lecture begins promptly • I also promise to end on time • Handouts, readings and homework templates are posted on the class web page • Questions are encouraged • If you don’t understand, ask; other students are probably struggling too • Explain the content of your reading assignment, and the underlying reasoning, to the rest of the class • Your reasons don't have to be “right”; you just have to be able to explain them • There is no way to cover everything • If there is an interesting aspect that we do not cover in class, feel free to incorporate that in your projects

  37. Grading Criteria • Straight scale: A≥90; B≥80; C≥70; D<70 • 50% Written paper critique and class discussion • 24 assignments x 2 points each + 2 points for this lecture • 50% Projects • 30 points for group project, 10 points for pilot project, 10 points for project reviews • 10% Subjective evaluation • Expectations • Graduate students: you can explain the contributions and weaknesses of the papers you read • Undergraduates: you demonstrate a general understanding of the papers • Unsatisfactory participation means: • You did not read the papers • You did not produce a working implementation for your project, or you do not understand how the implementation works
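
The point arithmetic on this slide can be checked with a few lines of code. This is only a sketch of the numbers as stated: the paper-review and project points each add up to 50, and the straight scale maps a total score to a letter. The 10% subjective evaluation is left out because the slide does not say how it combines with the other components; the constant and function names are invented here.

```python
# Point arithmetic from the grading slide. The 10% subjective evaluation is
# omitted: the slide does not say how it combines with the other components.

REVIEW_POINTS = 2          # per paper-review assignment
NUM_REVIEWS = 24
FIRST_LECTURE_POINTS = 2   # "2 points for this lecture"
PROJECT_POINTS = {"group": 30, "pilot": 10, "reviews": 10}


def max_paper_points() -> int:
    """Maximum points from paper critiques and class discussion."""
    return NUM_REVIEWS * REVIEW_POINTS + FIRST_LECTURE_POINTS


def max_project_points() -> int:
    """Maximum points from the pilot project, group project, and reviews."""
    return sum(PROJECT_POINTS.values())


def letter_grade(score: float) -> str:
    """Straight scale from the slide: A>=90, B>=80, C>=70, otherwise D."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    return "D"
```

For example, `letter_grade(max_paper_points() + 35)` evaluates a student with full review credit and 35 of the 50 project points.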

  38. Review of Lecture • What did we learn? • Data analytics provide real benefits • Analyzing large data sets allows tackling long-standing hard problems • Difference between security principles and security in practice • Examples of security problems that require insights from large data sets • I want to emphasize • This is a systems course, not a pen-and-paper course • You will be expected to build a real, working data analysis tool • What’s next? • Basic statistics and experimental design • Pilot project: proposal, approach, expectations • Deadline reminder • Post pilot project proposal on Piazza by Monday (soft deadline) • First homework due on Sunday at 6 pm

  39. Dive In http://ter.ps/759d
