
Empirical Evaluation of innovations in automatic repair


Presentation Transcript


  1. Empirical Evaluation of innovations in automatic repair Claire Le Goues Site visit February 7, 2013

  2. “Benchmarks set standards for innovation, and can encourage or stifle it.”-Blackburn et al.

  3. Automatic program repair over time • 2009: 15 papers on automatic program repair* • 2011: Dagstuhl seminar on self-repairing programs • 2012: 30 papers on automatic program repair* • 2013: dedicated program repair track at ICSE. *Manually reviewed the results of a search of the ACM Digital Library for “automatic program repair”

  4. Current approach • Manually sift through bugtraq data. • Indicative example: the Axis project for automatically repairing concurrency bugs. • 9 weeks of sifting to find 8 bugs to study. • Direct quote from Charles Zhang, senior author, on the process: “it’s very painful” • Very difficult to compare against previous or related work or generate sufficiently large datasets.

  5. Goal: High-Quality Empirical evaluation

  6. SubGoal: High-Quality Benchmark Suite

  7. Benchmark Requirements • Indicative of important real-world bugs, found systematically in open-source programs. • Support a variety of research objectives. • “Latitudinal” studies: many different types of bugs and programs • “Longitudinal” studies: many iterative bugs in one program. • Scientifically meaningful: passing test cases → repair • Admit push-button, simple integration with tools like GenProg.

  8. Benchmark Requirements • Indicative of important real-world bugs, found systematically in open-source programs. • Support a variety of research objectives. • “Latitudinal” studies: many different types of bugs and programs • “Longitudinal” studies: many iterative bugs in one program. • Scientifically meaningful: passing test cases → repair • Admit push-button, simple integration with tools like GenProg.

  9. Systematic Benchmark Selection • Goal: a large set of important, reproducible bugs in non-trivial programs. • Approach: use historical data to approximate discovery and repair of bugs in the wild. http://genprog.cs.virginia.edu
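To make that approach concrete, here is a minimal sketch of the idea under purely illustrative assumptions (a revs.txt listing revision ids oldest-first, a `make` build, and a hypothetical ./run_tests.sh that exits 0 iff the whole suite passes): a revision whose test suite starts passing right after a failing predecessor is flagged as a candidate historical bug/fix pair.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

/* Build and test one revision; returns 1 if the whole suite passes. */
static int suite_passes(const char *rev) {
    char cmd[512];
    snprintf(cmd, sizeof cmd,
             "git checkout -q %s && make -s && ./run_tests.sh", rev);
    int status = system(cmd);
    return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

int main(void) {
    FILE *f = fopen("revs.txt", "r");   /* one revision id per line, oldest first */
    if (!f) { perror("revs.txt"); return 1; }

    char prev[128], curr[128];
    int have_prev = 0, prev_passed = 0;
    while (fscanf(f, "%127s", curr) == 1) {
        int passed = suite_passes(curr);
        /* Suite failed at prev but passes at curr: prev is a candidate buggy
         * revision, and curr contains the human-written fix. */
        if (have_prev && !prev_passed && passed)
            printf("candidate bug/fix pair: %s -> %s\n", prev, curr);
        strcpy(prev, curr);
        prev_passed = passed;
        have_prev = 1;
    }
    fclose(f);
    return 0;
}
```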

  10. New bugs, new programs • Indicative of important real-world bugs, found systematically in open-source programs: • Add new programs to the set, with as wide a variety of types as possible (support “latitudinal” studies) • Support a variety of research objectives: • Allow studies of iterative bugs, development, and repair: generate a very large (100) set of bugs in one program (php) (support “longitudinal” studies).

  11. Benchmark Requirements • Indicative of important real-world bugs, found systematically in open-source programs. • Support a variety of research objectives. • “Latitudinal” studies: many different types of bugs and programs • “Longitudinal” studies: many iterative bugs in one program. • Scientifically meaningful: passing test cases → repair • Admit push-button, simple integration with tools like GenProg.

  12. Test Case Challenges • They must exist. • Sometimes, but not always, true (see: Jonathan Dorn)

  13. Benchmarks

  14. Test Case Challenges • They must exist. • Sometimes, but not always, true (see: Jonathan Dorn) • They should be of high quality. • This has been a challenge from day 0: nullhttpd • Lincoln Labs noticed it too: sort • In both cases, adding test cases led to better repairs.

  15. Test Case Challenges • They must exist. • Sometimes, but not always, true (see: Jonathan Dorn) • They should be of high quality. • This has been a challenge from day 0: nullhttpd • Lincoln Labs noticed it too: sort • In both cases, adding test cases led to better repairs. • They must be automated to run one at a time, programmatically, from within another framework.

  16. Push-button Integration • Need to be able to compile and run new variants programmatically. • Need to be able to run test cases one at a time. • It’s not simple, and as we scale up to real-world systems, it becomes increasingly tricky. • Much of the challenge is unrelated to the program in question, instead requiring highly technical knowledge of OS-level details.

  17. Digression on wait() • Calling a process from within another process: system(“run test 1”) → ...; wait() • wait() returns the process exit status. • This is complex. • Example: a system call can fail because the OS ran out of memory in creating the process, or because the process itself ran out of memory. • How do we tell the difference? • Answer: bit masking.
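A minimal C sketch of that digression, assuming a hypothetical test command ./run_test_1.sh: the standard <sys/wait.h> macros distinguish “the OS could not run the test” from “the test ran and reported a result,” and they encapsulate the bit masking the slide alludes to (WEXITSTATUS is traditionally (status >> 8) & 0xff).

```c
#include <stdio.h>
#include <stdlib.h>      /* system() */
#include <sys/wait.h>    /* WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG */

int main(void) {
    /* Hypothetical test command; stands in for "run test 1". */
    int status = system("./run_test_1.sh");

    if (status == -1) {
        /* system() itself failed: e.g., the OS could not create the child. */
        perror("system");
    } else if (WIFEXITED(status)) {
        /* The test ran to completion; decode its own exit code. */
        printf("test exited with code %d\n", WEXITSTATUS(status));
    } else if (WIFSIGNALED(status)) {
        /* The test process was killed by a signal (e.g., it crashed). */
        printf("test killed by signal %d\n", WTERMSIG(status));
    }
    return 0;
}
```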

  18. Real-world Complexity • Moral: integration is tricky, and lends itself to human mistakes. • Possibility 1: the original programmers make mistakes in developing the test suite. • Test cases can have bugs, too. • Possibility 2: we (GenProg devs/users) make mistakes in integration. • A few old php test cases are not up to our standards: faulty bit-shift math for extracting the return-value components.

  19. Integration Concerns • Interested in more, better benchmark design, with easy integration (without gnarly OS details). • Virtual machines provide one approach. • Need a better definition of “high-quality test case” vs. “low-quality test case”: • Can the empty program pass it? • Can every program pass it? • Can the “always crashes” program pass it?
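One hedged way to operationalize those sanity checks: run each test against deliberately trivial program variants and flag any test that a trivial variant can pass. A sketch in C, where the variant binaries and the test-script calling convention are assumptions made for illustration:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Returns 1 if `variant` passes `test`, 0 otherwise. Hypothetical convention:
 * the test script takes the program under test as its first argument and
 * exits 0 on a pass. */
static int passes(const char *variant, const char *test) {
    char cmd[256];
    snprintf(cmd, sizeof cmd, "%s %s", test, variant);
    int status = system(cmd);
    return status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

int main(void) {
    const char *trivial[] = { "./empty_program", "./always_crashes" };
    const char *test = "./run_test_1.sh";
    for (int i = 0; i < 2; i++)
        if (passes(trivial[i], test))
            printf("LOW QUALITY: %s is passed by %s\n", test, trivial[i]);
    return 0;
}
```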

  20. Current Repair Success • Over the past year, we have conducted studies of representation and operators for automatic program repair: • One-point crossover on patch representation. • Non-uniform mutation operator selection. • Alternative fault localization framework. • Results on the next slide incorporate “all the bells and whistles:” • Improvements based on those large-scale studies. • Manually confirmed quality of testing framework.
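As a concrete illustration of the first item, a minimal sketch of one-point crossover on the patch representation, assuming a patch is simply an ordered list of edits (the Edit and Patch types here are hypothetical stand-ins; GenProg itself is implemented in OCaml):

```c
#include <stdlib.h>
#include <string.h>

typedef struct { char desc[64]; } Edit;             /* e.g. "insert stmt 12 after stmt 40" */
typedef struct { Edit *edits; size_t len; } Patch;  /* a patch is an ordered edit list */

/* One-point crossover: the child takes parent a's edits up to cut_a,
 * followed by parent b's edits from cut_b onward.
 * Caller guarantees cut_a <= a->len and cut_b <= b->len. */
static Patch crossover(const Patch *a, const Patch *b, size_t cut_a, size_t cut_b) {
    Patch child;
    child.len = cut_a + (b->len - cut_b);
    child.edits = malloc(child.len * sizeof(Edit));
    memcpy(child.edits, a->edits, cut_a * sizeof(Edit));
    memcpy(child.edits + cut_a, b->edits + cut_b, (b->len - cut_b) * sizeof(Edit));
    return child;
}
```

The second child is produced symmetrically by swapping the roles of the two parents.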

  21. Current Repair Success

  22. Transition

  23. Repair Templates Claire Le Goues Shirley Park DARPA Site visit February 7, 2013

  24. BIO + CS INTERACTION

  25. Immunology: T-cells • Immune response is equally fast for large and small animals. • The human lung is 100x larger than the mouse lung, yet the immune system still finds influenza infections in ~8 hours. • Successfully balances local search and global response. • Balance between generic and specialized T-cells: • Rapid response to new pathogens vs. long-term memory of previous infections (cf. vaccines).

  26. [Diagram: INPUT, MUTATE, EVALUATE FITNESS, DISCARD, ACCEPT, OUTPUT]

  27. Automatic software repair • Tradeoff between generic mutation actions and more specific action templates: • Generic: INSERT, DELETE, REPLACE • Specific: if (<var> != NULL) { <code using <var>> }
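A hedged sketch of that distinction, treating mutation purely as text manipulation for illustration (the enum and the snprintf-based instantiation are illustrative stand-ins, not GenProg's real AST-level machinery):

```c
#include <stdio.h>

/* Generic mutation actions operate on arbitrary statements. */
enum generic_op { OP_INSERT, OP_DELETE, OP_REPLACE };

/* A specific template bakes in structure and leaves only holes to fill.
 * Here: wrap a statement in a NULL check on some in-scope pointer variable. */
static void instantiate_null_check(char *out, size_t n,
                                   const char *var, const char *stmt) {
    snprintf(out, n, "if (%s != NULL) { %s }", var, stmt);
}

int main(void) {
    char buf[256];
    instantiate_null_check(buf, sizeof buf, "ptr", "use(ptr);");
    puts(buf);   /* prints: if (ptr != NULL) { use(ptr); } */
    return 0;
}
```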

  28. Hypothesis: GenProg can repair more bugs, and repair bugs more quickly, if we augment mutation actions with “repair templates.”

  29. Option 1: Previous Changes • Insight: just as T-cells “remember” previous infections, abstract previous fixes to generate new mutations. • Approach: • Model previous changes using structured documentation. • Cluster a large set of changes by similarity. • Abstract the center of each cluster. • Example: if (<var> < 0) return 0; else <code using <var>>

  30. Option 2: Existing behavior • Insight: just as a library lets people look up the best examples of what they want to reproduce, a program’s existing API usage provides examples of correct behavior. • Approach: • Generate static paths through C programs. • Mine API usage patterns from those paths (see the sketch below). • Abstract the patterns into mutation templates. • Example: while (it.hasNext()) <code using it.next()>
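A heavily hedged sketch of the mining step, reduced to counting which API call most often directly follows another along the extracted paths (paths.txt, its one-path-per-line format, and the frequency threshold are all illustrative assumptions):

```c
#include <stdio.h>
#include <string.h>

/* Count how often call B directly follows call A on the mined static paths.
 * Frequent pairs become candidate templates, e.g. "hasNext" then "next". */
#define MAX_PAIRS 1024
struct pair { char a[64], b[64]; int count; } pairs[MAX_PAIRS];
int npairs = 0;

static void record(const char *a, const char *b) {
    for (int i = 0; i < npairs; i++)
        if (!strcmp(pairs[i].a, a) && !strcmp(pairs[i].b, b)) { pairs[i].count++; return; }
    if (npairs < MAX_PAIRS) {
        snprintf(pairs[npairs].a, 64, "%s", a);
        snprintf(pairs[npairs].b, 64, "%s", b);
        pairs[npairs++].count = 1;
    }
}

int main(void) {
    FILE *f = fopen("paths.txt", "r");  /* one path per line, space-separated call names */
    if (!f) { perror("paths.txt"); return 1; }
    char line[4096];
    while (fgets(line, sizeof line, f)) {
        char *prev = NULL, *tok = strtok(line, " \t\n");
        while (tok) {
            if (prev) record(prev, tok);
            prev = tok;
            tok = strtok(NULL, " \t\n");
        }
    }
    fclose(f);
    for (int i = 0; i < npairs; i++)
        if (pairs[i].count >= 10)   /* arbitrary threshold */
            printf("candidate template: after %s(), code usually calls %s()\n",
                   pairs[i].a, pairs[i].b);
    return 0;
}
```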

  31. This work is ongoing.

  32. Conclusions • We are generating a benchmark suite to support GenProg research, integration and tech transfer, and the automatic repair community at large. • Current GenProg results for the 12-hour repair scenario: 87/163 (53%) of the real-world bugs in the dataset. • Repair templates will augment GenProg’s mutation operators to help repair more bugs, and repair bugs more quickly.
