See More, Learn More, Tell More


Presentation Transcript


  1. See More, Learn More, Tell More
  Tim Menzies, West Virginia University, tim@menzies.us
  Galaxy Global: Robert (Mike) Chapman, Justin Di Stefano
  NASA IV&V: Kenneth McGill, Pat Callis
  WVU: Kareem Ammar
  JPL: Allen Nikora
  DN American: John Davis

  2. See more! Learn more! Tell more!
  • Important: transition to the broader NASA software community
  • What's unique about OSMA research?

  3. Old dialogue: "v(G) > 10 is a good thing"
  • New dialogue:
    • A core method in my analysis was M1
    • I have compared M1 to M2
    • On NASA-related data
    • Using criteria C1
    • I argue for the merits of C1 as follows (possibly via a discussion of C2, C3, …)
    • I endorse/reject M1 because of that comparison
  Show me the money! Show me the data!

  4. The IV&V Holy Grail
  • Learn "stuff" in the early lifecycle that would lead to late-lifecycle errors
  • Actions:
    • change that stuff now, OR
    • plan more IV&V on that stuff
  [Diagram: artifacts flow from English text to bubbles (e.g. UML, block diagrams) to code to traces to issues, with reviewer reactions ranging from "Better watch out!" to "Hey, that's funny"]

  5. The IV&V metrics repository (http://mdp.ivv.nasa.gov/)
  • Galaxy Global (P.I. = Robert (Mike) Chapman); NASA P.O.C. = Pat Callis
  • Cost: $0
  • Currently:
    • Mostly code metrics on a small number of projects
    • Defect fields and static code measures
    • Also, for C++ code, some class-based metrics
  • Real soon: requirements mapped to code functions, plus defect logs (for only one project)
  • Some time in the near future: as above, for more projects

  6. Repositories or sarcophagi? (Use it or lose it!)
  [Figure: a data sarcophagus versus an active data repository]

  7. Who's using MDP data?
  Ammar, Kareem; Callis, Pat; Chapman, Mike; Cukic, Bojan; Davis, John; Di Stefano, Justin; Goa, Lan; McGill, Kenneth; Menzies, Tim; Dekhtyar, Alex; Hayes, Jane; Merritt, Phillip; Nikora, Allen; Orrego, Andres; Wallace, Dolores; Wilson, Aaron

  8. Project2: Learn defect detectors from code (used MDP data)
  • Not perfect predictors: hints that let us focus our effort
  • Example: v(g) = McCabe's cyclomatic complexity = pathways through a function; if v(g) > 10 then predict defects
  • Traffic light browser:
    • Green: no faults known/predicted
    • Red: faults detected, i.e. a link to a fault database
    • Yellow: faults predicted, based on past experience, learnt via data mining
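The following is a minimal sketch of the traffic-light idea on this slide. The module list, its field names, and the known_faults set are invented for illustration; only the "if v(g) > 10 then predict defects" rule and the green/yellow/red scheme come from the slide.

```python
# Traffic-light browser sketch.  The modules list and known_faults set are
# hypothetical; the v(g) > 10 rule and the colour scheme are from the slide.

def traffic_light(module, known_faults, threshold=10):
    """Classify one module as red, yellow or green."""
    if module["name"] in known_faults:    # faults detected: link to fault database
        return "red"
    if module["v_g"] > threshold:         # faults predicted from past experience
        return "yellow"
    return "green"                        # no faults known or predicted

modules = [{"name": "parse.c", "v_g": 4},
           {"name": "sched.c", "v_g": 23},
           {"name": "io.c",    "v_g": 7}]
known_faults = {"io.c"}

for m in modules:
    print(m["name"], traffic_light(m, known_faults))
```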

  9. Static code metrics for defect detection = a very bad idea?
  • Better idea: model-based methods that study the deep semantics of the code
    • E.g. Heimdahl, Menzies, Owen, et al.
    • E.g. Owen, Menzies
  • But model-based methods have a high cost
  • How about cheaper alternatives?

  10. Static code metrics for defect detection = a very bad idea? (used MDP data)
  • Shephard & Ince:
    • "… (cyclomatic complexity) based upon poor theoretical foundations and an inadequate model of software development"
    • "… for a large class of software it is no more than a proxy for, and in many cases outperformed by, lines of code."
  • High utility of "mere" LOC also seen by: Chapman and Solomon (2003)
  • Folks wasting their time: Porter and Selby (1990); Tian and Zelkowitz (1995); Khoshgoftaar and Allen (2001); Lan Goa & Cukic (2003); Menzies et al. (2003); etc.
  [Diagram from Ammar, Menzies, Nikora, 2003: all attributes feed a learner (C4.5, Naïve Bayes, magic), which yields decisions over some attributes]

  11. Shephard & Ince: simple LOC out-performs others (according to "correlation") (used MDP data)
  • "Correlation" is not "decision"
    • Correlation: defects = 0.0164 + 0.0114*LOC
    • Classification: how often does the theory correctly classify correct/incorrect examples? E.g. (0.0164 + 0.0114*LOC) >= 1
  • Learning classifiers is different from learning correlations
  • Best classifiers found by [Ammar, Menzies, Nikora 2003] did NOT use LOC:
    • KC2: ev(g) >= 4.99
    • JM1: unique operands >= 60.48
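To make the correlation-versus-classification distinction concrete, here is a small sketch that scores the slide's LSR theory (defects = 0.0164 + 0.0114*LOC) both ways. The (LOC, defects) pairs are invented for illustration, and the correlation helper needs Python 3.10+.

```python
# Contrast "correlation" with "classification" for the LSR theory
#   defects = 0.0164 + 0.0114 * LOC
# The (loc, defects) data below are hypothetical.
from statistics import correlation   # Python 3.10+

loc     = [10, 50, 120, 200, 35, 300, 80]
defects = [ 0,  0,   2,   1,  1,   4,  0]

predicted = [0.0164 + 0.0114 * x for x in loc]

# Correlation: how closely the theory tracks the defect counts.
r = correlation(predicted, defects)

# Classification: threshold the same theory (>= 1 means "defective")
# and count how often it agrees with the actual defective/clean label.
hits = sum((p >= 1) == (d > 0) for p, d in zip(predicted, defects))
acc = hits / len(loc)

print(f"correlation = {r:.2f}, classification accuracy = {acc:.0%}")
```

The same theory can score well on one measure and poorly on the other, which is why the best classifiers need not use LOC at all.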

  12. Accuracy, like correlation, can miss vital features [Ammar, Menzies, Nikora 2003] (used MDP data)
  • Astonishingly few metrics are required to generate accurate defect detectors
  • Detector1 (LSR: LOC): 0.0164 + 0.0114*LOC
    • correlation(0.0164 + 0.0114*LOC) = 0.66%
    • classification((0.0164 + 0.0114*LOC) > 1) = 80%
  • Detector2 (LSR: McCabes): 0.0216 + 0.0954*v(g) - 0.109*ev(g) + 0.0598*iv(g)
  • Detector3 (LSR: Halstead): 0.00892 - 0.00432*uniqOp + 0.0147*uniqOpnd - 0.01*totalOp + 0.0225*totalOpnd
  • Same accuracy/correlations; different detection and false alarm rates
  • Definitions (A = true negatives, B = missed defects, C = false alarms, D = detected defects; locX = lines of code in cell X):
    • Accuracy = (A+D)/(A+B+C+D)
    • False alarm = PF = C/(A+C)
    • Got it right = PD = D/(B+D)
    • %effort = (locC+locD) / (locA+locB+locC+locD)
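A minimal sketch of the four measures defined above, computed directly from the A, B, C, D cell counts and their lines of code; the example numbers passed to it are invented.

```python
# Accuracy, PF, PD and %effort from the confusion-matrix cells used in
# this talk (A = true negatives, B = missed defects, C = false alarms,
# D = detected defects; locX = lines of code falling in cell X).

def assess(A, B, C, D, locA, locB, locC, locD):
    return {
        "accuracy": (A + D) / (A + B + C + D),
        "pf":       C / (A + C),    # false alarm rate
        "pd":       D / (B + D),    # probability of detection
        "effort":   (locC + locD) / (locA + locB + locC + locD),
    }

# Invented example: 100 modules, 10 of them defective.
print(assess(A=80, B=5, C=10, D=5,
             locA=4000, locB=300, locC=900, locD=400))
```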

  13. Don't assess defect detectors on just accuracy (used MDP data)
  • [Menzies, Di Stefano, Ammar, McGill, Callis, Chapman, Davis, 2003]: a study of 300+ detectors generated from MDP logs
  • Sailing on the smooth sea of accuracy, with many unseen rocks below
  • "Rocks" = huge variations in false alarm / detection / effort probabilities
  [Plots: accuracy and %effort against PD and PF for the detectors; metric definitions as on slide 12]

  14. Not "one ring to rule them all" (e.g. v(g) ≥ 10)
  • Detectors should be tuned to the needs of your domain:
    • Risk-averse projects: high PD, not very high PFs
    • Cost-averse projects (low PF): don't waste time chasing false alarms
    • New-relationship projects (low PF): don't tell the client anything dumb
    • Time-constrained projects (low effort): limited time for inspections
    • Writing coding standards: categorical statements
  [Plot: accuracy, %effort, PD, PF trade-offs; metric definitions as on slide 12]

  15. IQ = automatically exploring all detector trade-offs
  • Setup:
    • Decide your evaluation criteria, each scaled 0 <= criterion <= 1 ("good" = 1, "bad" = 0); e.g. pf, pd, effort, cost, support, …
    • An interesting detector is optimal on some pair of criteria
    • Quickly generate a lot of detectors
  • Repeat:
    • Add in three "artificial detectors"
    • Draw a lasso around all the points; i.e. compute the convex hull
    • Forget the un-interesting detectors (inside the hull)
    • Generate combinations (conjunctions) of the detectors on the hull
  • Until the hull stops growing towards the "sweet spot" = <1,1> (e.g. PD vs 1 - effort)
  • IQ = iterative quick hull
  • For N > 2 criteria:
    • Draw the hull for N!/(2!*(N-2)!) = N*(N-1)/2 pairs of criteria
    • A detector is interesting if it appears on the hull in any of these pairs
  • Result: a few, very interesting, detectors
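Below is a much-simplified sketch of the hull step for two criteria (pd and 1 - effort, both rescaled so that 1 is good). It uses scipy's ConvexHull rather than the authors' own quick-hull code, the detector names and scores are invented, and it shows a single "lasso" round rather than the full iterative loop.

```python
# One simplified "lasso" round of IQ: keep only the detectors that sit on
# the convex hull of two criteria.  scipy stands in for the authors'
# quick-hull implementation; the scores below are invented.
import numpy as np
from scipy.spatial import ConvexHull

detectors = {            # name -> (pd, 1 - effort), where 1 = good
    "d1": (0.24, 0.80),
    "d2": (0.45, 0.60),
    "d3": (0.62, 0.48),
    "d4": (0.30, 0.55),  # dominated, so it should fall inside the hull
    "d5": (0.10, 0.20),
}

# Three "artificial detectors" pinning down the worst corners, so that the
# hull's outer face grows towards the <1,1> sweet spot.
artificial = {"a00": (0.0, 0.0), "a10": (1.0, 0.0), "a01": (0.0, 1.0)}

names  = list(detectors) + list(artificial)
points = np.array(list(detectors.values()) + list(artificial.values()))

hull = ConvexHull(points)
interesting = {names[i] for i in hull.vertices} - set(artificial)
print("interesting detectors:", sorted(interesting))
```

In the full algorithm the surviving detectors would then be combined into conjunctions, re-plotted, and the hull recomputed until it stops moving towards <1,1>.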

  16. IQ generates a wide range of interesting detectors (learnt by decision trees and Naïve Bayes; used MDP data)
  • Example detectors found on the hull:
    • L >= 0.044 and Uniq_Opnd >= 25.80 and Total_Op >= 22.03 and Total_Opnd >= 37.16
    • Branch_Count >= 3.30 and N >= 6.22 and L >= 0.044
    • L >= 0.044 and N >= 6.2
  • Each detector trades off accuracy, pd, pf and effort differently (the reported values span acc 20-80%, pd 21-62%, pf 5-90%, effort 20-52%)
  [Plots: pairwise trade-offs among pd, pf, effort, accuracy and %effort for the learnt detectors]
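To show how such a learnt conjunction would be applied, here is a tiny sketch using the third rule above (L >= 0.044 and N >= 6.2); the module's metric values are invented.

```python
# Apply one of the learnt conjunction detectors from this slide:
#   L >= 0.044 and N >= 6.2
# (L and N are Halstead-style measures taken from the static code metrics.)
# The module values below are invented.

def rule_fires(metrics):
    return metrics["L"] >= 0.044 and metrics["N"] >= 6.2

module = {"L": 0.06, "N": 41.0}
print("predict defective:", rule_fires(module))
```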

  17. Future work (uses MDP data)
  • Is it always domain-specific tuning?
    • Lan Goa & Cukic (2003): applying detectors learnt from project1 to project2
    • The "DEFECT 1000" (Menzies, Massey, et al., 2004+): for 1000 projects, learn defect detectors and look for repeated patterns
  • But what if there are no repeated detectors? Is tuning always domain-specific? No general rules?
  • What advice do we give projects?
    • Collect your defect logs
    • Define your local assessment criteria
    • Learn, until conclusions stabilize
    • Check old conclusions against new

  18. See more! Learn more! Tell more! Conclusions (1)
  • Thanks to NASA's Metrics Data Program, we are:
    • seeing more,
    • learning more,
    • telling more (this talk)
  • Not "one ring to rule them all"; e.g. "v(g) >= 10"
  • Prior pessimism about defect detectors was premature
  • Correlation and classification accuracy: not enough
  • The flat sea of accuracy and the savage rocks below
  [Plot: accuracy, %effort, PD, PF trade-offs]

  19. Conclusions (2)
  • Defect detection is subtler than we'd thought: effort, pf, pd, …
  • Standard data miners are insensitive to these subtleties
  • IQ: a new data miner
    • A level playing field to compare different techniques: entropy vs IR vs DS vs McCabes vs IQ vs …
  • Show me the data!
  [Plots: pairwise trade-offs among pd, pf, effort, accuracy and %effort]

  20. Conclusions (3): Use MDP data (http://mdp.ivv.nasa.gov/)

  21. Supplemental material

  22. Yet more (assessment criteria)
  • Accuracy, PD, PF, effort
  • Distance to some user goal; e.g. effort = 25%
  • Cost: e.g. a McCabes license = $50K/year
  • Precision: D/(C+D)
  • Support: (C+D)/(A+B+C+D)
  • External validity
  • Stability: N-way cross-val (note: support can predict for stability)
  • Generality: same result in multiple projects
  • Lift: change in weighted sum of classes
  • Etc.
  • Problem: can't do it using standard data miners
  • Definitions: Accuracy = (A+D)/(A+B+C+D); False alarm = PF = C/(A+C); Got it right = PD = D/(B+D); %effort = (locC+locD) / (locA+locB+locC+locD)
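Precision and support extend the earlier confusion-matrix sketch in the obvious way; as before, the example counts are invented.

```python
# Precision and support from the same A, B, C, D confusion-matrix cells
# used throughout this talk.  Example counts are invented.

def precision(A, B, C, D):
    return D / (C + D)                 # fraction of flagged modules truly defective

def support(A, B, C, D):
    return (C + D) / (A + B + C + D)   # fraction of all modules the detector flags

print(precision(A=80, B=5, C=10, D=5), support(A=80, B=5, C=10, D=5))
```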

  23. Standard data miners optimize for accuracy and miss interesting options (used MDP data; 10-way cross-validation)
  [Plots: accuracy, PD and PF for detectors learnt by standard data miners]

  24. Project1: Learning what is a "bad" requirement (uses MDP data)
  • What features of requirements (written in English) predict for defects (in "C")?
  • What features do we extract from English text?
    • ARM phrases: weak (e.g. "maybe"); strong (e.g. "must"); continuations (e.g. "and", "but", …)
    • Porter's algorithm: stemming
    • LEG, total/unique: words, verbs, nouns, dull words (e.g. "a", "be", …), interesting words (i.e. the data dictionary), etc.
    • WORDNET: how to find nouns, verbs, synonyms
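A sketch of the kind of shallow text features this slide lists, assuming NLTK is installed for Porter stemming; the weak/strong/continuation word lists are abbreviated illustrations rather than the full ARM phrase lists, and the WordNet step is omitted.

```python
# Shallow features from an English requirement, in the spirit of the
# ARM-style phrase counts and Porter stemming on this slide.  The word
# lists are abbreviated examples only.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
WEAK   = {"may", "maybe", "might", "could"}   # example "weak" phrases
STRONG = {"must", "shall", "will"}            # example "strong" phrases
CONT   = {"and", "but", "or"}                 # example continuations

def features(requirement):
    words = re.findall(r"[a-z]+", requirement.lower())
    return {
        "weak":          sum(w in WEAK for w in words),
        "strong":        sum(w in STRONG for w in words),
        "continuations": sum(w in CONT for w in words),
        "total_words":   len(words),
        "unique_stems":  len({stemmer.stem(w) for w in words}),
    }

print(features("The system shall log errors and may retry, but must not crash."))
```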

  25. PACE: a browser for detector trade-offs (Phillip Merritt, Aaron Wilson)

  26. Project2: Learning "holes" and "poles" in models
  • Requires: detailed knowledge of internals
  • Requires: assessment criteria of outputs
  [Diagram: a data miner in the loop]

  27. STEREO = "Solar Terrestrial Relations Observatory"
