See More, Learn More, Tell More


Presentation Transcript


  1. See More, Learn More, Tell More
  Tim Menzies, West Virginia University, tim@menzies.us
  Galaxy Global: Robert (Mike) Chapman, Justin Di Stefano
  NASA IV&V: Kenneth McGill, Pat Callis
  WVU: Kareem Ammar
  JPL: Allen Nikora
  DN American: John Davis

  2. See more! Learn more! Tell more!
  • Important: transition to the broader NASA software community
  • What's unique about OSMA research?

  3. Old dialogue: "v(G) > 10 is a good thing"
  • New dialogue:
    • A core method in my analysis was M1
    • I have compared M1 to M2
    • On NASA-related data
    • Using criteria C1
    • I argue for the merits of C1 as follows (possibly via a discussion of C2, C3, …)
    • I endorse/reject M1 because of that comparison
  Show me the money! Show me the data!

  4. The IV&V Holy Grail
  • Learn "stuff" in the early lifecycle that would lead to late-lifecycle errors
  • Actions:
    • change that stuff now, OR
    • plan more IV&V on that stuff
  [Diagram: artifacts flow from English text to bubbles (e.g. UML, block diagrams) to code to traces to issues, with reviewer reactions ranging from "Better watch out!" to "Hey, that's funny"]

  5. The IV&V metrics repository (http://mdp.ivv.nasa.gov/)
  • Galaxy Global (P.I. = Robert (Mike) Chapman); NASA P.O.C. = Pat Callis
  • Cost: $0
  • Currently:
    • Mostly code metrics on a small number of projects
    • Defect fields and static code measures
    • Also, for C++ code, some class-based metrics
  • Real soon: requirements mapped to code functions, plus defect logs (for only one project)
  • Some time in the near future: as above, for more projects

  6. Repositories or sarcophagi? (Use it or lose it!)
  [Figure: a data sarcophagus versus an active data repository]

  7. Who's using MDP data?
  Ammar, Kareem; Callis, Pat; Chapman, Mike; Cukic, Bojan; Davis, John; Di Stefano, Justin; Goa, Lan; McGill, Kenneth; Menzies, Tim; Dekhtyar, Alex; Hayes, Jane; Merritt, Phillip; Nikora, Allen; Orrego, Andres; Wallace, Dolores; Wilson, Aaron

  8. Project2: Learn defect detectors from code (used MDP data)
  • Not perfect predictors: hints that let us focus our effort
  • Example: v(g) = McCabe's cyclomatic complexity = pathways through a function; if v(g) > 10 then predict defects
  • Traffic light browser:
    • Green: no faults known/predicted
    • Red: faults detected, i.e. a link to a fault database
    • Yellow: faults predicted, based on past experience, learnt via data mining
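The following is a minimal sketch of the traffic-light idea on this slide. The module list, its field names, and the known_faults set are invented for illustration; only the "if v(g) > 10 then predict defects" rule and the green/yellow/red scheme come from the slide.

```python
# Traffic-light browser sketch.  The modules list and known_faults set are
# hypothetical; the v(g) > 10 rule and the colour scheme are from the slide.

def traffic_light(module, known_faults, threshold=10):
    """Classify one module as red, yellow or green."""
    if module["name"] in known_faults:    # faults detected: link to fault database
        return "red"
    if module["v_g"] > threshold:         # faults predicted from past experience
        return "yellow"
    return "green"                        # no faults known or predicted

modules = [{"name": "parse.c", "v_g": 4},
           {"name": "sched.c", "v_g": 23},
           {"name": "io.c",    "v_g": 7}]
known_faults = {"io.c"}

for m in modules:
    print(m["name"], traffic_light(m, known_faults))
```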

  9. Static code metrics for defect detection = a very bad idea?
  • Better idea: model-based methods that study the deep semantics of the code
    • E.g. Heimdahl, Menzies, Owen, et al.
    • E.g. Owen, Menzies
  • But model-based methods have a high cost
  • How about cheaper alternatives?

  10. Static code metrics for defect detection = a very bad idea? (used MDP data)
  • Shephard & Ince:
    • "… (cyclomatic complexity) based upon poor theoretical foundations and an inadequate model of software development"
    • "… for a large class of software it is no more than a proxy for, and in many cases outperformed by, lines of code."
  • High utility of "mere" LOC also seen by: Chapman and Solomon (2003)
  • Folks wasting their time: Porter and Selby (1990); Tian and Zelkowitz (1995); Khoshgoftaar and Allen (2001); Lan Goa & Cukic (2003); Menzies et al. (2003); etc.
  [Diagram from Ammar, Menzies, Nikora, 2003: all attributes feed a learner (C4.5, Naïve Bayes, magic), which yields decisions over some attributes]

  11. Shephard & Ince: simple LOC out-performs others (according to "correlation") (used MDP data)
  • "Correlation" is not "decision"
    • Correlation: defects = 0.0164 + 0.0114*LOC
    • Classification: how often does the theory correctly classify correct/incorrect examples? E.g. (0.0164 + 0.0114*LOC) >= 1
  • Learning classifiers is different from learning correlations
  • Best classifiers found by [Ammar, Menzies, Nikora 2003] did NOT use LOC:
    • KC2: ev(g) >= 4.99
    • JM1: unique operands >= 60.48
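To make the correlation-versus-classification distinction concrete, here is a small sketch that scores the slide's LSR theory (defects = 0.0164 + 0.0114*LOC) both ways. The (LOC, defects) pairs are invented for illustration, and the correlation helper needs Python 3.10+.

```python
# Contrast "correlation" with "classification" for the LSR theory
#   defects = 0.0164 + 0.0114 * LOC
# The (loc, defects) data below are hypothetical.
from statistics import correlation   # Python 3.10+

loc     = [10, 50, 120, 200, 35, 300, 80]
defects = [ 0,  0,   2,   1,  1,   4,  0]

predicted = [0.0164 + 0.0114 * x for x in loc]

# Correlation: how closely the theory tracks the defect counts.
r = correlation(predicted, defects)

# Classification: threshold the same theory (>= 1 means "defective")
# and count how often it agrees with the actual defective/clean label.
hits = sum((p >= 1) == (d > 0) for p, d in zip(predicted, defects))
acc = hits / len(loc)

print(f"correlation = {r:.2f}, classification accuracy = {acc:.0%}")
```

The same theory can score well on one measure and poorly on the other, which is why the best classifiers need not use LOC at all.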

  12. Accuracy, like correlation, can miss vital features [Ammar, Menzies, Nikora 2003] (used MDP data)
  • Astonishingly few metrics are required to generate accurate defect detectors
  • Detector1 (LSR: LOC): 0.0164 + 0.0114*LOC
    • correlation(0.0164 + 0.0114*LOC) = 0.66%
    • classification((0.0164 + 0.0114*LOC) > 1) = 80%
  • Detector2 (LSR: McCabes): 0.0216 + 0.0954*v(g) - 0.109*ev(g) + 0.0598*iv(g)
  • Detector3 (LSR: Halstead): 0.00892 - 0.00432*uniqOp + 0.0147*uniqOpnd - 0.01*totalOp + 0.0225*totalOpnd
  • Same accuracy/correlations; different detection and false alarm rates
  • Definitions (A = true negatives, B = missed defects, C = false alarms, D = detected defects; locX = lines of code in cell X):
    • Accuracy = (A+D)/(A+B+C+D)
    • False alarm = PF = C/(A+C)
    • Got it right = PD = D/(B+D)
    • %effort = (locC+locD) / (locA+locB+locC+locD)
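A minimal sketch of the four measures defined above, computed directly from the A, B, C, D cell counts and their lines of code; the example numbers passed to it are invented.

```python
# Accuracy, PF, PD and %effort from the confusion-matrix cells used in
# this talk (A = true negatives, B = missed defects, C = false alarms,
# D = detected defects; locX = lines of code falling in cell X).

def assess(A, B, C, D, locA, locB, locC, locD):
    return {
        "accuracy": (A + D) / (A + B + C + D),
        "pf":       C / (A + C),    # false alarm rate
        "pd":       D / (B + D),    # probability of detection
        "effort":   (locC + locD) / (locA + locB + locC + locD),
    }

# Invented example: 100 modules, 10 of them defective.
print(assess(A=80, B=5, C=10, D=5,
             locA=4000, locB=300, locC=900, locD=400))
```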

  13. Don't assess defect detectors on just accuracy (used MDP data)
  • [Menzies, Di Stefano, Ammar, McGill, Callis, Chapman, Davis, 2003]: a study of 300+ detectors generated from MDP logs
  • Sailing on the smooth sea of accuracy, with many unseen rocks below
  • "Rocks" = huge variations in false alarm / detection / effort probabilities
  [Plots: accuracy and %effort against PD and PF for the detectors; metric definitions as on slide 12]

  14. Not "one ring to rule them all" (e.g. v(g) ≥ 10)
  • Detectors should be tuned to the needs of your domain:
    • Risk-averse projects: high PD, not very high PFs
    • Cost-averse projects (low PF): don't waste time chasing false alarms
    • New-relationship projects (low PF): don't tell the client anything dumb
    • Time-constrained projects (low effort): limited time for inspections
    • Writing coding standards: categorical statements
  [Plot: accuracy, %effort, PD, PF trade-offs; metric definitions as on slide 12]

  15. IQ = automatically exploring all detector trade-offs
  • Setup:
    • Decide your evaluation criteria, each scaled 0 <= criterion <= 1 ("good" = 1, "bad" = 0); e.g. pf, pd, effort, cost, support, …
    • An interesting detector is optimal on some pair of criteria
    • Quickly generate a lot of detectors
  • Repeat:
    • Add in three "artificial detectors"
    • Draw a lasso around all the points; i.e. compute the convex hull
    • Forget the un-interesting detectors (inside the hull)
    • Generate combinations (conjunctions) of the detectors on the hull
  • Until the hull stops growing towards the "sweet spot" = <1,1> (e.g. PD vs 1 - effort)
  • IQ = iterative quick hull
  • For N > 2 criteria:
    • Draw the hull for N!/(2!*(N-2)!) = N*(N-1)/2 pairs of criteria
    • A detector is interesting if it appears on the hull in any of these pairs
  • Result: a few, very interesting, detectors
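Below is a much-simplified sketch of the hull step for two criteria (pd and 1 - effort, both rescaled so that 1 is good). It uses scipy's ConvexHull rather than the authors' own quick-hull code, the detector names and scores are invented, and it shows a single "lasso" round rather than the full iterative loop.

```python
# One simplified "lasso" round of IQ: keep only the detectors that sit on
# the convex hull of two criteria.  scipy stands in for the authors'
# quick-hull implementation; the scores below are invented.
import numpy as np
from scipy.spatial import ConvexHull

detectors = {            # name -> (pd, 1 - effort), where 1 = good
    "d1": (0.24, 0.80),
    "d2": (0.45, 0.60),
    "d3": (0.62, 0.48),
    "d4": (0.30, 0.55),  # dominated, so it should fall inside the hull
    "d5": (0.10, 0.20),
}

# Three "artificial detectors" pinning down the worst corners, so that the
# hull's outer face grows towards the <1,1> sweet spot.
artificial = {"a00": (0.0, 0.0), "a10": (1.0, 0.0), "a01": (0.0, 1.0)}

names  = list(detectors) + list(artificial)
points = np.array(list(detectors.values()) + list(artificial.values()))

hull = ConvexHull(points)
interesting = {names[i] for i in hull.vertices} - set(artificial)
print("interesting detectors:", sorted(interesting))
```

In the full algorithm the surviving detectors would then be combined into conjunctions, re-plotted, and the hull recomputed until it stops moving towards <1,1>.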

  16. IQ generates a wide range of interesting detectors (learnt by decision trees and Naïve Bayes; used MDP data)
  • Example detectors found on the hull:
    • L >= 0.044 and Uniq_Opnd >= 25.80 and Total_Op >= 22.03 and Total_Opnd >= 37.16
    • Branch_Count >= 3.30 and N >= 6.22 and L >= 0.044
    • L >= 0.044 and N >= 6.2
  • Each detector trades off accuracy, pd, pf and effort differently (the reported values span acc 20-80%, pd 21-62%, pf 5-90%, effort 20-52%)
  [Plots: pairwise trade-offs among pd, pf, effort, accuracy and %effort for the learnt detectors]
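To show how such a learnt conjunction would be applied, here is a tiny sketch using the third rule above (L >= 0.044 and N >= 6.2); the module's metric values are invented.

```python
# Apply one of the learnt conjunction detectors from this slide:
#   L >= 0.044 and N >= 6.2
# (L and N are Halstead-style measures taken from the static code metrics.)
# The module values below are invented.

def rule_fires(metrics):
    return metrics["L"] >= 0.044 and metrics["N"] >= 6.2

module = {"L": 0.06, "N": 41.0}
print("predict defective:", rule_fires(module))
```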

  17. Future work (uses MDP data)
  • Is it always domain-specific tuning?
    • Lan Goa & Cukic (2003): applying detectors learnt from project1 to project2
    • The "DEFECT 1000" (Menzies, Massey, et al., 2004+): for 1000 projects, learn defect detectors and look for repeated patterns
  • But what if there are no repeated detectors? Is tuning always domain-specific? No general rules?
  • What advice do we give projects?
    • Collect your defect logs
    • Define your local assessment criteria
    • Learn, until conclusions stabilize
    • Check old conclusions against new

  18. See more! Learn more! Tell more! Conclusions (1)
  • Thanks to NASA's Metrics Data Program, we are:
    • seeing more,
    • learning more,
    • telling more (this talk)
  • Not "one ring to rule them all"; e.g. "v(g) >= 10"
  • Prior pessimism about defect detectors was premature
  • Correlation and classification accuracy: not enough
  • The flat sea of accuracy and the savage rocks below
  [Plot: accuracy, %effort, PD, PF trade-offs]

  19. Conclusions (2)
  • Defect detection is subtler than we'd thought: effort, pf, pd, …
  • Standard data miners are insensitive to these subtleties
  • IQ: a new data miner
    • A level playing field to compare different techniques: entropy vs IR vs DS vs McCabes vs IQ vs …
  • Show me the data!
  [Plots: pairwise trade-offs among pd, pf, effort, accuracy and %effort]

  20. Conclusions (3): Use MDP data (http://mdp.ivv.nasa.gov/)

  21. Supplemental material

  22. Yet more (assessment criteria)
  • Accuracy, PD, PF, effort
  • Distance to some user goal; e.g. effort = 25%
  • Cost: e.g. a McCabes license = $50K/year
  • Precision: D/(C+D)
  • Support: (C+D)/(A+B+C+D)
  • External validity
  • Stability: N-way cross-val (note: support can predict for stability)
  • Generality: same result in multiple projects
  • Lift: change in weighted sum of classes
  • Etc.
  • Problem: can't do it using standard data miners
  • Definitions: Accuracy = (A+D)/(A+B+C+D); False alarm = PF = C/(A+C); Got it right = PD = D/(B+D); %effort = (locC+locD) / (locA+locB+locC+locD)
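Precision and support extend the earlier confusion-matrix sketch in the obvious way; as before, the example counts are invented.

```python
# Precision and support from the same A, B, C, D confusion-matrix cells
# used throughout this talk.  Example counts are invented.

def precision(A, B, C, D):
    return D / (C + D)                 # fraction of flagged modules truly defective

def support(A, B, C, D):
    return (C + D) / (A + B + C + D)   # fraction of all modules the detector flags

print(precision(A=80, B=5, C=10, D=5), support(A=80, B=5, C=10, D=5))
```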

  23. Standard data miners optimize for accuracy and miss interesting options (used MDP data; 10-way cross-validation)
  [Plots: accuracy, PD and PF for detectors learnt by standard data miners]

  24. Project1: Learning what is a "bad" requirement (uses MDP data)
  • What features of requirements (written in English) predict for defects (in "C")?
  • What features do we extract from English text?
    • ARM phrases: weak (e.g. "maybe"); strong (e.g. "must"); continuations (e.g. "and", "but", …)
    • Porter's algorithm: stemming
    • LEG, total/unique: words, verbs, nouns, dull words (e.g. "a", "be", …), interesting words (i.e. the data dictionary), etc.
    • WORDNET: how to find nouns, verbs, synonyms
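A sketch of the kind of shallow text features this slide lists, assuming NLTK is installed for Porter stemming; the weak/strong/continuation word lists are abbreviated illustrations rather than the full ARM phrase lists, and the WordNet step is omitted.

```python
# Shallow features from an English requirement, in the spirit of the
# ARM-style phrase counts and Porter stemming on this slide.  The word
# lists are abbreviated examples only.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
WEAK   = {"may", "maybe", "might", "could"}   # example "weak" phrases
STRONG = {"must", "shall", "will"}            # example "strong" phrases
CONT   = {"and", "but", "or"}                 # example continuations

def features(requirement):
    words = re.findall(r"[a-z]+", requirement.lower())
    return {
        "weak":          sum(w in WEAK for w in words),
        "strong":        sum(w in STRONG for w in words),
        "continuations": sum(w in CONT for w in words),
        "total_words":   len(words),
        "unique_stems":  len({stemmer.stem(w) for w in words}),
    }

print(features("The system shall log errors and may retry, but must not crash."))
```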

  25. PACE: a browser for detector trade-offs (Phillip Merritt, Aaron Wilson)

  26. Project2: Learning "holes" and "poles" in models
  • Requires: detailed knowledge of internals
  • Requires: assessment criteria of outputs
  [Diagram: a data miner in the loop]

  27. STEREO = "Solar Terrestrial Relations Observatory"
