Comparing Bug Finding Tools with Reviews and Tests Stefan Wagner, Jan Jurjens, Claudia Koller, and Peter Trischberger By Shankar Kumar
AGENDA • 1. Motivation • 2. Overview • 3. Bug Finding Tools • 4. Projects • 5. Approach • 6. Analysis • 7. Conclusions
1. Motivation • 2. Overview • 3. Bug Finding Tools • 4. Projects • 5. Approach • 6. Analysis • 7. Conclusions
MOTIVATION • Defect-detection techniques (reviews, tests, ...) are still the most important methods for improving software quality • Costs for such techniques are significant: according to Myers (1979), 50% of total development costs are caused by testing; Jones (1987) assigns 30–40% to quality assurance and defect removal • Optimization in this area can save money • One way is automation, another the optimal combination of techniques • Bug finding tools offer the possibility of automatic defect detection • How do such tools compare to reviews and tests?
1. Motivation • 2. Overview • 3. Bug Finding Tools • 4. Projects • 5. Approach • 6. Analysis • 7. Conclusions
OVERVIEW • Comparison of bug finding tools for Java with reviews and tests • Based on several industrial projects at O2 • Different backend applications with web interface • Analysis of • defect categories (severity) • defect types • defect removal efficiency
DEFINITION • Failures are a perceived deviation of the output values from the expected values • Faults are the causes of failures in code or other documents • Both are also referred to as defects; we mainly use the term defect in our analyses, even when no failures are involved, as with defects that only affect maintenance
PROBLEM • We address the question of how automated static analysis using bug finding tools relates to other types of defect-detection techniques and if it is thereby possible to reduce the effort for defect-detection using such tools. • In detail, this amounts to three questions. • 1. Which types and classes of defects are found by different techniques? • 2. Is there any overlap of the found defects? • 3. How large is the ratio of false positives from the tools?
RESULTS • The main findings are summarized in the following. • 1. Bug finding tools detect a subset of the defect types that can be found by a review. • 2. The types of defects that can be found by the tools can be analyzed more thoroughly, that is, the tools are better regarding the bug patterns they are programmed for. • 3. Dynamic tests find completely different defects than bug finding tools. • 4. Bug finding tools have a significant ratio of false positives. • 5. The bug finding tools show very different results in different projects.
CONSEQUENCES • The results have four major implications. • 1. Dynamic tests or reviews cannot be substituted by bug finding tools because they find significantly more and different defect types. • 2. Bug finding tools can be a good pre-stage to a review because some of the defects do not have to be manually detected. A possibility would be to mark problematic code so that it cannot be overlooked in the review. • 3. Bug finding tools can only provide a significant reduction in the effort necessary for defect-detection if their false positives ratios can be reduced. • (From our case studies, we find the current ratios to be not yet completely acceptable.) • 4. The tools have to be more tolerant regarding the programming style and design to provide more uniform results in different projects.
1. Motivation • 2. Overview • 3. Bug Finding Tools • 4. Projects • 5. Approach • 6. Analysis • 7. Conclusions
BUG FINDING TOOLS • Class of programs that aim to find defects in code • Work by static analysis, similar to a compiler • The output is warnings that a piece of code is critical in some way • Defects are identified using typical bug patterns drawn from experience and from published common pitfalls of a programming language • Coding guidelines and standards can also be checked • Some tools perform more sophisticated analysis based on data and control flow
Three representatives of bug finding tools for Java • FindBugs: uses mainly bug patterns but also data flow analysis; operates on the byte code; from the University of Maryland • PMD: operates on source code; can enforce coding standards; customizable using XPath expressions on the parse tree • QJ Pro: operates on source code; uses over 200 rules, and additional rules can be defined; also performs checks based on code metrics
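A minimal, hypothetical Java sketch (not code from the study or from any of the tools' documentation) of two bug patterns that static analyzers of this kind commonly flag: string comparison with == and a possible null dereference.

```java
import java.util.HashMap;
import java.util.Map;

public class BugPatternExample {

    // Bug pattern: comparing String objects with == instead of equals().
    // == compares object references, so the check can fail for equal content.
    static boolean isAdmin(String role) {
        return role == "admin"; // a tool would suggest "admin".equals(role)
    }

    // Bug pattern: possible null dereference.
    // Map.get() returns null for a missing key, so trim() may throw
    // a NullPointerException.
    static String normalizedName(Map<String, String> names, String key) {
        String value = names.get(key);
        return value.trim(); // flagged: value may be null here
    }

    public static void main(String[] args) {
        System.out.println(isAdmin(new String("admin"))); // prints false, illustrating the defect

        Map<String, String> names = new HashMap<>();
        names.put("a", "  Alice ");
        System.out.println(normalizedName(names, "a"));
    }
}
```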
1. Motivation • 2. Overview • 3. Bug Finding Tools • 4. Projects • 5. Approach • 6. Analysis • 7. Conclusions
PROJECTS • Four done at O2 and one at TU Munich • Project A • Online shop • Six months in use • 1066 classes, 58 KLOC • Project B • Internet payment via SMS • 215 classes • 24 KLOC • Project C • Web frontend for management system • 3 KLOC
Project D • Client data management • 572 classes • 34 KLOC • EstA • Textual requirements editor • 28 classes • 4 KLOC
1. Motivation • 2. Overview • 3. Bug Finding Tools • 4. Projects • 5. Approach • 6. Analysis • 7. Conclusions
APPROACH • Analyze the interrelations between bug finding tools, reviews, and tests • Tools and tests were applied to all projects • Review only in project C • Each warning from the tools was checked to determine whether it is really a defect (true positives vs. false positives) • True positives are warnings that are actually confirmed as defects in the code • False positives are wrong identifications of problems • The techniques were applied independently
DEFECT CATEGORIZATION • For the comparison, we use a five-step categorization of the defects by their severity • Hence, the categorization is based on the effects of the defects rather than their cause or type of occurrence in the code • We use a standard categorization for severity that is slightly adapted to the defects found in the projects • Defects in category 1 are the most severe; the ones in category 5 have the lowest severity
CATEGORIES • 1. Defects that lead to a crash of the application. • These are the most severe defects that stop the whole application from reacting to any user input. • 2. Defects that cause a logical failure. • This category consists of all defects that cause a logical failure of the application but do not crash it, for example a wrong result value. • 3. Defects with insufficient error handling. • Defects in this category are only minor and do not crash the application or result in logical failures, but are not handled properly. • 4. Defects that violate the principles of structured programming. • These are defects that normally do not impact the software but could result in performance bottlenecks etc. • 5. Defects that reduce the maintainability of the code. • This category contains all defects that only affect the readability or changeability of the software.
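As a hedged illustration (the code is made up for this summary, not taken from the study's projects), the snippet below contrasts a category 1 defect, which crashes the application, with a category 5 defect, which only hurts maintainability.

```java
import java.util.HashMap;
import java.util.Map;

public class SeverityExamples {

    // Category 1: a defect that leads to a crash. If the key "port" is
    // missing, get() returns null and length() throws a NullPointerException,
    // stopping the application.
    static int configuredPortLength(Map<String, String> config) {
        return config.get("port").length();
    }

    // Category 5: a defect that only reduces maintainability. The magic
    // number and the meaningless variable name do not change behavior,
    // but they make the code harder to read and to change.
    static double priceWithTax(double net) {
        double x = net * 1.19; // 1.19 is an unexplained magic number
        return x;
    }

    public static void main(String[] args) {
        System.out.println(priceWithTax(100.0));

        Map<String, String> config = new HashMap<>();
        config.put("host", "localhost"); // "port" is missing
        System.out.println(configuredPortLength(config)); // throws NullPointerException
    }
}
```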
ADVANTAGES • This classification helps us • compare the various defect-detection techniques based on the severity of the defects they find • analyze the types of defects that they find.
DEFECT TYPES • Orthogonal to the defect categorization • Similar to standard defect taxonomies (e.g. IEEE Std 1044-1993) • Partly a unification of the warning types of the tools • Examples • Stream is not closed • Input is not checked for special characters
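A hedged Java sketch of the two example defect types named above (illustrative code, not from the study projects): an unclosed stream together with its try-with-resources repair, and input that is not checked for special characters together with a simple whitelist check.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DefectTypeExamples {

    // Defect type "stream is not closed": if readLine() throws,
    // the reader is never closed and the file handle leaks.
    static String firstLineLeaky(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        String line = reader.readLine();
        reader.close();
        return line;
    }

    // Repaired version: try-with-resources closes the reader on all paths.
    static String firstLineSafe(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            return reader.readLine();
        }
    }

    // Defect type "input is not checked for special characters": this
    // variant rejects anything outside a simple whitelist before the
    // value is used further.
    static String sanitizedUserName(String rawInput) {
        if (rawInput == null || !rawInput.matches("[A-Za-z0-9_]{1,32}")) {
            throw new IllegalArgumentException("invalid user name");
        }
        return rawInput;
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            System.out.println(firstLineSafe(args[0]));
        }
        System.out.println(sanitizedUserName("alice_01"));
    }
}
```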
1. Motivation • 2. Overview • 3. Bug Finding Tools • 4. Projects • 5. Approach • 6. Analysis • 7. Conclusions
ANALYSIS • Bug finding tools • Bug finding tools vs. reviews • Bug finding tools vs. testing • Defect removal efficiency
BUG FINDING TOOLS • The number before the slash denotes the number of true positives, the number after the slash the number of all positives.
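As a purely hypothetical reading example (the numbers are illustrative, not taken from the study's table): an entry of 10/40 would mean 10 of 40 reported warnings were confirmed as defects, which corresponds to a false positive ratio of (40 - 10) / 40 = 75%.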
OBSERVATIONS & INTERPRETATIONS • Most of the true positives can be assigned to the category maintainability of the code • It is noticeable that the different tools predominantly find different positives • Only a single defect type was found by all tools, and four types by two tools each
FindBugs finds positives from all categories in the different systems • PMD only from the categories failure of the application, insufficient error handling, and maintainability of the code • QJ Pro only reveals failure of the application, insufficient error handling, and violation of structured programming
The number of defect types varies from tool to tool • FindBugs detects defects of 13 different types, PMD of 10 types, and QJ Pro of only 4 types • Ratio of false positives: large variation over the different projects • Recommendation: use FindBugs and PMD in combination; use PMD or QJ Pro to enforce coding standards
BFT vs. Reviews • Reviews only on project C • Revealed 19 different types of defects • All defects found by the tools also found by reviews • However, tools sometimes found more defects of a type • But also the other way round • Review can also reveal faults in the logic • Using a tool before reviewing can improve thoroughness
Review is more successful than bug finding tools, because it is able to detect far more defect types. • However, it seems to be beneficial to first use a bug finding tool before inspecting the code, so that the defects that are found by both are already removed. • This is because the automation makes it cheaper and more thorough than a manual review. • On the downside, we also notice a high number of false positives from all tools. • This results in significant non-productive work for the developers that could in some cases exceed the improvement achieved by the automation.
BFT vs. Testing • Black box and glass box tests • No stress tests • Only 8 defect types were found • Completely distinct from the ones found by the tools • Only in categories 1 to 3 • Stress tests might change this picture • Defects found by tests
For the software systems for which defects were revealed, no identical defects were found by testing and the bug finding tools • Dynamic testing is good at finding logical defects that are best visible when executing the software; bug finding tools have their strength in finding defects related to maintainability • We again recommend using both techniques in a project
DISCUSSION • BFTs mainly find maintainability-related defects, which matches expectations • Only reviews and tests are able to verify and validate the logic of the software • The limitation of the tools lies in what is expressible as bug patterns and in how good and generic these patterns are • It is still surprising that there is not a single overlapping defect • Positive: the combination is beneficial • Negative: the amount of (costly) testing cannot be reduced
However, external validity is limited by • sample size • maturity of the analyzed software • The high ratio of false positives is disillusioning • The expected benefit of automation is diminished by the necessary human intervention • However, the tools are sometimes more thorough
1. Motivation • 2. Overview • 3. Bug Finding Tools • 4. Projects • 5. Approach • 6. Analysis • 7. Conclusions
CONCLUSION • Not a comprehensive empirical study, but a first indication based on a series of projects • Provides knowledge only about the relation of bug finding tools to reviews and tests • BFTs revealed completely different defects than dynamic tests • But they reveal a subset of the defect types found by reviews • The tools are typically more thorough for the bug patterns they cover • Effectiveness depends on the personal programming style and software design • The high number of false positives makes the benefits questionable