Reliability

nickan

  1. Reliability

  2. Starter Questions Q: What does reliability have to do with social implications of computing? Q: How reliable is software? Q: What can software developers do to improve reliability? Q: Should software developers be held responsible for faulty software?

  3. Example Problems O' Plenty • Disenfranchised Voters • Patriot Missile • NASA Mars Polar Lander • 2003 Power Blackout • Therac-25

  4. Disenfranchised Voters • Florida, 2000 • Florida was the only state that paid a private company to purge the voter file of ineligible voters • approximately 8,000 voters were improperly excluded from voting • general population: 11% black • incorrectly purged from voter registration list: 88% black • Bush's certified margin over Gore was 537 votes (327 after the automatic machine recount)

  5. Patriot Missile System • On February 25, 1991, the Patriot missile battery at Dhahran, Saudi Arabia had been in operation for 100 hours, by which time the system's internal clock had drifted by one third of a second. For a target moving as fast as an inbound tactical ballistic missile (TBM), this was equivalent to a position error of about 600 meters. • The radar system had successfully detected the Scud and predicted where to look for it next, but because of the time error, it looked in the wrong part of the sky and found no missile. With no missile found, the initial detection was assumed to be a spurious track and the track was removed from the system. No interception was attempted, and the missile hit a barracks, killing 28 soldiers.
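The drift figure on this slide can be reproduced with a short calculation. This is a sketch under assumptions that the slide does not state: that the clock counted tenths of a second, that the constant 0.1 was chopped to 23 binary fractional digits, and that the Scud closed at roughly 1,676 m/s. All names and constants below are illustrative.

```python
from fractions import Fraction

# Assumptions for illustration only (not stated in the slides):
FRACTION_BITS = 23          # binary fractional digits kept for the constant 0.1
TICKS_PER_SECOND = 10       # the clock counts tenths of a second
SCUD_SPEED_M_PER_S = 1676   # approximate closing speed of the target


def truncated_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return 0.1 chopped (not rounded) to `bits` binary fractional digits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(hours_up: int) -> float:
    """Accumulated clock error after `hours_up` hours of continuous operation."""
    ticks = hours_up * 3600 * TICKS_PER_SECOND
    error_per_tick = Fraction(1, 10) - truncated_tenth()
    return float(ticks * error_per_tick)


if __name__ == "__main__":
    drift = clock_drift_seconds(100)
    print(f"drift after 100 h: {drift:.4f} s")
    print(f"position error:    {drift * SCUD_SPEED_M_PER_S:.0f} m")
```

Under these assumptions the error per tick is tiny, but it is multiplied by 3.6 million ticks, which is why the system only failed after running for days without a restart.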

  6. Mars Polar Lander • The last telemetry from Mars Polar Lander was sent just prior to atmospheric entry on December 3, 1999. No further signals have been received from the lander. The cause of this loss of communication is unknown. • According to the investigation that followed, the most likely cause of the failure of the mission was a software error that mistook the vibration caused by the deployment of the lander's legs for the vehicle touching down on the Martian surface. As a result, the vehicle's descent engines were cut off while it was still 40 meters above the surface, rather than on touchdown as planned. • Another possible reason for the failure was inadequate preheating of the catalyst beds for the pulsing rocket thrusters.
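The suspected failure mode can be sketched as a latched sensor flag that is never cleared. This is a toy model, not the flight software: the class, the descent profile, and the 1,500 m leg-deployment altitude are assumptions made for illustration; only the 40 m cutoff comes from the slide.

```python
class TouchdownLogic:
    """Toy model of a latched touchdown indicator (hypothetical design)."""

    def __init__(self, clear_latch_on_enable: bool):
        self.clear_latch_on_enable = clear_latch_on_enable
        self.latched = False   # remembers any sensor pulse ever seen
        self.enabled = False   # touchdown sensing is armed near the surface

    def sample(self, leg_sensor_active: bool) -> None:
        if leg_sensor_active:
            self.latched = True

    def enable_sensing(self) -> None:
        if self.clear_latch_on_enable:
            self.latched = False   # discard pulses seen before arming
        self.enabled = True

    def engines_should_cut(self) -> bool:
        return self.enabled and self.latched


def cutoff_altitude(clear_latch_on_enable: bool) -> int:
    """Return the altitude (m) at which the engines are cut in this toy descent."""
    # (altitude in meters, leg sensor reading); values are illustrative.
    profile = [(1500, True),   # leg deployment shakes the sensor
               (1000, False),
               (40, False),    # sensing is armed here
               (10, False),
               (0, True)]      # real touchdown
    logic = TouchdownLogic(clear_latch_on_enable)
    for altitude, sensor in profile:
        logic.sample(sensor)
        if altitude <= 40 and not logic.enabled:
            logic.enable_sensing()
        if logic.engines_should_cut():
            return altitude
    raise RuntimeError("engines never cut")


if __name__ == "__main__":
    print("without clearing the latch:", cutoff_altitude(False), "m")
    print("clearing the latch on arm: ", cutoff_altitude(True), "m")
```

In this model the only difference between the two runs is a single assignment that resets the flag when sensing is armed.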

  7. Ariane 5 Rocket • June 4, 1996 was the first test flight of the Ariane 5 launch system. The rocket tore itself apart 37 seconds after launch, making the fault one of the most expensive computer bugs in history. • The Ariane 5 software reused the specifications from the Ariane 4, but the Ariane 5's flight path was considerably different and beyond the range for which the reused code had been designed. Specifically, the Ariane 5's greater acceleration caused the back-up and primary inertial guidance computers to crash, after which the launcher's nozzles were directed by spurious data. Pre-flight tests had never been performed on the re-alignment code under simulated Ariane 5 flight conditions, so the error was not discovered before launch. • Because of the different flight path, a data conversion from a 64-bit floating point to 16-bit signed integer caused a hardware exception (more specifically, an arithmetic overflow, as the floating point number had a value too large to be represented by a 16-bit signed integer). Efficiency considerations had led to the disabling of the exception handler for this error. This led to a cascade of problems, culminating in destruction of the entire flight.
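The conversion described in the last bullet can be guarded in software. The sketch below is not the Ariane code (which was not written in Python); the function names are hypothetical, and it only shows two common ways to handle a float that does not fit in a 16-bit signed integer: raise an explicit error, or saturate at the nearest representable value.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


class ConversionOverflow(Exception):
    """Raised when a float cannot be represented as a 16-bit signed integer."""


def to_int16_checked(value: float) -> int:
    """Convert to int16, failing loudly if the value is out of range."""
    if math.isnan(value) or not (INT16_MIN <= value <= INT16_MAX):
        raise ConversionOverflow(f"{value!r} does not fit in 16 bits")
    return int(value)  # truncates toward zero


def to_int16_saturating(value: float) -> int:
    """Convert to int16, clamping out-of-range values to the nearest limit."""
    if math.isnan(value):
        raise ConversionOverflow("NaN has no integer representation")
    return int(max(INT16_MIN, min(INT16_MAX, value)))


if __name__ == "__main__":
    print(to_int16_checked(1234.7))
    print(to_int16_saturating(1.0e6))
    try:
        to_int16_checked(40000.0)
    except ConversionOverflow as exc:
        print("overflow caught:", exc)
```

Which policy is right depends on the system: saturation keeps the computation running with a bounded value, while an exception is only useful if something is in place to handle it.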

  8. 2003 North America Blackout August 14, 2003
     • 12:15 p.m. Inaccurate data input renders a system monitoring tool in Ohio ineffective.
     • 1:31 p.m. The Eastlake, Ohio, generating plant shuts down.
     • 2:02 p.m. First 345-kV line in Ohio fails due to contact with a tree in Walton Hills, Ohio.
     • 2:14 p.m. An alarm system fails at FirstEnergy's control room and is not repaired.
     • 2:27 p.m. Second 345-kV line fails due to a tree.
     • 3:05 p.m. A 345-kV transmission line fails in Parma, south of Cleveland, due to a tree.
     • 3:17 p.m. Voltage dips temporarily on the Ohio portion of the grid. Controllers take no action, but power shifted by the first failure onto another 345-kV power line causes it to sag into a tree. While Midwest ISO and FirstEnergy controllers try to understand the failures, they fail to inform system controllers in nearby states.
     • 3:39 p.m. A FirstEnergy 138-kV line fails.
     • 3:41 and 3:46 p.m. Two breakers connecting FirstEnergy's grid with American Electric Power are tripped as a 345-kV power line and 15 138-kV lines fail in northern Ohio. Later analysis suggests that this could have been the last possible chance to save the grid if controllers had cut off power to Cleveland at this time.
     • 4:06 p.m. A sustained power surge on some Ohio lines begins an uncontrollable cascade after another 345-kV line fails.
     • 4:09:02 p.m. Voltage sags deeply as Ohio draws 2 GW of power from Michigan.
     • 4:10:34 p.m. Many transmission lines trip out, first in Michigan and then in Ohio, blocking the eastward flow of power. Generators go down, creating a huge power deficit. In seconds, power surges out of the East, tripping East Coast generators to protect them, and the blackout is on.
     • 4:10:37 p.m. Eastern Michigan grid disconnects from the western part of the state.
     • 4:10:38 p.m. Cleveland separates from the Pennsylvania grid.
     • 4:10:39 p.m. 3.7 GW of power flows from the East through Ontario to southern Michigan and northern Ohio, more than ten times larger than the condition 30 seconds earlier, causing a voltage drop across the system.
     • 4:10:40 p.m. Flow flips to 2 GW eastward from Michigan through Ontario, then flips westward again in a half second.
     • 4:10:43 p.m. International connections begin failing.
     • 4:10:45 p.m. Western Ontario separates from the east when a power line north of Lake Superior disconnects. First Ontario plants go offline in response to the unstable system. Quebec is protected because its lines are DC, not AC.
     • 4:10:46 p.m. New York separates from the New England grid.
     • 4:10:50 p.m. Ontario separates from the Western New York grid.
     • 4:11:57 p.m. Last lines between Michigan and Ontario fail.
     • 4:13 p.m. End of cascade. 256 power plants are offline. 85% went offline after the grid separations occurred, most of them on automatic controls. 50 million people are without power.

  9. Therac-25 - the problem • When operating in soft X-ray mode, the machine was designed to rotate three components into the path of the electron beam, in order to shape and moderate the power of the beam. … • The accidents occurred when the high-energy electron-beam was activated without the target having been rotated into place; the machine's software did not detect that this had occurred, and did not therefore determine that the patient was receiving a potentially lethal dose of radiation, or prevent this from occurring.

  10. Therac-25 - the reasons • The design lacked hardware interlocks to prevent the electron-beam from operating in its high-energy mode without the target in place. • The engineer had reused software from older models. These models had hardware interlocks and were therefore not as vulnerable to the software defects. • The hardware provided no way for the software to verify that sensors were working correctly. • The equipment control task did not properly synchronize with the operator interface task, so that race conditions occurred if the operator changed the setup too quickly. This was evidently missed during testing, since it took some practice before operators were able to work quickly enough for the problem to occur. • The software set a flag variable by incrementing it. Occasionally an arithmetic overflow occurred, causing the software to bypass safety checks.
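The last bullet, a flag that is set by incrementing it, can be illustrated with a minimal sketch. The one-byte width of the flag is an assumption not stated on the slide, and the function names are hypothetical; the point is only that an incremented flag periodically wraps to zero, while an assigned flag does not.

```python
def increment_flag_8bit(flag: int) -> int:
    """Mimic incrementing a one-byte variable: 255 + 1 wraps around to 0."""
    return (flag + 1) & 0xFF


def check_needed(flag: int) -> bool:
    """The safety check runs only when the flag is nonzero."""
    return flag != 0


def passes_skipping_check(n_passes: int) -> list[int]:
    """Return the pass numbers on which the safety check is silently skipped."""
    flag = 0
    skipped = []
    for pass_number in range(1, n_passes + 1):
        flag = increment_flag_8bit(flag)   # "set" the flag by incrementing
        if not check_needed(flag):
            skipped.append(pass_number)
    return skipped


def passes_skipping_check_fixed(n_passes: int) -> list[int]:
    """Same loop, but the flag is assigned a constant instead of incremented."""
    skipped = []
    for pass_number in range(1, n_passes + 1):
        flag = 1                           # cannot overflow
        if not check_needed(flag):
            skipped.append(pass_number)
    return skipped


if __name__ == "__main__":
    print("incremented flag skips passes:", passes_skipping_check(600))
    print("assigned flag skips passes:   ", passes_skipping_check_fixed(600))
```

A defect like this is hard to find by testing because it appears only on every 256th pass, and only matters if the machine is in an unsafe configuration at that moment.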

  11. Question What are the common features of those famous failures?

  12. Software Warranties DISCLAIMER OF WARRANTIES. TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, MICROSOFT AND ITS SUPPLIERS PROVIDE TO YOU THE SOFTWARE COMPONENT, AND ANY (IF ANY) SUPPORT SERVICES RELATED TO THE SOFTWARE COMPONENT ("SUPPORT SERVICES") AS IS AND WITH ALL FAULTS; AND MICROSOFT AND ITS SUPPLIERS HEREBY DISCLAIM WITH RESPECT TO THE SOFTWARE COMPONENT AND SUPPORT SERVICES ALL WARRANTIES AND CONDITIONS, WHETHER EXPRESS, IMPLIED OR STATUTORY, INCLUDING, BUT NOT LIMITED TO, ANY (IF ANY) WARRANTIES OR CONDITIONS OF OR RELATED TO: TITLE, NON-INFRINGEMENT, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, LACK OF VIRUSES, ACCURACY OR COMPLETENESS OF RESPONSES, RESULTS, LACK OF NEGLIGENCE OR LACK OF WORKMANLIKE EFFORT, QUIET ENJOYMENT, QUIET POSSESSION, AND CORRESPONDENCE TO DESCRIPTION. THE ENTIRE RISK ARISING OUT OF USE OR PERFORMANCE OF THE SOFTWARE COMPONENT AND ANY SUPPORT SERVICES REMAINS WITH YOU.

  13. Warranty Laws • Article 2 of the Uniform Commercial Code • What specifically is at issue in many cases are the disks you buy with software to load onto your computer or the updates which are internally loaded when you agree to provisions of what are called licensing agreements. Are these purchases/updates "transactions in goods" under the UCC Article 2? • At first the ALI and NCCUSL decided to handle this problem by a separate section of the UCC, which it would have called Article 2B. However, the ALI withdrew from the project when there seemed to be no attempt to bring all such transactions under the scope of Article 2. Thus, the remaining pieces of what was formerly 2B became a statute UCITA ("Uniform Computer Information Transactions Act"). Article 2 would then be revised to eliminate all reference to information and UCITA would carry the burden on that front. [In Article 2, the term "goods" does not include information.] UCITA was supposed to pick up the slack. But, UCITA ran into a lot of difficulties and only two states have approved it. Thus, that leaves us potentially in legal limbo regarding whether these software packages and other similar transactions are really Article 2 transactions. http://www.drbilllong.com/Sales/ScopeII.html

  14. Warranty Laws • The Uniform Computer Information Transactions Act (UCITA) allows software manufacturers to: • disclaim all liability for defects • prevent the transfer of software from person to person • remotely disable licensed software during a dispute • UCITA does not apply to embedded systems

  15. Warranty Lawsuits Mortenson v. Timberline Software (≈1993) • Mortenson used a Timberline Software (TS) application when creating a bid to build a hospital. • The software created a bid that was $2M too low. • TS knew about the bug, but had not sent an update to Mortenson. • The State of Washington Supreme Court ruled in favor of TS.

  16. Liability Q: Can you be held criminally liable for your software's defects? A: Generally, no. • You are liable for embedded systems. • E.g., Toyota cruise control • Software Errors are usually covered by other laws: • COPPA - illegal to collect data on users under age 13 w/o parental consent • FERPA - protects student information • HIPAA - protects patients' information • FDA examines medical devices for safety

  17. How can we improve reliability? • Use of Software Engineering practices • Make software development more science and less art. • A science is only as mature as its measurement devices.

  18. Software Engineering, the early years: The "Software Crisis" [Figure: demand for and supply of programmers plotted against time, from 1960 to today]

  19. Software Engineering covered in CSCI 475/476 • Methods • e.g. how to test modules of code • Procedures / Best Practices • e.g. successful requirements engineering • Metrics • what do we do well? • what do we do poorly? • how productive are we?

  20. Software Quality Assurance covered in CSCI 521 • Formal Peer Reviews • Coding and Design Standards • Employee Training • Risk Management

  21. How much quality is cost effective? [Figure: cost plotted against software quality, with curves for development and SQA costs, cost of failure, and SQA + failure costs; the optimal quality level is marked]

  22. How SQA Pays for Itself [Figure: cost plotted against software quality, with curves for initial cost of SQA, eventual cost of SQA, cost of failure, and SQA + failure costs; the optimal quality level is marked]

  23. Next Classes • Professional Codes of Ethics • Your Project Proposals