
Windows Hardware Error Architecture

Windows Hardware Error Architecture. John Strange Software Design Engineer Core OS JohnStra @ microsoft.com Microsoft Corporation. Session Outline. Motivation and current limitations Goals and non-goals Architectural overview Platform requirements


Presentation Transcript


  1. Windows Hardware Error Architecture. John Strange, Software Design Engineer, Core OS, JohnStra @ microsoft.com, Microsoft Corporation

  2. Session Outline
      • Motivation and current limitations
      • Goals and non-goals
      • Architectural overview
      • Platform requirements
      • Baseboard Management Controller (BMC) integration

  3. Session Goals
      • Attendees should leave this session with a good understanding of the following:
          • Windows support for reporting on and recovering from hardware errors
          • What WHEA is
          • What will be required of platforms to support WHEA
          • How to develop software components to participate in and extend WHEA
          • Where to find resources for WHEA

  4. Motivation and Current Limitations

  5. Motivation – Why WHEA?
      • According to current OCA data, approximately 7-10% of all reported crashes are due to hardware errors (e.g., processor, memory, cache); these numbers do not include NMI crashes
      • The root cause of these crashes cannot be determined from crash data, because the infrastructure is lacking and the error information is insufficient
      • The OS is poorly positioned to prevent these crashes because it lacks the necessary infrastructure
          • No common error record format
          • Little support for error management software
          • Existing error management implementations are proprietary

  6. Motivation – Current Limitations
      • No coordinated OS hardware error infrastructure
      • Processor architectures have distinct error signaling and reporting mechanisms (e.g., Itanium vs. x64)
      • OS does not effectively leverage platform-specific capabilities
      • Very poor support for I/O hardware errors (i.e., NMI)
      • Lack of a common OS error record format limits OS participation
      • Disparate error sources with distinct error signaling and reporting mechanisms
      • A PCI/PCI-X error results in a non-maskable interrupt (NMI) on x86/x64 and a BERR on Itanium
      • OS/FW integration is architecture-specific
      • No common mechanism for discovery of hardware error sources

  7. Goals and Non-Goals

  8. Goals
      • Reduce mean time to recovery for fatal hardware errors through richer error reporting
      • Lower the hardware-error-related crash count through effective OS hardware error recovery and health monitoring
      • Enable powerful error management applications
      • Position Windows to effectively utilize existing and future hardware error standards
          • Itanium Machine Check Architecture (MCA)
          • PCI Express Advanced Error Reporting (AER)
      • Offer new alternatives for platform and BIOS vendors’ hardware error implementations

  9. Non-Goals
      • Initial WHEA implementation will not extend to peripheral device stacks
      • Initial focus is on platform hardware devices: processor, memory, cache, system interconnects (i.e., PCI/PCI-X/PCI Express)
      • Peripheral device errors will remain under the control of their respective device drivers
      • Future implementations are expected to include support for bus drivers (e.g., USB, 1394, Fibre Channel, and SCSI)

  10. Architectural Overview

  11. Architecture – What is WHEA?
      • Common OS hardware error handling infrastructure
      • Entails the following:
          • Generic error source discovery mechanism
          • Common OS hardware error record format
          • Common OS hardware error handling flow
          • OS error record persistence mechanism
          • Common ETW-based hardware error eventing model for management applications

  12. Architecture – Overview

  13. Architecture – WheaReportHwError
      • New kernel API that represents the entry point to the OS’s common hardware error handling flow:
          NTSTATUS WheaReportHwError( __inout PWHEA_ERROR_PACKET Packet );
      • This is the way kernel-mode components report hardware errors to the system
      • Constructs a WHEA_ERROR_RECORD
      • Depending on the severity of the condition, either bugchecks the system or generates error events to notify consumers
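The severity-based dispatch described on this slide can be illustrated with a small user-mode sketch. Everything here is a hypothetical stand-in: the slides do not give the layout of WHEA_ERROR_PACKET, so `sketch_error_packet`, `sketch_severity`, and `sketch_report_hw_error` are invented for illustration, and the mapping "fatal means bugcheck, anything else means notify" is an assumption drawn from the wording of the slides.

```c
#include <stdio.h>

/* Hypothetical severity levels; the slides only say "severity". */
typedef enum { SEV_CORRECTED, SEV_RECOVERABLE, SEV_FATAL } sketch_severity;

/* Hypothetical, simplified stand-in for WHEA_ERROR_PACKET. */
typedef struct {
    sketch_severity severity;
    unsigned source_id;   /* which error source raised the condition */
} sketch_error_packet;

typedef enum { ACTION_NOTIFY, ACTION_BUGCHECK } sketch_action;

/* Stand-in for the reporting entry point: decide what the system does
 * with a reported error. A real implementation would build the error
 * record first; this sketch only models the final decision. */
sketch_action sketch_report_hw_error(const sketch_error_packet *packet)
{
    if (packet->severity == SEV_FATAL) {
        printf("source %u: fatal error, bugcheck\n", packet->source_id);
        return ACTION_BUGCHECK;
    }
    printf("source %u: non-fatal error, notify consumers\n", packet->source_id);
    return ACTION_NOTIFY;
}
```

A caller would fill in a packet and inspect the returned action; in the real kernel flow the bugcheck path would not return to the caller at all.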

  14. Architecture – PSHED (Platform-Specific Hardware Error Driver)
      • Abstracts platform hardware error signaling and reporting mechanisms
          • Interfaces with platform hardware and/or firmware
          • Exposes a consistent interface to the platform’s hardware error mechanisms for the OS
      • Required component; always installed
          • Microsoft will implement a default PSHED for each processor architecture with basic functionality
          • Microsoft may implement an enhanced PSHED for key platforms
      • Plug-in model enables PSHED extensibility
          • Augment/override default behavior
          • Offers new alternatives relative to the positioning of hardware error handling code (e.g., in firmware vs. OS)
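One common way to realize an "augment/override default behavior" plug-in model is a table of function pointers in which a plug-in fills in only the callbacks it wants to replace. The sketch below assumes that design; the slides do not describe the actual PSHED plug-in interface, so `sketch_pshed_ops`, `sketch_pshed_register_plugin`, and the callback signatures are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical callback table; NULL in a plug-in's table means
 * "keep the default behavior for this operation". */
typedef struct {
    int (*retrieve_error_info)(unsigned source_id); /* 1 if extra data was found */
    int (*attempt_recovery)(unsigned source_id);    /* 1 if the error was recovered */
} sketch_pshed_ops;

/* Default behavior: no platform-specific data, no recovery. */
static int default_retrieve(unsigned source_id) { (void)source_id; return 0; }
static int default_recover(unsigned source_id)  { (void)source_id; return 0; }

sketch_pshed_ops sketch_pshed_defaults(void)
{
    sketch_pshed_ops ops = { default_retrieve, default_recover };
    return ops;
}

/* Overlay a plug-in on the active table, replacing only the callbacks
 * the plug-in actually provides. */
void sketch_pshed_register_plugin(sketch_pshed_ops *active,
                                  const sketch_pshed_ops *plugin)
{
    if (plugin->retrieve_error_info != NULL)
        active->retrieve_error_info = plugin->retrieve_error_info;
    if (plugin->attempt_recovery != NULL)
        active->attempt_recovery = plugin->attempt_recovery;
}

/* Example plug-in callback: pretend this platform can recover errors
 * from source 7 only. */
int sketch_vendor_recover(unsigned source_id) { return source_id == 7; }
```

Registering a plug-in table of `{ NULL, sketch_vendor_recover }` would override recovery while leaving the default error-info retrieval in place.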

  15. Architecture – LLHEH (Low-Level Hardware Error Handler)
      • OS code that first handles the error condition
          • May be an interrupt or a polling routine
          • May live in the kernel, HAL, or a device driver
      • Limited to extracting only architectural or standardized error data
          • Architectural machine check error banks
          • PCI Express error data from standardized registers
      • Creates and fills in a WHEA_ERROR_PACKET
      • Calls PshedRetrieveErrorInfo to give the PSHED an opportunity to extract platform-specific error information
      • Calls WheaReportHwError to report the error to the system
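The three LLHEH steps listed above (fill in a packet, let the PSHED add platform-specific data, report the error) can be sketched as a user-mode simulation. The packet layout and the two stub functions are assumptions made for illustration; they only stand in for WHEA_ERROR_PACKET, PshedRetrieveErrorInfo, and WheaReportHwError, whose real definitions are not shown in the slides.

```c
#include <string.h>

/* Hypothetical, simplified packet; not the real WHEA_ERROR_PACKET. */
typedef struct {
    unsigned source_id;
    unsigned arch_status;        /* value read from a standardized register */
    char     platform_info[32];  /* filled in by the PSHED stand-in */
    int      reported;           /* set once the error has been reported */
} llheh_packet;

/* Stand-in for PshedRetrieveErrorInfo: adds platform-specific data. */
static void llheh_pshed_retrieve_error_info(llheh_packet *p)
{
    strncpy(p->platform_info, "platform-specific data",
            sizeof p->platform_info - 1);
    p->platform_info[sizeof p->platform_info - 1] = '\0';
}

/* Stand-in for WheaReportHwError: marks the packet as reported. */
static int llheh_report_hw_error(llheh_packet *p)
{
    p->reported = 1;
    return 0;
}

/* The LLHEH sequence from the slide, in order. */
int llheh_handle_error(unsigned source_id, unsigned arch_status,
                       llheh_packet *out)
{
    memset(out, 0, sizeof *out);
    out->source_id = source_id;          /* 1. fill in architectural data */
    out->arch_status = arch_status;
    llheh_pshed_retrieve_error_info(out); /* 2. platform-specific data */
    return llheh_report_hw_error(out);    /* 3. report to the system */
}
```

Keeping the handler limited to standardized data and delegating the rest to the PSHED stand-in mirrors the division of labor the slide describes.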

  16. Architecture – Common OS Error Record
      • OS creates a common error record for all reported hardware errors
      • Kernel generates Event Tracing for Windows (ETW) events for consumers
          • ETW events contain the error record data
      • If the OS must bugcheck due to a fatal error, it ensures that the error record is written to persistent storage before bugchecking
      • Generalized record format allows generic processing of error records
      • Error record is extensible through non-structured data sections
          • Enables value-add through private error data

  17. Architecture – OS Common Error Record Format
      • Record Header provides a top-level description of the error condition
          • Timestamp
          • Severity
          • Number of sections
      • Section Descriptor provides key information about a section
          • Error Source
          • Length
          • Offset
      • Section Header indicates what type of information is available in the section body
          • Structured error data
          • Unstructured error data
      • Section Body holds the error data
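As a rough illustration of how such a record could be processed generically, the sketch below lays out a header, an array of section descriptors, and sections located by offset, then walks the descriptors with bounds checks. The field widths, the ordering of the parts, and the rule that a section's length covers its section header plus body are all assumptions; the slides name the fields but not the binary layout, and the `rec_` names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts built from the field names on the slide. */
typedef struct {
    uint64_t timestamp;
    uint32_t severity;
    uint32_t section_count;
} rec_header;

typedef struct {
    uint32_t error_source;
    uint32_t length;   /* assumed: section header + section body, in bytes */
    uint32_t offset;   /* assumed: from the start of the record */
} rec_section_descriptor;

typedef struct {
    uint32_t is_structured;  /* nonzero: structured data; zero: unstructured */
} rec_section_header;

/* Walk every section descriptor and return the total number of body
 * bytes, or -1 if the record is truncated or a section falls outside
 * the buffer. */
long rec_total_body_bytes(const uint8_t *buf, size_t buf_len)
{
    rec_header hdr;
    if (buf_len < sizeof hdr)
        return -1;
    memcpy(&hdr, buf, sizeof hdr);
    if (hdr.section_count >
        (buf_len - sizeof hdr) / sizeof(rec_section_descriptor))
        return -1;

    long total = 0;
    for (uint32_t i = 0; i < hdr.section_count; i++) {
        rec_section_descriptor d;
        memcpy(&d, buf + sizeof hdr + i * sizeof d, sizeof d);
        if (d.length < sizeof(rec_section_header) ||
            d.offset > buf_len || d.length > buf_len - d.offset)
            return -1;
        total += (long)(d.length - sizeof(rec_section_header));
    }
    return total;
}
```

Because the walker relies only on descriptors, it never has to understand a section body, which is the property that lets unstructured private data ride along in a common record.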

  18. Architecture – Error Handling Flow
      [Flowchart; its layout is not recoverable from the transcript. Labels that survive: Signal (INTx, MC#, etc.), LLHEH, Error Pkt, PshedRetrieveErrorInfo, PSHED, Plug-ins, PshedReportHwErr, NTOS, Process Error, Build Record, OS Error Record, PshedAttemptRecovery, Recovered?, Contained?, PshedFinalizeRecord, Finish Record, PshedSaveErrRecord, Save Record, Notify, Bugcheck, End]

  19. Platform Requirements

  20. Platform Requirements – Error Source Discovery
      • Today, the OS hard-codes error sources using a priori knowledge
      • WHEA will require support from the platform to discover the available hardware error sources
          • Current thinking is that this will be a new ACPI table created by the BIOS
      • The OS will create abstractions for these error sources that enable status and configuration information to be retrieved and controlled
      • Some error source parameters (e.g., error thresholds) will not be generally adjustable
          • The OEM will decide the appropriate values for these kinds of platform-specific error source parameters
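To make the discovery idea concrete, the sketch below treats the platform-provided information as an array of error source descriptors that the OS enumerates, with a flag marking parameters that the OEM has fixed. The descriptor fields, the source types, and the `src_` function names are hypothetical; the slides say only that a new ACPI table is the current thinking and do not define its contents.

```c
#include <stddef.h>

/* Hypothetical error source types. */
typedef enum { SRC_MACHINE_CHECK, SRC_NMI, SRC_PCIE_AER } src_type;

/* Hypothetical descriptor for one platform-reported error source. */
typedef struct {
    unsigned id;
    src_type type;
    unsigned threshold;            /* e.g. corrected errors before notification */
    int      threshold_adjustable; /* 0: value fixed by the OEM */
} src_descriptor;

/* Enumerate the table and count the sources of a given type. */
size_t src_count_of_type(const src_descriptor *table, size_t n, src_type type)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (table[i].type == type)
            count++;
    }
    return count;
}

/* Change a threshold only when the platform allows it.
 * Returns 0 on success and -1 if the parameter is OEM-fixed. */
int src_set_threshold(src_descriptor *src, unsigned new_threshold)
{
    if (!src->threshold_adjustable)
        return -1;
    src->threshold = new_threshold;
    return 0;
}
```

Driving the OS from a table like this, rather than from hard-coded knowledge, is what would let new error sources appear without OS changes.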

  21. Platform Requirements – Error Record Serialization
      • The OS will require that the platform implement a serialization interface through which it will store and retrieve error records
          • Platform chooses where records are stored
          • Only requirement is that the store is persistent across sessions
          • Platform may hook the interface to store records for its own use
      • Current thinking is that, for x86/x64 platforms, the minimum storage requirement will be 1 kilobyte, which is enough to hold at least one error record
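A store-and-retrieve interface of this kind might look like the following sketch, which models the persistent store as a 1-kilobyte buffer holding a single record. The `store_` names, the one-record policy, and the in-memory buffer are assumptions for illustration; a real platform would back the interface with storage that survives a reboot, and the slides do not define the interface itself.

```c
#include <stddef.h>
#include <string.h>

#define STORE_CAPACITY 1024u  /* the 1-kilobyte minimum mentioned on the slide */

/* Hypothetical stand-in for the platform's persistent store. */
typedef struct {
    unsigned char bytes[STORE_CAPACITY];
    size_t used;  /* 0 means no record is stored */
} store_t;

/* Save one record, replacing any previous one.
 * Returns 0 on success, -1 if the record is empty or does not fit. */
int store_write_record(store_t *s, const void *rec, size_t len)
{
    if (len == 0 || len > STORE_CAPACITY)
        return -1;
    memcpy(s->bytes, rec, len);
    s->used = len;
    return 0;
}

/* Copy the stored record into out. Returns the record length, or 0 if
 * nothing is stored or the caller's buffer is too small. */
size_t store_read_record(const store_t *s, void *out, size_t out_len)
{
    if (s->used == 0 || out_len < s->used)
        return 0;
    memcpy(out, s->bytes, s->used);
    return s->used;
}
```

On the next boot, the OS side of such an interface would call the read function to recover a record that was saved just before a bugcheck.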

  22. BMC Integration

  23. BMC Integration
      • Platforms with existing BMC support may continue to use their implementations unchanged
      • OS will ensure that duplicate error reports do not occur
      • Platform may hook the error record serialization interface to record WHEA error records in the BMC, or it may send the WHEA error records over a network to another system

  24. Call To Action
      • Send us your questions
      • Watch for WHEA logo requirements for Longhorn
      • Evaluate how your products will integrate with WHEA

  25. Community Resources
      • Windows Hardware & Driver Central (WHDC)
          • www.microsoft.com/whdc/default.mspx
      • Technical Communities
          • www.microsoft.com/communities/products/default.mspx
      • Non-Microsoft Community Sites
          • www.microsoft.com/communities/related/default.mspx
      • Microsoft Public Newsgroups
          • www.microsoft.com/communities/newsgroups
      • Technical Chats and Webcasts
          • www.microsoft.com/communities/chats/default.mspx
          • www.microsoft.com/webcasts
      • Microsoft Blogs
          • www.microsoft.com/communities/blogs

  26. Additional Resources
      • Email: Send feedback and questions to WHEAFB @ microsoft.com
      • Related Sessions
          • TWAR05009: Error Management Solutions Synergy with WHEA

  27. © 2005 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
