Windows Hardware Error Architecture • John Strange • Software Design Engineer, Core OS • JohnStra@microsoft.com • Microsoft Corporation
Session Outline • Motivation and current limitations • Goals and non-goals • Architectural overview • Platform requirements • Baseboard Management Controller (BMC) integration
Session Goals • Attendees should leave this session with a good understanding of the following: • Windows support for reporting on and recovering from hardware errors • What WHEA is • What will be required of platforms to support WHEA • How to develop software components to participate in and extend WHEA • Where to find resources for WHEA
Motivation – Why WHEA? • According to current OCA (Online Crash Analysis) data, approximately 7-10% of all reported crashes are due to hardware errors (e.g., processor, memory, cache), and these numbers do not include NMI crashes • The root cause of these crashes cannot be determined from crash data because the infrastructure is lacking and the error information is insufficient • The OS is poorly positioned to prevent these crashes, again because the infrastructure is lacking • No common error record format • Little support for error management software • Existing error management implementations are proprietary
Motivation – Current Limitations • No coordinated OS hardware error infrastructure • Processor architectures have distinct error signaling and reporting mechanisms (e.g., Itanium vs. x64) • OS does not effectively leverage platform-specific capabilities • Very poor support for I/O hardware errors (e.g., NMI) • Lack of a common OS error record format limits OS participation • Disparate error sources with distinct error signaling and reporting mechanisms • A PCI/PCI-X error results in a non-maskable interrupt (NMI) on x86/x64 and a BERR on Itanium • OS/FW integration is architecture-specific • No common mechanism for discovery of hardware error sources
Goals • Reduce mean time to recovery for fatal hardware errors through richer error reporting • Lower the hardware-error-related crash count through effective OS hardware error recovery and health monitoring • Enable powerful error management applications • Position Windows to effectively utilize existing and future hardware error standards • Itanium Machine Check Architecture (MCA) • PCI Express Advanced Error Reporting (AER) • Offer new alternatives for platform and BIOS vendors’ hardware error implementations
Non-Goals • Initial WHEA implementation will not extend to peripheral device stacks • Initial focus is on platform hardware devices: processor, memory, cache, system interconnects (e.g., PCI/PCI-X/PCI Express) • Peripheral device errors to remain under the control of their respective device drivers • Future implementations expected to include support for bus drivers (e.g., USB, 1394, Fibre Channel, and SCSI)
Architecture – What is WHEA? • Common OS hardware error handling infrastructure • Entails the following: • Generic error source discovery mechanism • Common OS hardware error record format • Common OS hardware error handling flow • OS error record persistence mechanism • Common ETW-based hardware error eventing model for management applications
Architecture – WheaReportHwError • New kernel API represents the entry point to the OS’s common hardware error handling flow: NTSTATUS WheaReportHwError(__inout PWHEA_ERROR_PACKET Packet); • This is the way kernel-mode components report hardware errors to the system • Constructs WHEA_ERROR_RECORD • Depending on the severity of the condition, either bugchecks the system or generates error events to notify consumers
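The severity split in the last bullet can be illustrated with a small user-mode sketch. Everything here is hypothetical: `sketch_packet`, `sketch_report_hw_error`, and the three severity levels are stand-ins invented for illustration, not the kernel definitions, and the mapping of "fatal" to bugcheck is an assumption based on the record-persistence slide.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical severity levels; the real WHEA definitions may differ. */
typedef enum { SEV_CORRECTED, SEV_RECOVERABLE, SEV_FATAL } sketch_severity;

/* Hypothetical stand-in for WHEA_ERROR_PACKET. */
typedef struct {
    sketch_severity severity;
    unsigned source_id;
} sketch_packet;

typedef enum { ACTION_NOTIFY, ACTION_BUGCHECK } sketch_action;

/* Models the decision only: a fatal error ends in a bugcheck,
   anything else becomes an error event for consumers. */
sketch_action sketch_report_hw_error(const sketch_packet *packet)
{
    assert(packet != NULL);
    if (packet->severity == SEV_FATAL) {
        printf("source %u: fatal, would save record and bugcheck\n",
               packet->source_id);
        return ACTION_BUGCHECK;
    }
    printf("source %u: non-fatal, would raise an error event\n",
           packet->source_id);
    return ACTION_NOTIFY;
}
```

The sketch returns the chosen action instead of performing it, so the decision can be exercised in isolation.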
Architecture – PSHED (Platform-Specific Hardware Error Driver) • Abstracts platform hardware error signaling and reporting mechanisms • Interfaces with platform hardware and/or firmware • Exposes a consistent interface to the platform’s hardware error mechanisms for the OS • Required component; always installed • Microsoft will implement a default PSHED for each processor architecture with basic functionality • Microsoft may implement an enhanced PSHED for key platforms • Plug-in model enables PSHED extensibility • Augment/override default behavior • Offers new alternatives relative to the positioning of hardware error handling code (e.g., in firmware vs. OS)
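One common way to realize an "augment/override default behavior" plug-in model is a table of function pointers with a default entry. The sketch below assumes that shape; the slides do not describe the actual plug-in interface, and every name here (`sketch_pshed`, `sketch_pshed_register_plugin`, `example_oem_retrieve`) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: writes platform data into buf, returns bytes written. */
typedef size_t (*retrieve_info_fn)(unsigned char *buf, size_t cap);

/* Hypothetical PSHED state: a default handler plus an optional override. */
typedef struct {
    retrieve_info_fn default_retrieve;
    retrieve_info_fn plugin_retrieve;   /* NULL when no plug-in is registered */
} sketch_pshed;

/* Default behavior in this sketch: no platform-specific data. */
static size_t default_retrieve(unsigned char *buf, size_t cap)
{
    (void)buf;
    (void)cap;
    return 0;
}

/* Example plug-in: contributes two made-up bytes of platform data. */
size_t example_oem_retrieve(unsigned char *buf, size_t cap)
{
    if (cap < 2) return 0;
    buf[0] = 0xAB;
    buf[1] = 0xCD;
    return 2;
}

void sketch_pshed_init(sketch_pshed *p)
{
    p->default_retrieve = default_retrieve;
    p->plugin_retrieve = NULL;
}

void sketch_pshed_register_plugin(sketch_pshed *p, retrieve_info_fn fn)
{
    p->plugin_retrieve = fn;
}

/* Dispatch: the plug-in overrides the default when one is present. */
size_t sketch_pshed_retrieve(const sketch_pshed *p, unsigned char *buf, size_t cap)
{
    retrieve_info_fn fn;
    assert(p != NULL && p->default_retrieve != NULL);
    fn = p->plugin_retrieve ? p->plugin_retrieve : p->default_retrieve;
    return fn(buf, cap);
}
```

Callers always go through `sketch_pshed_retrieve`, so installing a plug-in changes behavior without changing any call site.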
Architecture – LLHEH (Low-Level Hardware Error Handler) • OS code that first handles the error condition • May be an interrupt handler or a polling routine • May live in the kernel, HAL, or a device driver • Limited to extracting only architectural or standardized error data • Architectural machine check error banks • PCI Express error data from standardized registers • Creates and fills in a WHEA_ERROR_PACKET • Calls PshedRetrieveErrorInfo to give the PSHED an opportunity to extract platform-specific error information • Calls WheaReportHwError to report the error to the system
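The three LLHEH steps (fill in a packet, let the PSHED add platform data, report to the OS) can be sketched as a user-mode mock. The `mock_*` functions and the `mock_packet` layout are assumptions made for illustration; the slides do not show the real WHEA_ERROR_PACKET layout or the parameters of PshedRetrieveErrorInfo.

```c
#include <assert.h>
#include <string.h>

#define PLATFORM_DATA_MAX 16

/* Hypothetical stand-in for WHEA_ERROR_PACKET. */
typedef struct {
    unsigned bank_status;                            /* architectural data from the LLHEH */
    unsigned char platform_data[PLATFORM_DATA_MAX];  /* filled in by the PSHED */
    size_t platform_len;
    int reported;                                    /* set once handed to the OS */
} mock_packet;

/* Mock of PshedRetrieveErrorInfo: adds made-up platform-specific bytes. */
void mock_pshed_retrieve_error_info(mock_packet *pkt)
{
    static const unsigned char oem_bytes[] = { 0x01, 0x02, 0x03 };
    memcpy(pkt->platform_data, oem_bytes, sizeof oem_bytes);
    pkt->platform_len = sizeof oem_bytes;
}

/* Mock of WheaReportHwError: marks the packet as reported, returns 0. */
int mock_whea_report_hw_error(mock_packet *pkt)
{
    pkt->reported = 1;
    return 0;
}

/* The LLHEH steps in order: build packet, let the PSHED add data, report. */
int mock_llheh(unsigned bank_status, mock_packet *out)
{
    assert(out != NULL);
    memset(out, 0, sizeof *out);
    out->bank_status = bank_status;
    mock_pshed_retrieve_error_info(out);
    return mock_whea_report_hw_error(out);
}
```

The ordering matters: the PSHED is consulted before the report call, so the packet that reaches the OS already carries both the architectural and the platform-specific data.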
Architecture – Common OS Error Record • OS creates a common error record for all reported hardware errors • Kernel generates Event Tracing for Windows (ETW) events for consumers • ETW events contain the error record data • If the OS must bugcheck due to a fatal error, it ensures that the error record is written to persistent storage before bugchecking • Generalized record format allows generic processing of error records • Error record is extensible through non-structured data sections • Enables value-add through private error data
Architecture – OS Common Error Record Format • Record Header provides top-level description of the error condition • Timestamp • Severity • Number of sections • Section Descriptor provides key information about a section • Error Source • Length • Offset • Section Header indicates what type of information is available in the section body • Structured error data • Unstructured error data • Section Body holds the error data
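The header / descriptor / section layout above can be sketched as plain C structures, with a reader that locates a section body using only the descriptor's offset and length. Field names follow the slide; field widths, ordering, and the `rec_*` names are assumptions, not the actual WHEA_ERROR_RECORD definition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layouts; field names follow the slide, sizes are assumed. */
typedef struct {
    uint64_t timestamp;
    uint32_t severity;
    uint32_t section_count;
} rec_header;

typedef struct {
    uint32_t error_source;
    uint32_t length;   /* bytes in the section: section header plus body */
    uint32_t offset;   /* from the start of the record */
} rec_section_desc;

typedef struct {
    uint32_t structured;   /* 1 = structured data, 0 = unstructured */
} rec_section_header;

/* Builds a one-section record in buf; returns its size, or 0 if it does not fit. */
size_t rec_build(uint8_t *buf, size_t cap, uint32_t severity, uint32_t source,
                 const void *body, uint32_t body_len)
{
    rec_header hdr = { 0, severity, 1 };
    rec_section_header sh = { 0 };
    rec_section_desc desc;
    size_t total;

    desc.error_source = source;
    desc.offset = (uint32_t)(sizeof hdr + sizeof desc);
    desc.length = (uint32_t)sizeof sh + body_len;
    total = (size_t)desc.offset + desc.length;
    if (total > cap) return 0;

    memcpy(buf, &hdr, sizeof hdr);
    memcpy(buf + sizeof hdr, &desc, sizeof desc);
    memcpy(buf + desc.offset, &sh, sizeof sh);
    memcpy(buf + desc.offset + sizeof sh, body, body_len);
    return total;
}

/* Generic reader: finds the body of section i from the descriptors alone. */
const uint8_t *rec_section_body(const uint8_t *buf, uint32_t i, uint32_t *body_len)
{
    rec_header hdr;
    rec_section_desc desc;

    memcpy(&hdr, buf, sizeof hdr);
    assert(i < hdr.section_count);
    memcpy(&desc, buf + sizeof hdr + i * sizeof desc, sizeof desc);
    *body_len = desc.length - (uint32_t)sizeof(rec_section_header);
    return buf + desc.offset + sizeof(rec_section_header);
}
```

Because the reader never interprets the body, it handles unstructured (private) sections the same way as structured ones, which is the point of the generalized format.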
Architecture – Error Handling Flow • [Flow diagram not reproduced; only its labels survive] • Components: LLHEH, PSHED, PSHED plug-ins, NTOS • Inputs and data: signal (INTx, MC#, etc.), error packet, OS error record • Calls: PshedReportHwErr, PshedRetrieveErrorInfo, PshedFinalizeRecord, PshedAttemptRecovery, PshedSaveErrRecord • Steps: Build Record, Process Error, Finish Record, Save Record, Notify, Bugcheck, End • Decisions: Contained? (Y/N), Recovered? (Y/N)
Platform Requirements – Error Source Discovery • Today, the OS hard-codes error sources using a priori knowledge • WHEA will require support from the platform to discover the hardware error sources available • Current thinking is that this will be a new ACPI table created by the BIOS • The OS will create abstractions for these error sources that enable status and configuration information to be retrieved and controlled • Some error source parameters (e.g., error thresholds) will not be generally adjustable • The OEM will decide the appropriate values for these kinds of platform-specific error source parameters
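A rough sketch of what an OS-side abstraction over discovered error sources might look like, including a parameter the OEM has locked. The descriptor layout, the source types, and the `src_*` functions are all hypothetical; the slides only state that a firmware-provided table is the current thinking and do not define its format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error source types. */
typedef enum { SRC_MACHINE_CHECK, SRC_NMI, SRC_PCIE_AER } src_type;

/* Hypothetical descriptor, as if parsed from a firmware-provided table. */
typedef struct {
    src_type type;
    unsigned threshold;
    int threshold_adjustable;   /* 0: value fixed by the OEM */
} src_desc;

/* Enumerates the table instead of relying on hard-coded knowledge. */
size_t src_count_of_type(const src_desc *table, size_t n, src_type type)
{
    size_t i, count = 0;
    for (i = 0; i < n; i++) {
        if (table[i].type == type) count++;
    }
    return count;
}

/* Returns 0 on success, -1 if the OEM has locked this parameter. */
int src_set_threshold(src_desc *src, unsigned value)
{
    assert(src != NULL);
    if (!src->threshold_adjustable) return -1;
    src->threshold = value;
    return 0;
}
```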
Platform Requirements – Error Record Serialization • The OS will require that the platform implement a serialization interface through which it will store and retrieve error records • Platform chooses where records are stored • Only requirement is that the store is persistent across sessions • Platform may hook the interface to store records for its own use • Current thinking is that, for x86/x64 platforms, the minimum storage requirement will be 1 kilobyte, which is enough to hold at least one error record
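A minimal sketch of a store/retrieve interface sized to the 1-kilobyte minimum mentioned above. The `rec_store` type and `store_*` functions are hypothetical, and an in-memory buffer stands in for whatever persistent storage the platform would actually choose, so persistence across sessions is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STORE_BYTES 1024   /* the slide's proposed x86/x64 minimum */

/* Hypothetical store: a buffer standing in for platform persistent storage. */
typedef struct {
    unsigned char data[STORE_BYTES];
    size_t used;            /* 0 means no record is stored */
} rec_store;

/* Returns 0 on success, -1 if the record is empty or does not fit. */
int store_write(rec_store *s, const void *rec, size_t len)
{
    assert(s != NULL && rec != NULL);
    if (len == 0 || len > STORE_BYTES) return -1;
    memcpy(s->data, rec, len);
    s->used = len;
    return 0;
}

/* Copies the stored record into out; returns its length,
   or 0 if nothing is stored or cap is too small. */
size_t store_read(const rec_store *s, void *out, size_t cap)
{
    assert(s != NULL && out != NULL);
    if (s->used == 0 || s->used > cap) return 0;
    memcpy(out, s->data, s->used);
    return s->used;
}

/* Discards the stored record, for example after it has been consumed. */
void store_clear(rec_store *s)
{
    assert(s != NULL);
    s->used = 0;
}
```

A platform that also wants the records for its own use could wrap `store_write` and forward a copy elsewhere before returning, which is the kind of hook the slide alludes to.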
BMC Integration • Platforms with existing BMC support may continue to use their implementations unchanged • OS will ensure that duplicate error reports do not occur • Platform may hook error record serialization interface to record WHEA error records in the BMC or it may send the WHEA error records over a network to another system
Call To Action • Send us your questions • Watch for WHEA logo requirements for Longhorn • Evaluate how your products will integrate with WHEA
Community Resources • Windows Hardware & Driver Central (WHDC) • www.microsoft.com/whdc/default.mspx • Technical Communities • www.microsoft.com/communities/products/default.mspx • Non-Microsoft Community Sites • www.microsoft.com/communities/related/default.mspx • Microsoft Public Newsgroups • www.microsoft.com/communities/newsgroups • Technical Chats and Webcasts • www.microsoft.com/communities/chats/default.mspx • www.microsoft.com/webcasts • Microsoft Blogs • www.microsoft.com/communities/blogs
Additional Resources • Email: Send feedback and questions to WHEAFB@microsoft.com • Related Sessions • TWAR05009: Error Management Solutions Synergy with WHEA
© 2005 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.