
Microprocessor system architectures – IA32 advanced features and rests

Microprocessor system architectures – IA32 advanced features and rests. Jakub Yaghob. Multiple-processor management. Mechanisms Support for atomic operations on system memory Serializing instructions APIC L2 and L3 caches Hyper-threading Aims Maintain system memory coherence


Presentation Transcript


  1. Microprocessor system architectures – IA32 advanced features and rests Jakub Yaghob

  2. Multiple-processor management • Mechanisms • Support for atomic operations on system memory • Serializing instructions • APIC • L2 and L3 caches • Hyper-threading • Aims • Maintain system memory coherence • Maintain cache coherence • Predictable ordering of writes to memory • Distribute interrupt handling among processors • Increase system performance by exploiting multi-threaded OSs and applications

  3. Locked atomic operations • Three independent mechanisms • Guaranteed atomic operations • Bus locking using the LOCK# signal or the LOCK instruction prefix • Cache coherency protocols ensuring cache coherency for atomic operations on cached data (cache lock) (Pentium Pro+)

  4. Guaranteed atomic operations • i486+ • R/W a byte • R/W a word (2B) aligned on a word • R/W a dword (4B) aligned on a dword • Pentium+ • R/W a qword (8B) aligned on a qword • R/W a word from/to uncached memory within 32-bit bus • Pentium Pro+ • Unaligned word, dword, qword R/W from/to cached memory within a cache line

  5. Bus locking • Automatic locking • XCHG with memory • Setting the B (busy) flag of a TSS descriptor • Updating descriptors (e.g. the A flag) • Updating page tables • Interrupt acknowledgement • Software controlled locking (prefix LOCK) • Automatically assumed for XCHG • BTS, BTC, BTR • XADD, CMPXCHG, CMPXCHG8B • INC, DEC, NOT, NEG, ADD, ADC, SUB, SBB, AND, OR, XOR • Otherwise #UD exception (invalid opcode) • Memory access can be unaligned • Pentium Pro+ serializes locked operations
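
Software-controlled locking is usually reached from C through atomic operations rather than hand-written LOCK prefixes. The following is a minimal sketch using C11 `<stdatomic.h>`. The function names are hypothetical, and the mapping to LOCK XADD, LOCK CMPXCHG and XCHG is what common x86 compilers typically emit, not something the slides guarantee.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically add delta and return the previous value.
   Assumption: on x86 this is typically compiled to LOCK XADD. */
int counter_fetch_add(atomic_int *counter, int delta)
{
    return atomic_fetch_add(counter, delta);
}

/* Atomically replace *value with desired only if it still equals expected.
   Assumption: typically compiled to LOCK CMPXCHG. Returns true on success. */
bool value_compare_swap(atomic_int *value, int expected, int desired)
{
    return atomic_compare_exchange_strong(value, &expected, desired);
}

/* Atomically store desired and return the old value.
   Assumption: typically compiled to XCHG, which is locked automatically. */
int value_exchange(atomic_int *value, int desired)
{
    return atomic_exchange(value, desired);
}
```

A caller would keep the shared variable as an `atomic_int` and never mix these calls with plain, non-atomic accesses to the same object.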

  6. Self-modifying code • Option 1 • Write modified code using data segment • Jump to new code or an intermediate location • Execute the new code • Option 2 • Write modified code using data segment • Execute a serializing instruction • Execute the new code • Required for Pentium Pro+ • Performance penalty • Cross-modifying code • One CPU changes code that a second CPU executes • Synchronize the CPUs and execute a serializing instruction

  7. Memory ordering • Program-ordering • Alias strong-ordering • R/W issued on the bus in the order they occur in the instruction stream under all circumstances • i386 • Processor-ordering • Alias speculative-ordering or weak-ordering • Allows increased instruction execution speed, while maintaining memory coherency • The exact behavior depends on a model; Pentium Pro+ • Pentium and i486 • They use processor-ordering • In most cases they behave as program-ordered • R miss goes ahead of W, when all buffered W are cache hits • I/O always in the order of instruction stream (strong-ordering)

  8. Processor-ordering I. • Single-processor and WB memory • R can be carried out speculatively and in any order • R can pass buffered W, but the CPU is self-consistent • W to memory are always carried out in program order, excluding instructions CLFLUSH, MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, MOVNTPD • W can be buffered • W are not speculative; performed only for really executed (retired) instructions • Data from buffered W can be passed to waiting R within the CPU • R/W cannot pass I/O, locked or serializing instructions • R cannot pass LFENCE and MFENCE • W cannot pass SFENCE and MFENCE • Multiple CPUs • Individual CPUs behave as single-processor • Writes by a single CPU are observed in the same order by all CPUs • Writes from the individual CPUs on the bus are NOT ordered with respect to each other

  9. Processor-ordering II.

  10. „Fast string“ operation • „Fast string“ • Pentium Pro+ • MOVS or STOS • CPU works with cache lines • Reads are not performed during cache line writes • Interrupts only on the cache line border • Conditions • EDI and ESI aligned to 8B (PIII), EDI aligned to 8B (P4) • Ascending order (DF=0) • Initial counter ECX >= 64 • Source and target must not overlap by less than one cache line (64B for P4+, 32B for others) • Memory type WC or WB

  11. Strengthening or weakening memory ordering • Strengthening • I/O instructions, locked instructions, LOCK and serializing instructions • SFENCE (PIII), LFENCE and MFENCE (P4+) • SFENCE – all W finished before this instruction • LFENCE – all R finished before this instruction • MFENCE – all R and W finished before this instruction • PAT (Page Attribute Table) strengthens ordering for pages (PIII+) • Weakening or strengthening • MTRR (Memory Type Range Registers) weaken or strengthen ordering for physical memory regions (Pentium Pro+)
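
The fence instructions are normally reached from C through portable fences. The sketch below shows the usual publish/consume pattern with C11 `atomic_thread_fence`. The names `publish`, `try_consume`, `payload` and `ready` are hypothetical. As an assumption about typical compilers: on IA-32 with WB memory the release and acquire fences usually need no instruction at all, because processor-ordering already keeps writes in program order, while a `memory_order_seq_cst` fence is usually emitted as MFENCE or a locked instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static int payload;          /* ordinary data written before the flag */
static atomic_bool ready;    /* flag announcing that payload is valid */

/* Producer: write the data, then publish the flag.
   The release fence keeps the payload write ordered before the flag write. */
void publish(int value)
{
    payload = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Consumer: returns true and copies the payload to *out once it is published.
   The acquire fence keeps the flag read ordered before the payload read. */
bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return true;
}
```

Code that uses non-temporal stores (the MOVNT* instructions listed earlier) or WC memory is the case where an explicit SFENCE, for example through the `_mm_sfence()` intrinsic, is typically still needed.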

  12. Serializing instructions • CPU finishes all flags, registers and memory changes • CPU drains all buffered W to memory • Pentium+ • Privileged instructions • MOV CRx, MOV DRx, WRMSR, INVD, INVLPG, WBINVD, LGDT, LIDT, LTR • Non-privileged instructions • CPUID, IRET, RSM • Non-privileged for memory ordering • LFENCE, SFENCE, MFENCE

  13. Propagation of page table entry changes • „TLB shootdown“ • Simple method • Send IPI to all CPUs • Stop all CPUs excluding one (spin-lock) • Active CPU makes the changes (invalidates page tables in memory) and resumes all CPUs • All CPUs invalidate their TLB (selectively or all entries) • All CPUs return from IPI • Complicated and faster methods can be developed • Different TLB mappings are not used on different CPUs during the update • The OS must be prepared for a situation where CPUs use stale mapping during the update

  14. MPS 1.4 • Multiprocessor Specification • Controlled booting of multiple CPUs without a dedicated HW • HW can initiate a boot without a dedicated signal or a predefined boot CPU • All IA-32 CPUs have the same boot protocol (including HT) • Different mechanisms for different CPU models (P4 x Xeon older x Xeon newer) • BSP = Bootstrap Processor • AP = Application Processor

  15. Detecting hyper-threading or multi-core • Hardware Multi-Threading feature flag • CPUID.1:EDX[28] = 1 • Logical processors per Package • CPUID.1:EBX[23:16] • Cores per Package • Only when CPUID works with EAX=4, otherwise it has 1 core • CPUID.(EAX=4,ECX=0):EAX[31:26]+1
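
The bit fields above can be decoded with a few shifts and masks. This is a sketch only: the function names are hypothetical, and the functions take register values that the caller has already obtained from CPUID (for example through a compiler intrinsic), so the decoding itself stays portable.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.1:EDX[28]: Hardware Multi-Threading feature flag. */
bool has_htt(uint32_t cpuid1_edx)
{
    return ((cpuid1_edx >> 28) & 1u) != 0;
}

/* CPUID.1:EBX[23:16]: logical processors per package. */
unsigned logical_per_package(uint32_t cpuid1_ebx)
{
    return (cpuid1_ebx >> 16) & 0xFFu;
}

/* CPUID.(EAX=4,ECX=0):EAX[31:26] + 1: cores per package.
   max_basic_leaf is the EAX value returned by CPUID with EAX=0;
   when leaf 4 is not available, one core is assumed. */
unsigned cores_per_package(uint32_t max_basic_leaf, uint32_t cpuid4_eax)
{
    if (max_basic_leaf < 4)
        return 1;
    return ((cpuid4_eax >> 26) & 0x3Fu) + 1;
}
```

Dividing the logical processor count by the core count then gives the number of logical processors per core.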

  16. Hyper-threading – I • One core is able to execute 2 or more instruction streams • Some parts of a core are private for each logical processor, some parts are shared among logical processors

  17. Hyper-threading – II • Private state of a logical processor • General purpose registers EAX-ESP (RAX-RSP, R8-R15) • Segment registers CS-SS • EFLAGS and EIP (RIP) • x87 (ST0-ST7), MMX (MM0-MM7), SSE (XMM0-XMM7/XMM15) and their control and status registers • Control registers CRx, GDTR, IDTR, LDTR, IA32_EFER • Debug registers DRx • Time stamp • Most of MSRs (including PAT) • Local APIC • Instruction TLB • Shared state • MTRR • Data TLB • Cache, the bus • Some MSRs

  18. Multi-Core

  19. Programming MT-capable CPUs – I • Requires support from OS • Using PAUSE instruction in spin-lock • Encoded as REP NOP • Older IA-32 CPUs interpret PAUSE as NOP • Older AMD CPUs do NOT understand it • Using HLT • Idle logical processor must use HLT and must not actively wait • Using MONITOR/MWAIT • SSE3, check CPUID.1:ECX[3] = 1, available only for CPL=0 • MONITOR sets up a memory range monitored for W • MWAIT places the processor in an optimized state until a W to the monitored range occurs
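
A spin-lock that uses PAUSE in its wait loop could look like the sketch below. All names (`struct spinlock`, `spin_lock`, `cpu_relax`, ...) are hypothetical. The `_mm_pause()` intrinsic emits PAUSE on x86 compilers; on other architectures the sketch falls back to an empty statement.

```c
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* emits PAUSE (encoded as REP NOP) */
#else
#define cpu_relax() ((void)0)     /* no-op fallback on other architectures */
#endif

struct spinlock {
    atomic_bool locked;
};

void spin_init(struct spinlock *l)
{
    atomic_init(&l->locked, false);
}

/* One attempt: the exchange returns the previous value,
   so false means the lock was free and is now ours. */
bool spin_trylock(struct spinlock *l)
{
    return !atomic_exchange_explicit(&l->locked, true, memory_order_acquire);
}

void spin_lock(struct spinlock *l)
{
    while (!spin_trylock(l)) {
        /* Wait with plain reads and PAUSE until the lock looks free,
           then retry the locked exchange. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            cpu_relax();
    }
}

void spin_unlock(struct spinlock *l)
{
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

Spinning on a read and retrying the exchange only when the lock looks free keeps the waiting logical processor from issuing locked operations in a tight loop.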

  20. Programming MT-capable CPUs – II • Scheduling • Dispatch tasks to logical processors 0 for all cores, then to logical processors 1, etc. • Use thread affinity • Do not measure the speed of a CPU by an active loop • Each lock or semaphore should be placed in its own aligned 128B block of memory
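
The 128B placement rule can be expressed with C11 alignment specifiers. This is a sketch with hypothetical names (`padded_lock`, `lock_table`, `LOCK_BLOCK_SIZE`); the only value taken from the slide is the 128-byte block size.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_BLOCK_SIZE 128   /* block size recommended on the slide */

/* The alignment requirement also pads the structure to 128 bytes,
   so every lock occupies a whole block by itself. */
typedef struct {
    alignas(LOCK_BLOCK_SIZE) atomic_int lock_word;
} padded_lock;

/* In an array, each element therefore starts on its own 128B block. */
static padded_lock lock_table[4];

padded_lock *lock_at(unsigned i)
{
    return &lock_table[i % 4];
}

/* True when the lock starts exactly on a block boundary. */
bool lock_is_block_aligned(const padded_lock *l)
{
    return ((uintptr_t)l % LOCK_BLOCK_SIZE) == 0;
}
```

Without the padding, two unrelated locks could share a block and the logical processors using them would contend for the same cache lines.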

  21. APIC (Advanced Programmable Interrupt Controller) • Local APIC • Internal in CPUs • Receives interrupts from CPU’s interrupt pins, from internal sources and from an external I/O APIC • Sends and receives IPI (InterProcessor Interrupt) • I/O APIC • Part of a chipset • Receives external interrupts and relays them to a local APIC • Possibility of IPI distribution among CPUs • xAPIC • Newer architecture • EXtended APIC • P4 and Xeons

  22. APIC – xAPIC • xAPIC system (P4 and Xeon)

  23. APIC – „traditional“ APIC • APIC system (Pentium and Pentium Pro+)

  24. Local APIC structure

  25. Internal cache • Cache structure of P4 and Xeon

  26. Characteristics of caches

  27. Cache terminology • Caches use the MESI protocol for maintaining coherency • Cache line fill • An operand is read from cacheable memory • The entire cache line is read • Cache hit • An operand is in a cache • An access uses a value from a cache • Cache miss • An operand is not in a cache • Write hit • If a valid cache line exists, CPU can write into the cache • If a write misses a cache, cache line fill occurs • Snooping • CPU checks memory accesses on the bus against its cache lines

  28. MESI • Each cache line has 2 status bits • Transparent for programs • Instruction L1 has only SI • Transition by snooping • CPU detects W to the line with M • Cancel transaction • W line directly to the other CPU with branch to the memory • Moving to the I state
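
The snooping transitions can be sketched as a small state function. This follows the textbook MESI protocol and is an assumption about the general scheme, not a description of any specific IA-32 implementation; the type and function names are hypothetical.

```c
/* Four states fit into the 2 status bits per cache line. */
typedef enum {
    MESI_INVALID,
    MESI_SHARED,
    MESI_EXCLUSIVE,
    MESI_MODIFIED
} mesi_state;

typedef enum {
    SNOOP_READ,    /* another agent reads the line */
    SNOOP_WRITE    /* another agent writes the line */
} snoop_event;

/* New state of a local line after a snooped bus access.
   *writeback is set to 1 when the line was Modified, i.e. the local
   copy holds the only valid data and must be supplied or written out. */
mesi_state mesi_on_snoop(mesi_state state, snoop_event event, int *writeback)
{
    *writeback = (state == MESI_MODIFIED);
    if (state == MESI_INVALID)
        return MESI_INVALID;
    /* A snooped write invalidates the local copy;
       a snooped read leaves it shared. */
    return (event == SNOOP_WRITE) ? MESI_INVALID : MESI_SHARED;
}
```

The Modified-line case corresponds to the slide's sequence: the write is detected, the modified line is passed on, and the local line moves to the I state.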

  29. Cache control • CR0[CD] • =0 – caching enabled for the whole of system memory, can be restricted for regions or pages • =1 – caching disabled for Pentium, for other restricted • CR0[NW] • =0 – WB enabled, can be restricted • =1 – WB disabled • PCD and PWT in the page tables and directories • Disable caching/WB for pages or page directories • PCD and PWT in the CR3 • Disable caching/WB for page directories • G in the page tables (Pentium Pro+) • Does not flush TLB entry during implicit flushing (task switch, mov cr3,eax) • CR4[PGE] (Pentium Pro+) • Enables G in page tables • MTRR (Pentium Pro+) • Memory types for regions of physical memory • PAT (PIII+) • Memory types for pages
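
The control-register bits can be tested with simple masks. The sketch below assumes the architectural bit positions CR0.NW = bit 29, CR0.CD = bit 30 and CR4.PGE = bit 7, and it works on register values passed in by the caller, since reading CR0 or CR4 itself requires CPL=0. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_NW  (1u << 29)   /* Not Write-through */
#define CR0_CD  (1u << 30)   /* Cache Disable */
#define CR4_PGE (1u << 7)    /* Page Global Enable */

/* CD=0: caching enabled (may still be restricted per region or page). */
bool cr0_caching_enabled(uint32_t cr0)
{
    return (cr0 & CR0_CD) == 0;
}

/* NW=0: write-back enabled (may still be restricted). */
bool cr0_write_back_enabled(uint32_t cr0)
{
    return (cr0 & CR0_NW) == 0;
}

/* The combination CD=0, NW=1 is invalid and raises #GP when loaded. */
bool cr0_cache_setting_valid(uint32_t cr0)
{
    return !(((cr0 & CR0_CD) == 0) && ((cr0 & CR0_NW) != 0));
}

/* PGE=1: the G flag in page tables is honoured. */
bool cr4_global_pages_enabled(uint32_t cr4)
{
    return (cr4 & CR4_PGE) != 0;
}
```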

  30. Store buffers • IA-32 temporarily stores each W to memory in a store buffer • CPU continues without waiting on the memory or a cache • Transparent for software • Draining store buffers • An interrupt or an exception • Serializing instruction (Pentium Pro+) • I/O operation • LOCK operation • BINIT operation (Pentium Pro+) (machine check) • SFENCE instruction (PIII+) • MFENCE instruction (P4+)

  31. Memory types – an overview • Pentium has UC, WT, WB • Control using NW, CD • UC- from PIII with PAT

  32. Memory types – I • Strong uncacheable (UC) • The system memory is not cached • All R/W have strong-ordering, no speculation • Useful for memory-mapped I/O • Greatly reduces system performance • Uncacheable (UC-) • Like UC, can be overridden to WC using MTRR • Only PIII+ using PAT • Write Combining (WC) • The system memory is not cached • No coherency protocol • Speculative R enabled, W ordering is NOT ensured • W delayed and combined in WC buffers • Useful for video framebuffers

  33. Memory types – II • Write Through (WT) • R/W from/to the system memory cached • R comes from a cache on cache hit; cache line fills on cache miss; speculative R • W writes to a cache and the main memory on cache hit; does not write to the cache on cache miss • WC enabled • Useful for video frame buffers or devices without snooping • Write Back (WB) • R/W from/to the system memory cached • R comes from a cache on cache hit; cache line fills on cache miss; speculative R • W writes to the cache only on cache hit (the line is written back to the main memory later); cache line fill on cache miss • Cache coherency protocol • Write Protected (WP) • R comes from a cache on cache hit; cache line fills on cache miss; speculative R • W directly propagated on the system bus

  34. MTRR (Memory Type Range Registers) • Assigning memory types to the physical memory regions • Checking MTRR presence using CPUID • Read-only MSR IA32_MTRRCAP • Support for fixed ranges • Number of variable ranges (Pentium Pro+) • Support for WC type • Default type • MSR IA32_MTRR_DEF_TYPE defines the memory type for physical memory not covered by fixed and variable ranges • Fixed ranges • 8 ranges of 64K size in the lowest 512K (00000000-0007FFFF) • 16 ranges of 16K size in the next 256K (00080000-000BFFFF) • 64 ranges of 4K size in the next 256K (000C0000-000FFFFF) • Variable ranges • Address & PHYSMASKn = PHYSBASEn & PHYSMASKn • When a variable range overlaps with a fixed range, the fixed range wins
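
The variable-range rule is a single masked comparison. The sketch below applies it to full physical addresses; it assumes that the base and mask have already been converted from the MSR field layout to byte addresses, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* An address belongs to variable range n when
   Address & PHYSMASKn == PHYSBASEn & PHYSMASKn. */
bool mtrr_variable_match(uint64_t address, uint64_t phys_base, uint64_t phys_mask)
{
    return (address & phys_mask) == (phys_base & phys_mask);
}

/* Mask for a naturally aligned range whose size is a power of two,
   limited to addr_bits physical address bits (e.g. 36). */
uint64_t mtrr_mask_for_size(uint64_t size, unsigned addr_bits)
{
    uint64_t all = (addr_bits >= 64) ? ~0ull : ((1ull << addr_bits) - 1);
    return all & ~(size - 1);
}
```

For example, a 256 MB range at 0xE0000000 with 36 address bits gets the mask 0xFF0000000, so 0xE1234000 matches while 0xF0000000 does not.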

  35. PAT (Page Attribute Table) • Assigning memory type to the ranges of linear address space • Checking PAT presence using CPUID • MSR IA32_CR_PAT defines 8 types • The type for a page is selected from IA32_CR_PAT by an index created from the PAT (4), PCD (2), PWT (1) bits in page tables • It is always switched on • The initial setting after RESET is backward compatible with PCD and PWT – 2 * (WB, WT, UC-, UC)
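
The index calculation and the lookup can be sketched as follows. The memory type encodings and the reset value 0x0007040600070406 are the architectural ones for IA32_CR_PAT, where each of the 8 entries occupies one byte; the enum and function names are hypothetical.

```c
#include <stdint.h>

/* Memory type encodings used in IA32_CR_PAT entries. */
enum {
    MT_UC       = 0x00,
    MT_WC       = 0x01,
    MT_WT       = 0x04,
    MT_WP       = 0x05,
    MT_WB       = 0x06,
    MT_UC_MINUS = 0x07
};

/* Value after RESET: 2 * (WB, WT, UC-, UC), entry 0 in the lowest byte. */
#define PAT_RESET_VALUE 0x0007040600070406ull

/* Index built from the page table bits: PAT weighs 4, PCD 2, PWT 1. */
unsigned pat_index(unsigned pat, unsigned pcd, unsigned pwt)
{
    return ((pat & 1u) << 2) | ((pcd & 1u) << 1) | (pwt & 1u);
}

/* Memory type stored in the selected entry (low 3 bits of its byte). */
unsigned pat_memory_type(uint64_t ia32_cr_pat, unsigned index)
{
    return (unsigned)((ia32_cr_pat >> (8 * (index & 7u))) & 0x07u);
}
```

With the reset value, a page with PCD=0 and PWT=0 resolves to WB and a page with PCD=1 and PWT=1 resolves to UC, which matches the behavior of the PCD and PWT bits before PAT existed.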

  36. Memory types restrictions • If CR0[CD]=1, then caching is disabled • If CR0[CD]=0, then caching is restricted using PAT (or PCD and PWT) and MTRR • The most restrictive type is always selected • WT „wins“ over WB • WC „wins“ over WT and WB

  37. Reset • Sets a CPU to the well known state • CPU in the real mode • Internal caches, TLB and BTB invalidated • CPU model dependent behavior • Pentium Pro+ • All CPUs start initialization protocol, one of them is chosen as BSP and continues in an OS initialization, all other APs halt and wait for an IPI „Wait for Startup“ • i486 and Pentium • HW knows, which CPU is BSP, other APs halt and wait for SIPI • INIT • Like RESET • Internal caches, MSR, MTRR, x87, SSE do not change • Move to the real mode

  38. CPU state after RESET, INIT and power-up

  39. Microcode update • Pentium Pro+ has an interface for uploading microcode block with patches to the CPU • Microcode block is supplied by Intel directly to the BIOS vendors • Microcode block has a header with CPU model specification • Checking CPU model in the microcode header with current CPU • The microcode must be uploaded before L2 is enabled; many other constraints apply (e.g. the segment limit must not be exceeded)

  40. Virtual machine extensions (VMX) • Two classes of software • Virtual machine monitor (VMM) • Acts like a host • Full control of HW • Presents abstract HW to guests • Guest software • Guest software environment with OS and applications

  41. Virtual-machine control data structure (VMCS) – I • VMX non-root operation and VMX transitions controlled by a VMCS • Access through the VMCS pointer (one per logical CPU) • Changing the pointer using VMPTRST and VMPTRLD instructions • VMCS configuration using VMREAD, VMWRITE, VMCLEAR instructions • VMM could use a different VMCS for each virtual CPU • Each logical CPU associates a physical memory region (one 4KB frame) with each VMCS

  42. Virtual-machine control data structure (VMCS) – II • VMCS state • Inactive • after VMCLEAR • Active • Memory region after VMPTRLD • Maintains CPU state • Current • VMPTRLD loads current VMCS • VMLAUNCH, VMPTRST, VMREAD, VMRESUME and VMWRITE operate with current VMCS

  43. Virtual-machine control data structure (VMCS) – III • VMCS data • Guest-state area • CPU state is saved on VM exits and loaded from there on VM entries • Host-state area • CPU state is loaded on VM exits • VM-execution control fields • VM-exit control fields • VM-entry control fields • VM-exit information fields

  44. Guest-state area • Registers • CR0, CR3, CR4 • RSP, RIP, RFLAGS • CS, DS, ES, FS, GS, SS, LDTR, TR • Selector and part of internal cache • GDTR, IDTR • MSRs • IA32_DEBUGCTL, IA32_SYSENTER_CS, IA32_SYSENTER_ESP, IA32_SYSENTER_EIP • Activity state • Active, HLT, shutdown, wait-for-SIPI • Interruptibility state • Blocking by STI, MOV SS, NMI, SMI • Pending debug exceptions • VMCS link pointer

  45. Host-state area • Registers • CR0, CR3, CR4 • RSP, RIP • CS, DS, ES, FS, GS, SS, TR • Base address for FS, GS, TR, GDTR, IDTR • MSRs • IA32_SYSENTER_CS, IA32_SYSENTER_ESP, IA32_SYSENTER_EIP

  46. VM-execution control fields • Pin-based VM-execution controls • VM-exits on external interrupt or NMI • CPU-based VM-execution controls • Instructions and events causing VM-exits • Exception bitmap • I/O-bitmap addresses • Guest/host masks and read shadows for CR0 and CR4 • CR3 target controls • 4 target addresses+counter • CR8 access control • MSR bitmap address

  47. VM-exit control fields • VM-exit controls • Basic operation of VM-exit • VM-exit controls for MSRs • List of MSRs stored and loaded on VM-exit

  48. VM-entry control fields • VM-entry controls • Basic operation on VM-entry • VM-entry controls for MSRs • List of MSRs to be loaded on VM-entry • Event injection • “Executed” before the first guest-mode instruction • Interrupts, exceptions including error-code

  49. VM-exit information fields • Basic VM-exit information • Exit reason, exit qualification • Vectored events • Interrupts, exceptions • VM-exits during event delivery • VM-exits due to instruction execution • Instruction address, length, detailed information

  50. VMXON region • Physical memory region (4KB frame) for VMX operation • Operand of VMXON instruction
