
IA-64 Application Architecture Tutorial






Presentation Transcript


  1. IA-64 Application Architecture Tutorial Thomas E. Willis IA-64 Architecture and Performance Group Intel Corporation thomas.e.willis@intel.com

  2. Objectives for This Tutorial • Provide background for some of the architectural decisions • Provide a description of the major features of the IA-64 application architecture • Provide introduction and overview • Describe software and performance usage models • Mention relevant design issues • Show an example of IA-64 feature usage (C → asm)

  3. Agenda for this Tutorial • IA-64 History and Strategies • IA-64 Application Architecture Overview • C → IA-64 Example • Conclusions and Q & A

  4. IA-64 Definition History • Two concurrent 64-bit architecture developments • IAX at Intel from 1991 • Conventional 64-bit RISC • Wideword at HP Labs from 1987 • Unconventional 64-bit VLIW derivative • IA-64 definition started in 1994 • Extensive participation of Intel and HP architects, compiler writers, micro-architects, logic/circuit designers • Several customers also participated as definition partners • There are 3 generations of IA-64 microprocessors currently in different stages of design

  5. IA-64 Strategies I • Extracting parallelism is difficult • Existing architectures contain limitations that prevent sufficient parallelism on in-order implementations • Strategy • Allow the compiler to exploit parallelism by removing static scheduling barriers (control and data speculation) • Enable wider machines through large register files, static dependence specification, static resource allocation

  6. IA-64 Strategies II • Branches interrupt control flow/scheduling • Mispredictions limit performance • Even with perfect branch prediction, small basic blocks of code cannot fully utilize wide machines • Strategies • Allow compiler to eliminate branches (and increase basic block size) with predication • Reduce the number and duration of branch mispredicts by using compiler generated branch hints • Allow compiler to schedule more than one branch per clock - multiway branch

  7. IA-64 Strategies III • Memory latency is difficult to hide • Increasing relative to processor speed, which implies larger cache miss penalties • Strategy • Allow the compiler to schedule for longer latencies by using control and data speculation • Explicit compiler control of data movement through an architecturally visible memory hierarchy

  8. IA-64 Strategies IV • Procedure calls interrupt scheduling/control flow • Call overhead from saving/restoring registers • Software modularity is standard • Strategy • Reduce procedure call/return overhead • Register Stack • Register Stack Engine (RSE) • Provide special support for software modularity

  9. IA-64 Strategies Summary • Move complexity of resource allocation, scheduling, and parallel execution to compiler • Provide features that enable the compiler to reschedule programs using advanced features (predication, speculation) • Enable wide execution by providing processor implementations that the compiler can take advantage of

  10. Agenda for this Tutorial • IA-64 History and Strategies • IA-64 Application Architecture Overview • C → IA-64 Example • Conclusions and Q & A

  11. IA-64 Application Architecture Overview • Application State • Instruction Format and Execution Semantics • Integer, Floating Point, and Multimedia Instructions • Memory • Control Speculation and Data Speculation • Predication • Parallel Compares • Branch Architecture • Register Stack

  12. Application State
  • General Registers: gr0–gr127, 64 bits each plus a 1-bit NaT; gr0 = 0 (hardwired); gr0–gr31 are static, gr32–gr127 are stacked
  • Floating-Point Registers: fr0–fr127, 82 bits each; fr0 = +0.0 and fr1 = +1.0 (hardwired)
  • Predicate Registers: pr0–pr63, 1 bit each; pr0 = 1 (hardwired)
  • Branch Registers: br0–br7, 64 bits each
  • Application Registers: ar0–ar127, 64 bits each (only 27 populated)
  • CPU Identifiers: cpuid0–cpuid<n>, 64 bits each
  • Performance Monitors: pmd0–pmd<m>, 64 bits each
  • Instruction Pointer (IP): 64 bits; Current Frame Marker (CFM); User Mask (UM): 6 bits

  13. IA-64 Application Architecture Overview • Application State • Instruction Format and Execution Semantics • Integer, Floating Point, and Multimedia Instructions • Memory • Control Speculation and Data Speculation • Predication • Parallel Compares • Branch Architecture • Register Stack

  14. Instruction Formats: Bundles
  • A bundle is 128 bits: a 5-bit template plus three 41-bit instruction slots (instruction 0, instruction 1, instruction 2)
  • Instruction types: M (memory), I (shifts and multimedia), A (ALU), B (branch), F (floating point), L+X (long)
  • The template encodes the instruction types in the bundle: MII, MLX, MMI, MFI, MMF; branch templates: MIB, MMB, MFB, MBB, BBB
  • The template also encodes parallelism: all templates come in two flavors, with and without a stop at the end; some also have a stop in the middle (MI_I, M_MI)
  • The template thus identifies the types of instructions in a bundle and delineates independent operations (through "stops")
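The bundle layout above can be sketched as a small decoder in C. This is an illustration of the bit layout only; the Bundle struct, the field names, and the extract_bits helper are invented for this sketch and do not correspond to any real disassembler API.

```c
#include <stdint.h>

/* Hypothetical decoder for the bundle layout described above:
 * 128 bits = 5-bit template + three 41-bit instruction slots. */
typedef struct {
    uint8_t  template_id;   /* bits 0..4 */
    uint64_t slot[3];       /* bits 5..45, 46..86, 87..127 */
} Bundle;

/* Extract the bit field [pos, pos+len) from a 128-bit value held as
 * two little-endian 64-bit halves. */
static uint64_t extract_bits(uint64_t lo, uint64_t hi, int pos, int len) {
    uint64_t v;
    if (pos + len <= 64) {
        v = lo >> pos;                          /* entirely in low half  */
    } else if (pos >= 64) {
        v = hi >> (pos - 64);                   /* entirely in high half */
    } else {
        v = (lo >> pos) | (hi << (64 - pos));   /* straddles the halves  */
    }
    return v & ((len == 64) ? ~0ULL : ((1ULL << len) - 1));
}

Bundle decode_bundle(uint64_t lo, uint64_t hi) {
    Bundle b;
    b.template_id = (uint8_t)extract_bits(lo, hi, 0, 5);
    b.slot[0] = extract_bits(lo, hi, 5, 41);
    b.slot[1] = extract_bits(lo, hi, 46, 41);
    b.slot[2] = extract_bits(lo, hi, 87, 41);
    return b;
}
```

The slot boundaries (5, 46, 87) follow directly from packing a 5-bit template and three 41-bit slots into 128 bits.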

  15. Execution Semantics I
  • Traditional architectures have sequential semantics
  • The machine must always behave as if the instructions were executed in an unpipelined, sequential fashion
  • If a machine actually issues instructions in a different order, or issues more than one instruction at a time, it must ensure that sequential execution semantics are obeyed

  Case 1 – Dependent:
  add r1 = r2, r3
  sub r4 = r1, r2
  shl r2 = r4, r8

  Case 2 – Independent:
  add r1 = r2, r3
  sub r4 = r11, r21
  shl r12 = r14, r8

  16. Execution Semantics II
  • IA-64 is an explicitly parallel (EPIC) architecture
  • The compiler uses templates with stops (";;") to indicate groups of independent operations known as "instruction groups"
  • Hardware need not check for dependent operations within instruction groups
  • WAR register dependencies are allowed
  • Memory operations still require sequential semantics
  • Predication can disable dependencies dynamically

  Case 1 – Dependent (3 instruction groups):
  add r1 = r2, r3 ;;
  sub r4 = r1, r2 ;;
  shl r12 = r1, r8

  Case 2 – Independent (1 instruction group):
  add r1 = r2, r3
  sub r4 = r11, r21
  shl r12 = r14, r8

  Case 3 – Predication (2 instruction groups):
  (p1) add r1 = r2, r3
  (p2) sub r1 = r2, r3 ;;
  shl r12 = r1, r8

  17. IA-64 Application Architecture Overview • Application State • Instruction Format and Execution Semantics • Integer, Floating Point, and Multimedia Instructions • Memory • Control Speculation, Data Speculation • Predication • Parallel Compares • Branch Architecture • Register Stack

  18. Integer Instructions • Memory – load, store, semaphore, ... • Arithmetic – add, subtract, shladd, ... • Compare – lt, gt, eq, ne, ..., tbit, tnat • Logical – and, and-complement, or, exclusive-or • Bit fields – deposit, extract • Shift pair • Character • Shifts – left, right • 32-bit support – cmp4, shladdp4 • Move – various register file moves • No-ops
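The deposit, extract, and shladd operations listed above have direct C equivalents, which may make their semantics clearer. A minimal sketch; extr, dep, and shladd are real IA-64 mnemonics, but these helper names and signatures are invented for illustration.

```c
#include <stdint.h>

/* extract: pull `len` bits starting at bit `pos`, zero-extended
 * (IA-64 extr.u-style behavior) */
uint64_t extract_u(uint64_t x, int pos, int len) {
    return (x >> pos) & ((len == 64) ? ~0ULL : ((1ULL << len) - 1));
}

/* deposit: insert the low `len` bits of `val` into `x` at bit `pos`
 * (IA-64 dep-style behavior) */
uint64_t deposit(uint64_t x, uint64_t val, int pos, int len) {
    uint64_t mask = ((len == 64) ? ~0ULL : ((1ULL << len) - 1)) << pos;
    return (x & ~mask) | ((val << pos) & mask);
}

/* shladd: shift-left-and-add, (a << count) + b -- handy for scaled
 * array indexing, count is 1..4 on the real instruction */
uint64_t shladd(uint64_t a, int count, uint64_t b) {
    return (a << count) + b;
}
```

shladd is the kind of combined operation a compiler emits for address arithmetic such as `&a[i]` with 4-, 8-, or 16-byte elements.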

  19. Floating-point Architecture • IEEE 754 compliant • Single, double, double extended (80-bit) • Canonical representation in 82-bit FP registers • Fused multiply-add instruction • 128 floating-point registers • Rotating, not stacked • Load double/single pair • Multiple FP status registers for speculation

  20. Parallel FP Support • Enable cost-effective 3D graphics platforms • Exploit data parallelism in applications using 32-bit floating-point data • Most applications' geometry calculations (transforms and lighting) are done with 32-bit floating-point numbers • Provides 2X increase in computation resources for 32-bit data-parallel floating-point operations • Floating-point registers treated as 2x32-bit single-precision elements • Full IEEE compliance • Single, double, double-extended data types, packed-64 • Similar instructions as for scalar floating-point • Availability of fast divide (non-IEEE)
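The geometry workload described above looks like the following in C: a 4x4 single-precision transform applied to an array of points. This is only the source-level pattern (the function and names are invented); on IA-64 the compiler could map pairs of these 32-bit multiplies and adds onto the 2x32-bit parallel FP instructions.

```c
/* Apply a 4x4 transform matrix m to npoints 4-component points,
 * all in 32-bit floats -- the transform-and-lighting data layout
 * the slide refers to. */
void transform_points(const float m[4][4], const float *in,
                      float *out, int npoints) {
    for (int p = 0; p < npoints; p++) {
        const float *v = &in[4 * p];
        for (int r = 0; r < 4; r++) {
            /* four multiplies + three adds per output component;
             * adjacent iterations are independent and pairable */
            out[4 * p + r] = m[r][0] * v[0] + m[r][1] * v[1]
                           + m[r][2] * v[2] + m[r][3] * v[3];
        }
    }
}
```

Each point's computation is independent, which is exactly the data parallelism the 2x32-bit packed operations exploit.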

  21. Multimedia Support • Audio and video functions typically perform the same operation on arrays of data values • IA-64 provides instructions that treat general registers as 8 x 8-, 4 x 16-, or 2 x 32-bit elements; three types of operations: • Addition and subtraction (including special-purpose forms) • Left shift, signed right shift, and unsigned right shift • Pack/unpack to convert between different element sizes • Semantically compatible with IA-32's MMX™ technology
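A scalar C model may clarify what "treating a 64-bit register as 4 x 16-bit elements" means. This sketch models a lane-wise add in both wrapping and unsigned-saturating forms; the helper names are invented, and mapping them to specific IA-64 mnemonics (the padd family) is an assumption about instruction selection.

```c
#include <stdint.h>

/* Wrapping parallel add: four independent 16-bit lanes in a 64-bit word */
uint64_t padd_4x16(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        r |= (uint64_t)(uint16_t)(x + y) << (16 * lane); /* wraps per lane */
    }
    return r;
}

/* Unsigned-saturating variant: overflow clamps to 0xFFFF instead of
 * wrapping -- the behavior audio/video code usually wants */
uint64_t padd_4x16_sat(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (uint16_t)(a >> (16 * lane));
        uint32_t y = (uint16_t)(b >> (16 * lane));
        uint32_t s = x + y;
        if (s > 0xFFFF) s = 0xFFFF;              /* saturate */
        r |= (uint64_t)s << (16 * lane);
    }
    return r;
}
```

The hardware performs all four lane additions in one instruction; the loop here only spells out the per-lane semantics.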

  22. IA-64 Application Architecture Overview • Application State • Instruction Format • Execution Semantics • Integer, Floating Point, and Multimedia Instructions • Memory • Control Speculation and Data Speculation • Predication • Parallel Compares • Branch Architecture • Register Stack

  23. Memory • Byte addressable, accessed with 64-bit pointers • Ordered and unordered memory access instructions • Access granularity and alignment • 1, 2, 4, 8, 10, 16 bytes • Instructions are always 16-byte aligned • Supports big- or little-endian byte order for data accesses • Alignment on natural boundaries is recommended for best performance

  24. Memory Hierarchy Control • Explicit control of cache allocation and deallocation • Specify levels of the memory hierarchy affected by the access • Allocation and flush resolution is at least 32-bytes • Allocation • Allocation hints indicate at which level allocation takes place • But always implies bringing the data close to the CPU • Used in load, store, and explicit prefetch instructions • Deallocation and flush • Invalidates the addressed line in all levels of cache hierarchy • Write data back to memory if necessary
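A rough source-level analogue of the explicit data-movement control described above is software prefetching with a locality hint. This sketch is an assumption, not part of the slide: it uses GCC/Clang's __builtin_prefetch (a compiler extension, not ISO C), and the look-ahead distance of 16 elements is an arbitrary illustrative choice.

```c
/* Stream-summing loop that hints the compiler/hardware to bring
 * upcoming data close to the CPU before it is needed. */
long sum_with_prefetch(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* arguments: address, rw = 0 (read), locality = 0 (low temporal
         * locality -- stream the data through, don't pollute the cache) */
        __builtin_prefetch(&a[i + 16], 0, 0);
#endif
        s += a[i];
    }
    return s;
}
```

The locality argument plays the role of the allocation hints on the slide: it tells the memory hierarchy which levels should keep the line.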

  25. IA-64 Application Architecture Overview • Application State • Instruction Format • Execution Semantics • Integer, Floating Point, and Multimedia Instructions • Memory • Control Speculation and Data Speculation • Predication • Parallel Compares • Branch Architecture • Register Stack

  26. Control and Data Speculation • Two kinds of instructions in IA-64 programs • Non-speculative instructions – known to be useful/needed • Would have been executed in the original program • Speculative instructions – may or may not be used • Schedule operations before results are known to be needed • Usually boosts performance, but may degrade it • Heuristics can guide compiler aggressiveness • Need profile data for maximum benefit • Two kinds of speculation • Control and data • Moving loads up is a key to performance • Hide increasing memory latency • Computation chains frequently begin with loads

  27. Speculation
  • Speculation separates loads into two parts: speculative loading of the data, and detection of conflicts/faults

  Control Speculation
  Original:
  (p1) br.cond
  ld8 r1 = [r2]
  Transformed:
  ld8.s r1 = [r2]
  . . .
  (p1) br.cond
  . . .
  chk.s r1, recover

  Data Speculation
  Original:
  st4 [r3] = r7
  ld8 r1 = [r2]
  Transformed:
  ld8.a r1 = [r2]
  . . .
  st4 [r3] = r7
  . . .
  chk.a r1, recover

  28. Control Speculation
  • Example:
  • Suppose br.cond checks if r2 contains a null pointer
  • Suppose the code then loads r1 by dereferencing this pointer and then uses the loaded value
  • Normally, the compiler cannot reschedule the load before the branch because of the potential fault
  • Control speculation is ... moving loads (and possibly instructions that use the loaded values) above branches on which their execution is dependent
  • Three steps...

  Traditional architectures:
  instr1
  instr2 ;;
  …
  br.cond null ;;
  ld8 r1 = [r2] ;;
  add r3 = r1, r4
  null: ...
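The null-pointer example on this slide corresponds to a very common C pattern, sketched below (the function and parameter names are invented). A conventional compiler cannot hoist the load above the null test because the load could fault; IA-64's ld8.s defers such a fault and chk.s detects it later, which is what makes the hoist legal.

```c
/* Guarded load: the branch-then-load shape from the slide.
 * The `if` is the (p1) br.cond; the dereference is the ld8 the
 * compiler would like to schedule earlier. */
long guarded_load(const long *p, long fallback) {
    if (p == 0)
        return fallback;   /* branch taken: the load never runs */
    return *p + 4;         /* the ld8 + add on the fall-through path */
}
```

With control speculation the load (and even the add) can execute before the test, and any fault is deferred via the NaT bit until chk.s confirms the result is actually needed.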

  29. Control Speculation: Step 1
  • Separate load behavior from exception behavior
  • The ld.s defers exceptions
  • The chk.s checks for deferred exceptions
  • An exception token propagates from ld.s to chk.s
  • NaT bits in general registers; NaTVal (a special NaN value) in FP registers

  Before:
  instr1
  instr2 ;;
  …
  br.cond null ;;
  ld8 r1 = [r2] ;;
  add r3 = r1, r4
  null: ...

  After:
  instr1
  instr2 ;;
  …
  br.cond null ;;
  ld8.s r1 = [r2] ;;
  chk.s r1, recover
  add r3 = r1, r4
  null: ...

  30. Control Speculation: Step 2
  • Reschedule the load and (optionally) its uses
  • Now, ld8.s will defer a fault and set the NaT bit on r1
  • The chk.s checks r1's NaT bit and branches/faults if needed
  • Allows faults to propagate
  • NaT bits in general registers; NaTVal (a special NaN value) in FP registers

  Before:
  instr1
  instr2 ;;
  …
  br.cond null ;;
  ld8.s r1 = [r2] ;;
  chk.s r1, recover
  add r3 = r1, r4
  null: ...

  After:
  ld8.s r1 = [r2]
  instr1
  instr2 ;;
  …
  add r3 = r1, r4
  br.cond null ;;
  chk.s r1, recover
  null: ...

  31. Control Speculation: Step 3
  • Add recovery code
  • All speculated instructions are repeated in a recovery block
  • The recovery block will only execute if the chk.s detects a NaT
  • Use non-speculative versions of the instructions

  Before ("null" block elided):
  ld8.s r1 = [r2]
  instr1
  instr2 ;;
  …
  add r3 = r1, r4
  br.cond null ;;
  chk.s r1, recover
  null: ...

  After:
  ld8.s r1 = [r2]
  instr1
  instr2 ;;
  …
  add r3 = r1, r4
  br.cond null ;;
  chk.s r1, recover
  back:
  ...

  recover:
  ld8 r1 = [r2] ;;
  add r3 = r1, r4
  br back

  32. Control Speculation Summary
  [Figure: three stages of the transformation. Original: load, then use. Transform: load split into a speculative load plus chk.s before the use. Reschedule/Recovery: the load (and use) hoisted earlier, the chk.s left in place, and a recovery block repeating the speculated instruction(s).]

  33. NaT Propagation
  • All computation instructions propagate NaTs to reduce the number of checks required

  ld8.s r3 = [r9]
  ld8.s r4 = [r10] ;;
  add r6 = r3, r4 ;;
  ld8.s r5 = [r6]
  cmp.eq p1, p2 = 6, r7
  (p1) br.cond froboz ;;
  chk.s r5, recover
  back:
  sub r7 = r5, r2

  froboz: ...

  recover:
  ld8 r3 = [r9]
  ld8 r4 = [r10] ;;
  add r6 = r3, r4 ;;
  ld8 r5 = [r6]
  br back

  Only one check is needed in this block!

  34. Exception Deferral • Speculation can create speculative loads whose results are never used – want to avoid taking faults on these instructions • Deferral allows the efficient delay of costly exceptions • NaTs and checks (chk) enable deferral with recovery • Exceptions are taken in the recovery code, when it is known that the result will be used for sure

  35. IA-64 Support for Control Speculation • 65th bit (NaT bit) on each GR indicates if an exception has occurred • Special speculative loads that set the NaT bit if a deferrable exception occurs • Special chk.s instruction that checks the NaT bit and, if it is set, branches to recovery code • Computational instructions propagate NaTs like IEEE NaNs • Compare operations propagate "false" when writing predicates
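The NaT propagation rules above can be modeled with a toy C sketch. This is an illustration of the propagation semantics only, not the hardware mechanism; the GR struct and all function names are invented.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: every general register carries a 65th "NaT" bit. */
typedef struct { uint64_t val; bool nat; } GR;

/* A speculative load whose deferrable exception fired: result is NaT. */
GR spec_load_fail(void)       { GR g = {0, true};  return g; }
/* A speculative load that succeeded normally. */
GR spec_load_ok(uint64_t v)   { GR g = {v, false}; return g; }

/* Computation propagates NaT: the result is NaT if ANY input was,
 * which is why one chk.s at the end covers a whole dependence chain. */
GR gr_add(GR a, GR b) {
    GR r = { a.val + b.val, a.nat || b.nat };
    return r;
}

/* chk.s analogue: true means the recovery block must run. */
bool chk(GR g) { return g.nat; }
```

In slide 33's example, the NaT from any of the three speculative loads flows through the adds into r5, so checking r5 alone is sufficient.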

  36. Data Speculation
  • Example:
  • Suppose st1 writes into a memory location fully or partially accessed by the ld8 instruction
  • If the compiler cannot guarantee that the store and load addresses never overlap, it cannot reschedule the load before the store
  • Such store-to-load dependences are ambiguous memory dependences
  • Data speculation is ... moving loads (and possibly instructions that use the loaded values) above potentially overlapping stores
  • Three steps...

  Traditional architectures:
  instr1
  instr2
  . . .
  st1 [r3] = r4
  ld8 r1 = [r2] ;;
  add r3 = r1, r4
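The ambiguous memory dependence on this slide is the everyday C situation sketched below (function and parameter names invented). If p and q might alias, the load of *q cannot legally move above the store to *p; IA-64's ld8.a and chk.a move it anyway and fall into recovery code only when the addresses actually overlapped at run time.

```c
/* The store-then-load shape from the slide: the compiler cannot
 * prove that p and q point to disjoint memory. */
long store_then_load(char *p, long *q, char v) {
    *p = v;          /* st1 [r3] = r4 */
    return *q + 1;   /* ld8 r1 = [r2] ;; add -- may or may not see v */
}
```

Where the programmer can promise non-aliasing, C99's `restrict` lets the compiler reorder statically; data speculation handles the cases where no such guarantee exists.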

  37. Data Speculation: Step 1
  • Separate load behavior from overlap detection
  • The ld8.a performs a normal load plus book-keeping
  • The chk.a checks "the books" to see if a conflict occurred
  • "The books": the Advanced Load Address Table (ALAT)
  • The ld8.a puts information about an advanced load (the address range accessed) into the ALAT
  • Stores and other memory writers "snoop" the ALAT and delete overlapping entries
  • The chk.a checks whether a corresponding entry is still in the ALAT

  Before:
  instr1
  instr2
  . . .
  st1 [r3] = r4
  ld8 r1 = [r2] ;;
  add r3 = r1, r4

  After:
  instr1
  instr2
  . . .
  st1 [r3] = r4
  ld8.a r1 = [r2] ;;
  chk.a r1, recover
  add r3 = r1, r4

  38. Data Speculation: Step 2
  • Reschedule the load and (optionally) its uses
  • The ld8.a allocates an entry in the ALAT
  • If the st1 address overlaps the ld8.a address, the ALAT entry is removed
  • The chk.a checks for a matching ALAT entry; if one is found, the speculation was OK; if not, the load must be re-executed

  Before:
  instr1
  instr2
  . . .
  st1 [r3] = r4
  ld8.a r1 = [r2] ;;
  chk.a r1, recover
  add r3 = r1, r4

  After:
  ld8.a r1 = [r2]
  instr1
  instr2
  add r3 = r1, r4
  . . .
  st1 [r3] = r4 ;;
  chk.a r1, recover

  39. Data Speculation: Step 3
  • Add recovery code
  • All speculated instructions are repeated in a recovery block
  • The recovery block will only execute if the chk.a cannot find an ALAT entry
  • Use non-speculative versions of the instructions

  Before:
  ld8.a r1 = [r2]
  instr1
  instr2
  add r3 = r1, r4
  . . .
  st1 [r3] = r4 ;;
  chk.a r1, recover

  After:
  ld8.a r1 = [r2] ;;
  instr1
  instr2
  add r5 = r1, r4 ;;
  . . .
  st1 [r3] = r4
  chk.a r1, recover
  back:
  ...

  recover:
  ld8 r1 = [r2] ;;
  add r5 = r1, r4
  br back ;;

  40. Data Speculation Summary
  [Figure: three stages of the transformation. Original: store, load, use. Transform: the load becomes an advanced load checked by chk.a after the store. Reschedule/Recovery: the load (and use) hoisted above the store, the chk.a left in place, and a recovery block repeating the speculated instructions.]

  41. IA-64 Support for Data Speculation • ALAT – hardware structure that contains information about outstanding advanced loads • Instructions • Advanced loads: ld.a • Check loads: ld.c • Advanced load checks: chk.a • Speculative advanced loads – ld.sa combines the semantics of ld.a and ld.s (i.e., it is a control-speculative advanced load with fault deferral)

  42. IA-64 Application Architecture Overview • Application State • Instruction Format • Execution Semantics • Integer, Floating Point, and Multimedia Instructions • Memory • Control Speculation and Data Speculation • Predication • Parallel Compares • Branch Architecture • Register Stack

  43. Predication Concepts • Branching causes difficult-to-handle effects • Instruction stream changes (reduces fetching efficiency) • Requires branch prediction hardware • Requires execution of branch instructions • Potential branch mispredictions • IA-64 provides predication • Allows some branches to be removed • Allows some types of safe code motion beyond branches • Basis for branch architecture and conditional execution

  44. Predication • Allows instructions to be dynamically turned on or off using a predicate register value • Compare and test instructions write predicates with the results of the comparison/test • Most compares/tests write a result and its complement • Example: cmp.eq p1,p2 = r1,0 performs p1 ← (r1 == 0), p2 ← !(r1 == 0) • Most instructions can have a qualifying predicate (qp) • Example: (p1) add r1 = r2,r3 • If the qp is true, the instruction executes normally; otherwise it is squashed (behaves as a no-op) • 64 1-bit predicate registers (true/false), p0 always "true"
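The qualifying-predicate idea can be seen at the C level as if-conversion: compute both sides, then select. A hedged sketch; the function names are invented, and the ?: select only mimics what the (p1)/(p2) predicated instructions do, it is not the IA-64 encoding itself.

```c
/* Branchy form: the control-flow diamond a traditional compiler emits. */
long branchy_abs_diff(long a, long b) {
    if (a > b) return a - b;
    else       return b - a;
}

/* If-converted form: both results are computed unconditionally, as
 * with (p1) sub / (p2) sub under complementary predicates, and the
 * compare's outcome selects one. No branch, no misprediction. */
long predicated_abs_diff(long a, long b) {
    long d1 = a - b;            /* (p1) sub r.. = a, b */
    long d2 = b - a;            /* (p2) sub r.. = b, a */
    return (a > b) ? d1 : d2;   /* select keyed on the predicate */
}
```

The trade-off on the next slide applies here too: both subtracts always execute, so predication spends fetch/execute bandwidth to buy freedom from mispredictions.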

  45. Control Flow Simplification
  • Predication changes control-flow dependences into data dependences
  • Removes branches
  • Reduces / eliminates mispredictions and branch bubbles
  • Exposes parallelism
  • Increases pressure on the instruction fetch unit
  [Figure: an if/else diamond in the original code becomes a straight-line sequence of (p1)- and (p2)-predicated instructions in the predicated code.]

  46. Multiple Selects

  Original (a chain of branches selecting r8 = 5, 7, or 10):
  cmp p1, p2 = ...
  (p1) br.cond ...   // taken path: r8 = 5
  cmp p3, p4 = ...
  (p3) br.cond ...   // taken path: r8 = 7; fall-through: r8 = 10

  Transform & Reschedule (predicated, branch-free):
  cmp p1, p2 = ... ;;
  (p2) cmp p3, p4 = … ;;
  (p1) r8 = 5
  (p3) r8 = 7
  (p4) r8 = 10

  47. Downward Code Motion
  [Figure: a store on one path of a control-flow diamond is moved downward past the join point. A predicate p1 is initialized to false, set to true along the path that originally contained the store (p1 is true along all paths from that block), and the store becomes a (p1)-guarded store after the join.]

  48. IA-64 Application Architecture Overview • Application State • Instruction Format • Execution Semantics • Integer, Floating Point, and Multimedia Instructions • Memory • Control Speculation and Data Speculation • Predication • Parallel Compares • Branch Architecture • Register Stack

  49. Parallel Compares
  • Parallel compares allow compound conditionals to be executed in a single instruction group
  • Example: if ( a && b && c ) { . . . }
  • IA-64 assembly:
  cmp.eq p1 = r0,r0 ;;   // initialize p1 = 1
  cmp.ne.and p1 = rA,0
  cmp.ne.and p1 = rB,0
  cmp.ne.and p1 = rC,0
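The transformation above has a C-level counterpart worth seeing side by side: short-circuit `a && b && c` implies a chain of branches, while a bitwise combination evaluates all three tests independently, which is what the cmp.ne.and instructions do within one instruction group. The function name below is invented.

```c
/* Branch-free compound conditional: each test is independent, and
 * the results are AND-ed together -- safe here because the operands
 * have no side effects and can all be evaluated unconditionally. */
int all_nonzero(long a, long b, long c) {
    return (a != 0) & (b != 0) & (c != 0);   /* note & , not && */
}
```

This rewrite is only valid when evaluating every operand is safe; the parallel-compare hardware makes the combined predicate available after a single instruction group, collapsing the dependence height of the condition.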

  50. Parallel Compares: Height Reduction

  Original (a chain of compares and branches):
  cmp.and pA = ...
  (pA) br.cond x
  cmp.and pB = ...
  (pB) br.cond x
  cmp.and pC = ...
  (pC) br.cond x

  Transform & Reschedule:
  cmp.eq pABC = r0,r0 ;;   // first compare initializes pABC
  cmp.and pABC = ...
  cmp.and pABC = ...
  cmp.and pABC = ...
  (pABC) br.cond x

  The cmp.and instructions execute in parallel.
