http://www.predator-project.eu/ Timing Analysis and Timing Predictability Reinhard Wilhelm Saarland University
Deriving Run-Time Guarantees for Hard Real-Time Systems Given: • software to produce a reaction, • a hardware platform on which to execute the software, • a required reaction time. Derive: a guarantee for timeliness.
Timing Analysis • Sound methods that determine upper bounds for all execution times, • can be seen as the search for a longest path, • through different types of graphs, • through a huge space of paths. I will show • how this huge state space originates, • how and how far we can cope with this huge state space, • what synchronous languages contribute.
Timing Analysis – the Search Space • All control-flow paths (through the binary executable), depending on the possible inputs. • Feasible as a search for a longest path if • iteration and recursion are bounded, • execution times of instructions are (positive) constants. • Elegant method: Timing Schemata (Shaw 89), an inductive calculation of upper bounds, e.g. ub (if b then S1 else S2) := ub (b) + max (ub (S1), ub (S2)). [Figure labels: Input, Software, Architecture (constant execution times)]
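The inductive calculation can be sketched in a few lines. This is a minimal sketch, not Shaw's full schemata: only the rule for `if` comes from the slide, while the rules for sequences and bounded loops, the class names (`Basic`, `Seq`, `If`, `Loop`) and all cycle counts are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Basic:
    cycles: int            # assumed constant execution time of a basic statement


@dataclass(frozen=True)
class Seq:
    first: "Stmt"
    second: "Stmt"


@dataclass(frozen=True)
class If:
    cond: "Stmt"
    then: "Stmt"
    orelse: "Stmt"


@dataclass(frozen=True)
class Loop:
    cond: "Stmt"
    body: "Stmt"
    max_iter: int          # loop bound, assumed to be known


Stmt = Union[Basic, Seq, If, Loop]


def ub(s: Stmt) -> int:
    """Upper bound on the execution time of s, computed inductively."""
    if isinstance(s, Basic):
        return s.cycles
    if isinstance(s, Seq):
        # Assumed rule: the bounds of consecutive statements add up.
        return ub(s.first) + ub(s.second)
    if isinstance(s, If):
        # Rule from the slide: ub(b) + max(ub(S1), ub(S2)).
        return ub(s.cond) + max(ub(s.then), ub(s.orelse))
    if isinstance(s, Loop):
        # Assumed rule: the condition runs max_iter + 1 times, the body max_iter times.
        return (s.max_iter + 1) * ub(s.cond) + s.max_iter * ub(s.body)
    raise TypeError(f"unknown statement: {s!r}")


# Hypothetical program: a 2-cycle statement followed by an if whose
# else-branch is a loop with at most 4 iterations.
prog = Seq(Basic(2), If(Basic(1), Basic(5), Loop(Basic(1), Basic(3), 4)))
print(ub(prog))  # 2 + 1 + max(5, 5*1 + 4*3) = 20
```

The sketch only works because every instruction has a constant cost; the following slides show why this assumption breaks on modern processors.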
High-Performance Microprocessors • increase (average-case) performance by using: Caches, Pipelines, Branch Prediction, Speculation • These features make timing analysis difficult: execution times of instructions vary widely • Best case - everything goes smoothly: no cache miss, operands ready, resources free, branch correctly predicted • Worst case - everything goes wrong: all loads miss the cache, resources are occupied, operands not ready • Span may be several hundred cycles
Variability of Execution Times Example on the PPC 755: x = a + b; corresponds to LOAD r2, _a; LOAD r1, _b; ADD r3,r2,r1. In most cases, execution will be fast. So, assuming the worst case is safe, but very pessimistic!
AbsInt's WCET Analyzer aiT IST Project DAEDALUS final review report: "The AbsInt tool is probably the best of its kind in the world and it is justified to consider this result as a breakthrough." Several time-critical subsystems of the Airbus A380 have been certified using aiT; aiT is the only validated tool for these applications.
Tremendous Progress during the past 14 Years The explosion of penalties has been compensated by the improvement of the analyses! [Figure: cache-miss penalty (values shown: 4, 25, 60, 200) and over-estimation (values shown: 30-50%, 25%, 20-30%, 15%, 10%) plotted over the years 1995 (Lim et al.), 2002 (Thesing et al.) and 2005 (Souyris et al.)]
State-dependent Execution Times • Execution times depend on the execution state. • The execution state results from the execution history. [Figure: state = semantics state (values of variables) + execution state (occupancy of resources)]
Timing Analysis – the Search Space with State-dependent Execution Times • all control-flow paths, depending on the possible inputs • all paths through the architecture for potential initial states [Figure: execution states for paths reaching the program point of the instruction mul rD, rA, rB; alternatives shown: instruction in I-cache / not in I-cache, bus occupied / not occupied, small operands / large operands; cycle labels shown: 1, 1, 4, ≥ 40]
Timing Analysis – the Search Space with out-of-order execution • all control-flow paths, depending on the possible inputs • all paths through the architecture for potential initial states • including different schedules for instruction sequences [Figure labels: Input, Software, Architecture, initial state]
Timing Analysis – the Search Space with multi-threading • all control-flow paths, depending on the possible inputs • all paths through the architecture for potential initial states • including different schedules for instruction sequences • including different interleavings of accesses to shared resources [Figure labels: Input, Software, Architecture, initial state]
Why Exhaustive Exploration? • Naive attempt: follow local worst-case transitions only • Unsound in the presence of Timing Anomalies: a path starting with a local worst case may have a lower overall execution time. Ex.: a cache miss preventing a branch misprediction • Caused by the interference between processor components. Ex.: cache hit/miss influences branch prediction; branch prediction causes prefetching; prefetching pollutes the I-cache.
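A toy model makes the unsoundness of the naive attempt concrete. The sketch below is purely illustrative: the state names and cycle counts in `TRANSITIONS` are invented, and the only thing taken from the slide is the scenario of a cache miss that prevents a branch misprediction.

```python
# Hypothetical two-step model: each state maps to its possible
# (cost in cycles, successor state) transitions.
TRANSITIONS = {
    "fetch": [(10, "after_miss"), (1, "after_hit")],  # cache miss vs. cache hit
    "after_miss": [(1, "done")],   # the miss prevented the branch misprediction
    "after_hit": [(20, "done")],   # the hit is followed by a mispredicted branch
    "done": [],
}


def greedy_bound(state: str) -> int:
    """Follow only the locally most expensive transition (the naive attempt)."""
    total = 0
    while TRANSITIONS[state]:
        cost, state = max(TRANSITIONS[state], key=lambda t: t[0])
        total += cost
    return total


def exhaustive_bound(state: str) -> int:
    """Explore every path and return the true maximum."""
    return max(
        (cost + exhaustive_bound(nxt) for cost, nxt in TRANSITIONS[state]),
        default=0,
    )


print(greedy_bound("fetch"))      # 10 + 1 = 11
print(exhaustive_bound("fetch"))  # max(10 + 1, 1 + 20) = 21
```

In this model the greedy value of 11 cycles lies below the real worst case of 21 cycles, so it is not an upper bound at all.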
State Space Explosion in Timing Analysis Sources of the explosion, in increasing order: • constant execution times • state-dependent execution times (caches, pipelines, speculation: combined cache and pipeline analysis) • out-of-order execution (superscalar processors: interleavings of all schedules) • preemptive scheduling • concurrency + shared resources (multi-core with shared resources: interleavings of several threads) [Figure axis: years + methods: ~1995 Timing schemata, ~2000 Static analysis, 2010 ???]
Coping with the Huge Search Space State space + timing anomalies: • No abstraction: exhaustive exploration of the full program, intolerable effort. For caches: O(2^(cache capacity)). • Abstraction: + compact representation, + efficient update, but loss in precision and thus over-estimation. For caches: O(cache capacity). • Splitting into • local analysis on basic blocks, • global bounds analysis. • Result: exhaustive exploration of basic blocks, + tolerable effort.
Notions • Determinism: allows the prediction of the future behavior given the actual state and knowing the future inputs. • State is split into the semantics state, i.e. a point in the program + a mapping of variables to values, and the execution state, i.e. the allocation of variables to resources and the occupancy of resources. • Timing Repeatability: same execution time for all inputs to and initial execution states of a program; allows the prediction of the future timing behavior given the PC and the control-flow context, without knowing the future inputs. • Predictability of a derived property of program behavior: expresses how this property of all behaviors of a program, independent of inputs and initial execution state, can be bounded; bounds are typically derived from invariants over the set of all execution traces. • Predictability expresses how this property of all future behaviors can be bounded, given an invariant about a set of potential states at a program point, without knowing future inputs.
Some Observations/Thoughts • Repeatability doesn't make sense for behaviors; it may make sense for derived properties, e.g. time, space and energy consumption. • Predictability concerns properties of program behaviors. • The set of all behaviors (collecting semantics) is typically not computable. • Abstraction is applied to soundly approximate program behavior and to make its determination (efficiently) computable.
Approaches • Excluding arch.-state-dependent variability by design -> • Excluding state- and input-dependent variability by design -> repeatability • Bounding the variability, but ignoring invariants about the set of possible states, i.e. assuming arch.-state-independent worst cases for each "transition" • Using invariants about sets of possible arch. states to bound future behaviors • Designing the architecture so as to support analyzability, independent of the applications or for a given set of applications. Examples: PRET architecture w/o PRET programming; PRET with PRET programming; MERASA; Predator; Predator with PROMPT.
Time Predictability • There are analysis-independent notions: • predictability as an inherent system property, i.e., predictable by an optimal analysis method, e.g. Reineke et al. for caches, Grund et al. 2009. • There are analysis-method dependent notions: • predictable by a certain analysis method. • In case of static analysis by abstract interpretation: designing a timing analysis means essentially designing abstract domains. • Achieving predictability is not difficult; not losing much performance at the same time is!
Tool Architecture • Abstract interpretations: determine enclosing intervals for the values in registers and local variables, determine loop bounds, determine infeasible paths. • Abstract interpretation (combined cache and pipeline analysis): derives invariants about architectural execution states, computes bounds on execution times of basic blocks. • Integer linear programming: determines a worst-case path and an upper bound.
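The last phase combines the per-block bounds into a bound for the whole program. The sketch below illustrates the idea only: it deliberately swaps the integer linear program for a plain longest-path search, which is enough when the control-flow graph is acyclic (loops are assumed to be already accounted for). The graph `CFG`, the block bounds `BLOCK_UB` and the function names are hypothetical.

```python
from functools import lru_cache
from typing import List

# Hypothetical acyclic control-flow graph: basic block -> successor blocks.
CFG = {
    "entry": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}

# Hypothetical upper bounds (in cycles) for the basic blocks, as a
# microarchitectural analysis might deliver them.
BLOCK_UB = {"entry": 4, "then": 12, "else": 7, "exit": 3}


@lru_cache(maxsize=None)
def path_bound(block: str) -> int:
    """Upper bound for the most expensive path starting at block."""
    tail = max((path_bound(succ) for succ in CFG[block]), default=0)
    return BLOCK_UB[block] + tail


def worst_case_path(block: str) -> List[str]:
    """One path that attains path_bound(block)."""
    path = [block]
    while CFG[block]:
        block = max(CFG[block], key=path_bound)
        path.append(block)
    return path


print(path_bound("entry"))       # 4 + max(12 + 3, 7 + 3) = 19
print(worst_case_path("entry"))  # ['entry', 'then', 'exit']
```

An ILP formulation becomes useful once loops and infeasible-path constraints enter the picture, which this sketch does not model.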
Timing Accidents and Penalties Timing Accident – cause for an increase of the execution time of an instruction Timing Penalty – the associated increase • Types of timing accidents • Cache misses • Pipeline stalls • Branch mispredictions • Bus collisions • Memory refresh of DRAM • TLB miss
Our Approach • Static Analysis of Programs for their behavior on the execution platform • computes invariants about the set of all potential execution states at all program points, • the execution states result from the execution history, • static analysis explores all execution histories. [Figure: state = semantics state (values of variables) + execution state (occupancy of resources)]
Deriving Run-Time Guarantees • Our method and tool derive Safety Properties from these invariants: certain timing accidents will never happen. Example: at program point p, instruction fetch will never cause a cache miss. • The more accidents excluded, the lower the upper bound. [Figure: variance of execution times from Fastest to Slowest, with the label "Murphy's invariant"]
Architectural Complexity implies Analysis Complexity Every hardware component whose state has an influence on the timing behavior • must be conservatively modeled, • may contribute a multiplicative factor to the size of the search space • Exception: Caches • some have good abstractions providing for highly precise analyses (LRU), cf. Diss. of J. Reineke • some have abstractions with compact representations, but not so precise analyses
Abstraction and Decomposition • Components with domains of states $C_1, C_2, \dots, C_k$. • Analysis has to track the domain $C_1 \times C_2 \times \dots \times C_k$. • Start with the powerset domain $2^{C_1 \times C_2 \times \dots \times C_k}$. • Find an abstract domain $C_1^\#$ and transform into $C_1^\# \times 2^{C_2 \times \dots \times C_k}$. This has worked for caches and cache-like devices. • Find abstractions $C_{11}^\#$ and $C_{12}^\#$, factor out $C_{11}^\#$ and transform the rest into $2^{C_{12}^\# \times \dots \times C_k}$. This has worked for the arithmetic of the pipeline. [Figure: program, value analysis ($C_{11}^\#$), program with annotations, microarchitectural analysis ($2^{C_{12}^\# \times \dots \times C_k}$)]
Complexity Issues for Predictability by Abstract Interpretation • Independent-attribute analysis: feasible for domains with no dependences or tolerable loss in precision. Examples: value analysis, cache analysis. Efficient! • Relational analysis: necessary for mutually dependent domains. Examples: pipeline analysis. Highly complex. • Other parameters: structure of the underlying domain, e.g. height of the lattice, which determines the speed of convergence of the fixed-point iteration.
My Intuition • Current high-performance processors have cyclic dependences between components • Statically analyzing components in isolation (independent-attribute method) loses too much precision. • Goals: • cutting the cycle without losing too much performance, • designing architectural components with compact abstract domains, • avoiding interference on shared resources in multi-core architectures (as far as possible). • R. Wilhelm et al.: Memory Hierarchies, Pipelines, and Buses for Future Architectures in Time-critical Embedded Systems, IEEE TCAD, July 2009
Caches: Small & Fast Memory on Chip Bridge speed gap between CPU and RAM Caches work well in the average case: Programs access data locally (many hits) Programs reuse items (instructions, data) Access patterns are distributed evenly across the cache Cache performance has a strong influence on system performance! The precision of cache analysis has a strong influence on the degree of over-estimation!
Caches: How they work CPU: read/write at memory address a, sends a request for a to the bus. Cases: • Hit: block m containing a is in the cache: request served in the next cycle. • Miss: block m is not in the cache: m is transferred from main memory to the cache, m may replace some block in the cache, request for a is served asap while the transfer still continues. Replacement strategy: LRU, PLRU, FIFO, ... determines which line to replace in a full cache (set).
Cache Analysis How to statically precompute cache contents: • Must Analysis: for each program point (and context), find out which blocks are in the cache; this gives the prediction of cache hits. • May Analysis: for each program point (and context), find out which blocks may be in the cache; the complement says what is not in the cache, which gives the prediction of cache misses. In the following, we consider must analysis unless otherwise stated.
(Must) Cache Analysis Consider one instruction in the program. There may be many paths leading to this instruction. How can we compute whether a will always be in cache independently of which path execution takes? load a Question: Is the access to a always a cache hit?
Determine LRU-Cache-Information (abstract cache states) at each Program Point Example: cache lines from youngest (age 0) to oldest (age 3), holding {x} and {a, b}. • Interpretation of this cache information: • describes the set of all concrete cache states • in which x, a, and b occur, • x with an age not older than 1, • a and b with an age not older than 2. • Cache information contains • only memory blocks guaranteed to be in cache, • associated with their maximal age.
Cache Information Cache analysis determines safe information about cache hits. Each predicted cache hit reduces the upper bound by the cache-miss penalty. Example: at load a, the computed cache information is {x} {a, b}, so the access to a is a cache hit; assume 1 cycle access time.
Cache Analysis – how does it work? How to compute for each program point an abstract cache state representing a set of memory blocks guaranteed to be in cache each time execution reaches this program point? Can we expect to compute the largest set? Trade-off between precision and efficiency – quite typical for abstract interpretation
(Must) Cache analysis of a memory access with LRU replacement • Concrete transfer function (cache): the cache contents x, a, b, y (youngest first) become a, x, b, y after the access to a. • Abstract transfer function (analysis): the cache information {x}, {a, b} becomes {a}, {b, x} after the access to a. After the access to a, a is the youngest memory block in cache, and we must assume that x has aged.
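The abstract transfer function can be sketched as follows. This is a minimal sketch for one fully associative LRU cache set, not the code of any particular analyzer: the list-of-sets representation and the names `must_update` and `always_hit` are assumptions, and the example places x at age bound 1 and a, b at age bound 2, following the interpretation given on the earlier slide.

```python
from typing import List, Set

# Abstract must-cache: index i holds the blocks whose age is at most i
# (index 0 is the youngest line).
AbstractCache = List[Set[str]]


def always_hit(cache: AbstractCache, block: str) -> bool:
    """A block listed anywhere in the must-cache is guaranteed to be cached."""
    return any(block in line for line in cache)


def must_update(cache: AbstractCache, block: str) -> AbstractCache:
    """Abstract LRU update of the must-cache for an access to block."""
    k = len(cache)
    # Age bound of the accessed block, or k if it is not guaranteed to be cached.
    h = next((i for i, line in enumerate(cache) if block in line), k)
    new: AbstractCache = [set() for _ in range(k)]
    new[0] = {block}                      # the accessed block becomes youngest
    for i, line in enumerate(cache):
        others = line - {block}
        if i < h:
            # Blocks that may be younger than the accessed block age by one;
            # blocks pushed past the last line are no longer guaranteed.
            if i + 1 < k:
                new[i + 1] |= others
        else:
            # Blocks at least as old as the accessed block keep their bound.
            new[i] |= others
    return new


before = [set(), {"x"}, {"a", "b"}, set()]
after = must_update(before, "a")
print(always_hit(before, "a"))  # True
print(after)                    # a at age 0, b and x at age 2
```

An access to a block that is not in the must-cache ages every known block by one, so `must_update(before, "z")` yields `[{"z"}, set(), {"x"}, {"a", "b"}]`.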
Combining Cache Information Consider two control-flow paths to a program point with sets S1 and S2 of memory blocks in cache. Cache analysis should not predict more than S1 ∩ S2 after the merge of paths: the elements in the intersection, with their maximal age from S1 and S2. This suggests the following method: compute cache information along all paths to a program point and calculate their intersection. But there are too many paths! More efficient method: combine cache information on the fly and iterate until the least fixpoint is reached. There is a risk of losing precision, but not in case of distributive transfer functions.
What happens when control-paths merge? • On one path we can guarantee the content { a } { } { c, f } { d }. • On the other path we can guarantee the content { c } { e } { a } { d }. • Which content can we guarantee after the merge? "Intersection + maximal age": { } { } { a, c } { d }. • Combine cache information at each control-flow merge point.
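The merge rule "intersection + maximal age" is short enough to write down directly. The sketch below assumes the same list-of-sets representation of an abstract must-cache (index 0 is the youngest line); the function name `must_join` is hypothetical, and the example values are the ones from the slide.

```python
from typing import Dict, List, Set

AbstractCache = List[Set[str]]  # index i: blocks with age at most i


def must_join(c1: AbstractCache, c2: AbstractCache) -> AbstractCache:
    """Keep only blocks guaranteed on both paths, each at its maximal age."""
    if len(c1) != len(c2):
        raise ValueError("both abstract caches must have the same associativity")
    age1: Dict[str, int] = {b: i for i, line in enumerate(c1) for b in line}
    age2: Dict[str, int] = {b: i for i, line in enumerate(c2) for b in line}
    joined: AbstractCache = [set() for _ in c1]
    for block in age1.keys() & age2.keys():       # intersection
        joined[max(age1[block], age2[block])].add(block)  # maximal age
    return joined


path1 = [{"a"}, set(), {"c", "f"}, {"d"}]
path2 = [{"c"}, {"e"}, {"a"}, {"d"}]
print(must_join(path1, path2))  # a and c at age 2, d at age 3
```

Blocks e and f disappear because each is guaranteed on only one of the two paths.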
Predictability of Caches - Speed of Recovery from Uncertainty - J. Reineke et al.: Predictability of Cache Replacement Policies, Real-Time Systems, Springer, 2007
Metrics of Predictability: evict & fill
Results: tight bounds Generic examples prove tightness.
The Influence of the Replacement Strategy Information gain through an access to m: • LRU: m is the youngest block in the cache, + aging of a prefix of unknown length of the cache contents. • FIFO: m is in the cache; at least the k-1 youngest blocks are still in cache.
Pipelines Ideal Case: 1 Instruction per Cycle [Figure: pipeline diagram of Inst 1 to Inst 4 overlapping in the stages Fetch, Decode, Execute, WB]
CPU as a (Concrete) State Machine • Processor (pipeline, cache, memory, inputs) viewed as a big state machine, performing transitions every clock cycle • Starting in an initial state for an instruction, transitions are performed until a final state is reached: • End state: instruction has left the pipeline • # transitions: execution time of instruction
Pipeline Analysis • simulates the concrete pipeline on abstract states • counts the number of steps until an instruction retires • non-determinism resulting from abstraction and timing anomalies require exhaustive exploration of paths
Integrated Analysis: Overall Picture • Fixed-point iteration over basic blocks (in context); an abstract state is a set of processor states, e.g. {s1, s2, s3}. • Cycle-wise evolution of the processor model for each instruction, e.g. move.1 (A0,D0),D1. [Figure: basic block whose incoming states s1, s2, s3 evolve into the states s10, s11, s12, s13]
Implementation • Abstract model is implemented as a DFA • Instructions are the nodes in the CFG • Domain is powerset of set of abstract states • Transfer functions at the edges in the CFG iterate cycle-wise, updating each state in the current abstract value • max{# iterations for all states} gives bound • From this, we can obtain bounds for basic blocks
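The cycle-wise transfer function over a set of abstract states can be sketched as below. This is a toy illustration, not the implementation in any real analyzer: the function `analyze_instruction`, the state encoding as `(kind, remaining_cycles)` tuples and the 10-cycle miss are all assumptions. The sketch also assumes that every state retires after finitely many cycles.

```python
from typing import Callable, Hashable, Iterable, Set, Tuple

State = Hashable


def analyze_instruction(
    states: Iterable[State],
    step: Callable[[State], Set[State]],
    retired: Callable[[State], bool],
) -> Tuple[int, Set[State]]:
    """Evolve all abstract states cycle by cycle until the instruction has
    retired in each of them; return (bound in cycles, resulting states)."""
    cycles = 0
    active = set(states)
    finished: Set[State] = set()
    while active:
        cycles += 1
        successors: Set[State] = set()
        for s in active:
            successors |= step(s)   # abstraction may yield several successors
        finished |= {s for s in successors if retired(s)}
        active = {s for s in successors if not retired(s)}
    return cycles, finished


# Toy processor model: a state is (kind, remaining cycles).
def toy_step(state):
    kind, remaining = state
    if kind == "unknown":
        # The analysis cannot classify the access, so it follows both cases.
        return {("hit", 0), ("miss", 9)}
    return {(kind, remaining - 1)}


def toy_retired(state):
    kind, remaining = state
    return kind != "unknown" and remaining == 0


bound, out = analyze_instruction({("unknown", 0)}, toy_step, toy_retired)
print(bound)  # 10: the hit case retires after 1 cycle, the miss case after 10
```

The returned bound is the maximum number of iterations over all states, and the resulting state set is what the transfer function would pass on to the next instruction.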
Classification of Pipelined Architectures • Fully timing compositional architectures: • no timing anomalies. • analysis can safely follow local worst-case paths only, • example: ARM7. • Compositional architectures with constant-bounded effects: • exhibit timing anomalies, but no domino effects, • example: Infineon TriCore • Non-compositional architectures: • exhibit domino effects and timing anomalies. • timing analysis always has to follow all paths, • example: PowerPC 755
Extended the Predictability Notion • The cache-predictability concept applies to all cache-like architecture components: • TLBs, BTBs, other history mechanisms • It does not cover the whole architectural domain.