Presenter
MaxAcademy Lecture Series – V1.0, September 2011
Stream Scheduling
Overview
• Latencies in stream computing
• Scheduling algorithms
• Stream offsets
Latencies in Stream Computing
• Consider a simple arithmetic pipeline: (A + B) + C
• Each operation has a latency
    • Number of cycles from input to output
    • May be zero
• Throughput is still 1 value per cycle; an operation with latency L can have L values in flight in the pipeline
[Figure sequence: basic hardware implementation of (A + B) + C, with Input A, Input B, Input C, two adders and Output. Data propagates through the circuit in "lock step". With the input values 1, 2 and 3, the value from Input C reaches the second adder at the wrong time because of the first adder's pipeline latency. Buffering is inserted on Input C to correct this. With buffering, the first adder's result 3 and the buffered value 3 reach the second adder together, and the output is 6. Success!]
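The timing problem in this pipeline can be reproduced with a small cycle-level model. The sketch below is illustrative only: it assumes every adder has a latency of 1 cycle, represents each stream as a Python list with one value per cycle, and uses hypothetical helper names (`delay`, `pipelined_add`) that are not part of any Maxeler tool.

```python
def delay(stream, cycles, fill=0):
    """Delay a stream by `cycles` clock cycles, padding the front with `fill`."""
    return [fill] * cycles + list(stream)


def pipelined_add(x, y, latency=1):
    """Add two streams cycle by cycle; each sum emerges `latency` cycles later."""
    n = max(len(x), len(y))
    x = list(x) + [0] * (n - len(x))  # pad the shorter stream with idle zeros
    y = list(y) + [0] * (n - len(y))
    return delay([p + q for p, q in zip(x, y)], latency)


a, b, c = [1, 10, 100], [2, 20, 200], [3, 30, 300]

ab = pipelined_add(a, b)                   # a + b, available one cycle late
unbuffered = pipelined_add(ab, c)          # c reaches the second adder too early
buffered = pipelined_add(ab, delay(c, 1))  # one buffer on c realigns the data

print(unbuffered)  # sums that mix values from different stream positions
print(buffered)    # correct sums, two cycles after the inputs
```

In the buffered version the correct results 6, 60 and 600 appear on consecutive cycles starting at cycle 2, so the latency is 2 cycles while the throughput remains one value per cycle.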
Stream Scheduling Algorithms
• A stream scheduling algorithm transforms an abstract dataflow graph into one that produces the correct results given the latencies of the operations
• Can be applied automatically to a large dataflow graph (many thousands of nodes)
• Can try to optimize for various metrics
    • Latency from inputs to outputs
    • Amount of buffering inserted (generally the most interesting)
    • Area (resource sharing)
ASAP As Soon As Possible
[Figure sequence: ASAP scheduling of (A + B) + C. The circuit is built up incrementally from the inputs, keeping track of latencies. Input A, Input B and Input C are at latency 0, and the first adder's output is at latency 1. The second adder's input latencies are mismatched (1 and 0), so buffering is inserted on Input C to bring it to latency 1. The output is at latency 2.]
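The incremental procedure in the ASAP figures can be sketched as a single forward pass over the graph. This is a minimal illustration, not MaxCompiler's implementation: the graph encoding (a dict listed in topological order, mapping each node to its latency and predecessors) and the function name `asap_schedule` are assumptions made for the example, and adders are given a latency of 1 to match the figures.

```python
def asap_schedule(nodes):
    """Schedule every node as early as its inputs allow.

    nodes: dict of name -> (latency, [predecessor names]), in topological order.
    Returns (start, buffers), where start[n] is the cycle at which node n
    consumes its inputs and buffers[(p, n)] is the delay inserted on edge p -> n.
    """
    start, buffers = {}, {}
    for name, (_, preds) in nodes.items():
        # Cycle at which each predecessor's result becomes available.
        arrivals = {p: start[p] + nodes[p][0] for p in preds}
        # Inputs have no predecessors and are fixed at cycle 0.
        start[name] = max(arrivals.values(), default=0)
        # Early arrivals must wait for the latest one.
        for p, t in arrivals.items():
            buffers[(p, name)] = start[name] - t
    return start, buffers


graph = {
    "A": (0, []), "B": (0, []), "C": (0, []),
    "add1": (1, ["A", "B"]),
    "add2": (1, ["add1", "C"]),
    "out": (0, ["add2"]),
}
start, buffers = asap_schedule(graph)
print(start)
print({edge: n for edge, n in buffers.items() if n})
```

For the example graph this places the second adder at cycle 1, inserts one cycle of buffering on the edge from C, and puts the output at cycle 2, in line with the latencies shown in the figures.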
ALAP As Late As Possible
[Figure sequence: ALAP scheduling of (A + B) + C. Start at the output, at latency 0. Latencies are negative relative to the end of the circuit: the second adder's inputs are at -1 and the first adder's inputs are at -2. Input C is therefore placed at -1 and Input A and Input B at -2. Buffering is saved.]
[Figure sequence: sometimes this is suboptimal. What if we add an extra output? With Output 1 and Output 2 both placed at latency 0, unnecessary buffering is added. Neither ASAP nor ALAP can schedule this design optimally.]
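ALAP can be sketched as the mirror image of ASAP: a single backward pass starting from the outputs. The code below is a hedged illustration under the same assumptions as before (adder latency 1, a hypothetical dict-based graph encoding, a hypothetical function name `alap_schedule`). The slides do not spell out the wiring of the second output, so the two-output graph assumes that Output 2 taps the result of the first adder.

```python
def alap_schedule(nodes):
    """Schedule every node as late as its consumers allow.

    nodes: dict of name -> (latency, [predecessor names]), in topological order.
    Returns (start, buffers); times are <= 0, measured from the outputs.
    """
    succs = {name: [] for name in nodes}
    for name, (_, preds) in nodes.items():
        for p in preds:
            succs[p].append(name)

    start, ready, buffers = {}, {}, {}
    for name in reversed(list(nodes)):
        latency = nodes[name][0]
        # The result must be ready for the earliest consumer; outputs sit at 0.
        ready[name] = min((start[s] for s in succs[name]), default=0)
        start[name] = ready[name] - latency
        # Later consumers need the result held until they start.
        for s in succs[name]:
            buffers[(name, s)] = start[s] - ready[name]
    return start, buffers


one_output = {
    "A": (0, []), "B": (0, []), "C": (0, []),
    "add1": (1, ["A", "B"]),
    "add2": (1, ["add1", "C"]),
    "out1": (0, ["add2"]),
}
# Assumed wiring: the extra output exposes the first adder's result.
two_outputs = dict(one_output, out2=(0, ["add1"]))

start1, buffers1 = alap_schedule(one_output)
start2, buffers2 = alap_schedule(two_outputs)
print(start1, sum(buffers1.values()))
print(start2, sum(buffers2.values()))
```

With one output the schedule needs no buffering. With the assumed second output, both outputs are pinned at 0, so the first adder's result (ready at -1) has to be buffered for one cycle before it reaches Output 2.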
Optimal Scheduling
• ASAP and ALAP each fix either the inputs or the outputs in place
• More complex scheduling algorithms may be able to find a better schedule, e.g. using integer linear programming (ILP)
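One way to pose the scheduling problem as an ILP is sketched below. This is an assumed formulation for illustration, not necessarily the one used by any particular tool. Here $t_v$ is the cycle at which node $v$ consumes its inputs, $L_u$ is the latency of node $u$, $o_{uv}$ is the latency of the wire from $u$ to $v$ (zero for an ordinary wire), and $E$ is the set of edges. Each edge contributes one inequality, and the slack in that inequality is the buffering inserted on the edge.

```latex
\begin{aligned}
\min_{t} \quad & \sum_{(u,v) \in E} \bigl( t_v - t_u - L_u - o_{uv} \bigr) \\
\text{subject to} \quad & t_v - t_u \ge L_u + o_{uv} \qquad \forall (u,v) \in E, \\
& t_v \in \mathbb{Z} \qquad \forall v .
\end{aligned}
```

Because no node is pinned in place, a solver is free to stagger both inputs and outputs. In a design where a second output taps an intermediate result, for example, that output can simply be scheduled earlier than the final output instead of being buffered. The objective above weights every edge equally; a practical cost function would likely also account for wire widths and for buffers shared between consumers of the same value.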
Buffering data on-chip
• Consider: a[i] = a[i] + (a[i - 1] + b[i - 1])
• We can see that we might need some explicit buffering to hold more than one data element on-chip
• We could do this explicitly, with buffering elements: a = a + (buffer(a, 1) + buffer(b, 1))
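[Figure sequence: Input A and Input B each feed a Buffer(1); the two buffered values are added, and the result is added to Input A to produce Output. The buffer has zero latency in the schedule. The circuit schedules with the inputs and buffers at 0, the first adder's output at 1 and the output at 2. Buffering = 3.]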
Buffers and Latency
• Accessing previous values with buffers is looking backwards in the stream
• This is equivalent to having a wire with negative latency
• Cannot be implemented directly, but can affect the schedule
[Figure sequence: the same circuit with each Buffer(1) replaced by Offset(-1). Offset wires can have negative latency: the inputs are at 0, the offset values are at -1, the first adder's output is at 0 and the output is at 1. This is scheduled with Buffering = 0.]
Stream Offsets
• A stream offset is just a wire with a positive or negative latency
• Negative latencies look backwards in the stream
• Positive latencies look forwards in the stream
• The entire dataflow graph will re-schedule to make sure the right data value is present when needed
• Buffering can be placed anywhere, or pushed into inputs or outputs, which is more efficient than manual instantiation
a[i] = a[i] + a[i + 1]
a = a + stream.offset(a, +1)
[Figure sequence: Input A, at latency 0, feeds an adder both directly and through Offset(1). Both adder inputs are scheduled at 1 and the output is at 2. Scheduling produces a circuit with 1 buffer.]
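The effect of offsets on the schedule can be checked with a forward pass in which every wire carries its own latency. This is a sketch under stated assumptions: adders have latency 1 as in the figures, the graph encoding and the name `schedule_with_offsets` are hypothetical, and an explicit Buffer(1) is modelled as costing one buffer while adding nothing to the schedule.

```python
def schedule_with_offsets(nodes):
    """Forward (ASAP-style) pass where each input wire has an offset latency.

    nodes: dict of name -> (latency, [(predecessor, wire_offset), ...]),
    in topological order. Returns (start, total_buffering).
    """
    start, total = {}, 0
    for name, (_, inputs) in nodes.items():
        # Arrival time = producer start + producer latency + wire offset.
        arrivals = [start[p] + nodes[p][0] + off for p, off in inputs]
        start[name] = max(arrivals, default=0)  # inputs are fixed at cycle 0
        total += sum(start[name] - t for t in arrivals)
    return start, total


# a = a + (stream.offset(a, -1) + stream.offset(b, -1))
backward = {
    "a": (0, []), "b": (0, []),
    "inner": (1, [("a", -1), ("b", -1)]),
    "outer": (1, [("a", 0), ("inner", 0)]),
    "out": (0, [("outer", 0)]),
}
# a = a + (buffer(a, 1) + buffer(b, 1)): the buffers are invisible to the schedule.
explicit = dict(backward, inner=(1, [("a", 0), ("b", 0)]))
# a = a + stream.offset(a, +1)
forward = {
    "a": (0, []),
    "add": (1, [("a", 0), ("a", 1)]),
    "out": (0, [("add", 0)]),
}

start_b, buf_b = schedule_with_offsets(backward)
start_e, buf_e = schedule_with_offsets(explicit)
start_f, buf_f = schedule_with_offsets(forward)
buf_e += 2  # add the two manually instantiated Buffer(1) elements

print(start_b["out"], buf_b)
print(start_e["out"], buf_e)
print(start_f["out"], buf_f)
```

Under these assumptions the backward-offset version schedules with no buffering and an output latency of 1, the explicit-buffer version needs 3 buffers with an output latency of 2, and the forward offset needs 1 buffer on the direct wire, consistent with the figures.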
Exercises
For the questions below, assume that the latency of an addition operation is 10 cycles and a multiply takes 5 cycles, while inputs/outputs take 0 cycles.
• Write pseudo-code algorithms for ASAP and ALAP scheduling of a dataflow graph
• Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and draw the buffering introduced by ASAP scheduling to:
    (a) c = ((a1 + a2) + a3) + a4
    (b) c = (a1 + a2) + (a3 + a4)
• Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and write out the inequalities that must be satisfied to schedule:
    (a) c = ((a1 * a2) + (a3 * a4)) + a1
    (b) c = stream.offset(a1, -10)*a2 + stream.offset(a1, -5)*a3 + stream.offset(a1, +15)*a4
  How many values of stream a1 will be buffered on-chip for (b)?