Improving Database Performance on Simultaneous Multithreading Processors
Jingren Zhou (Microsoft Research, jrzhou@microsoft.com)
John Cieslewicz (Columbia University, johnc@cs.columbia.edu)
Kenneth A. Ross (Columbia University, kar@cs.columbia.edu)
Mihir Shah (Columbia University, ms2604@columbia.edu)
Simultaneous Multithreading (SMT)
• Available on modern CPUs:
  • "Hyperthreading" on the Pentium 4 and Xeon
  • IBM POWER5
  • Sun UltraSPARC IV
• Challenge: design software that uses SMT efficiently.
• This talk: database software, on an Intel Pentium 4 with Hyperthreading.
Superscalar Processor (no SMT)
[Figure: one instruction stream issuing through a superscalar pipeline (up to 2 instructions/cycle); CPI = 3/4]
• Improved instruction-level parallelism
SMT Processor
[Figure: two interleaved instruction streams sharing the superscalar pipeline; CPI = 5/8]
• Improved thread-level parallelism
• More opportunities to keep the processor busy
• But sometimes SMT does not work so well
Stalls
[Figure: one instruction stream stalls while the other continues to issue; CPI = 3/4]
• Progress despite a stalled thread.
• Stalls are caused by cache misses (200-300 cycles for an L2 miss), branch mispredictions (20-30 cycles), etc.
Memory Consistency
[Figure: a conflicting access by the two instruction streams to a common cache line is detected, forcing a pipeline flush and a cache sync with RAM]
• A "MOMC event" (Memory Order Machine Clear) on the Pentium 4: 300-350 cycles.
SMT Processor
• Exposes multiple "logical" CPUs (one per instruction stream).
• One physical CPU (~5% extra silicon to duplicate thread state information).
• Better than single threading:
  • Increased thread-level parallelism
  • Improved processor utilization when one thread blocks
• Not as good as two physical CPUs:
  • CPU resources are shared, not replicated.
SMT Challenges
• Resource competition:
  • Shared execution units
  • Shared cache
• Thread coordination:
  • Locking, etc., has high overhead.
• False sharing:
  • MOMC events (see the sketch below).
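The false-sharing point is easiest to see in code. Below is a minimal sketch (not from the talk), assuming a 64-byte cache line: the unpadded layout puts two per-thread counters on one line, so each write invalidates the other thread's copy, the kind of conflict that surfaces as MOMC events on the Pentium 4, while the padded layout gives each counter its own line.

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Unpadded layout: both counters share one cache line, so a write by either
// thread invalidates the line in the other thread's view (false sharing).
struct SharedCounters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Padded layout: alignas(64) rounds each counter up to its own cache line,
// so the two threads never write to a common line.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

int main() {
    static PaddedCounter counters[2];
    auto work = [](PaddedCounter& c) {
        for (long i = 0; i < 10000000; ++i)
            c.value.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t0(work, std::ref(counters[0]));
    std::thread t1(work, std::ref(counters[1]));
    t0.join();
    t1.join();
    return 0;
}
```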
Approaches to Using SMT
• Ignore it, and write single-threaded code.
• Naïve parallelism:
  • Pretend the logical CPUs are physical CPUs.
• SMT-aware parallelism:
  • Parallel threads designed to avoid SMT-related interference.
• Use one thread for the algorithm and another to manage resources:
  • E.g., to avoid stalls due to cache misses.
Naïve Parallelism
• Treat the SMT processor as if it were multi-core.
• Databases are already designed to use multiple processors, so no code modification is needed.
• Uses shared processor resources inefficiently:
  • Cache pollution / interference
  • Competition for execution units
SMT-Aware Parallelism
• Exploit intra-operator parallelism:
  • Divide the input and use a separate thread to process each part.
  • E.g., one thread for even tuples, one for odd tuples.
  • No explicit partitioning step is required.
• Sharing the input means multiple readers:
  • No MOMC events, because two reads don't conflict.
SMT-Aware Parallelism (cont.)
• Sharing the output is challenging:
  • Thread coordination for output
  • Read/write and write/write conflicts on common cache lines (MOMC events)
• "Solution": partition the output (see the sketch below).
  • Each thread writes to a separate memory buffer to avoid memory conflicts.
  • An extra merge step is needed in the consumer of the output stream.
  • It is difficult to maintain input order in the output.
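A minimal sketch of the intra-operator scheme just described, under assumptions not in the slides: a hypothetical filter operator over an array of tuples, C++ threads standing in for the logical CPUs, and a trivial concatenation as the merge step.

```cpp
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

struct Tuple { std::int64_t key; std::int64_t payload; };

// Each thread scans alternating tuples of the shared, read-only input
// (reads on common cache lines do not conflict) and appends qualifying
// tuples to its own private output buffer, avoiding write/write conflicts.
void filterPartition(const std::vector<Tuple>& input,
                     std::size_t start, std::size_t stride,
                     std::vector<Tuple>& out) {
    for (std::size_t i = start; i < input.size(); i += stride)
        if (input[i].key % 100 == 0)              // hypothetical predicate
            out.push_back(input[i]);
}

int main() {
    std::vector<Tuple> input(1000000);
    for (std::size_t i = 0; i < input.size(); ++i)
        input[i] = {static_cast<std::int64_t>(i), 0};

    std::vector<Tuple> outEven, outOdd;           // one output buffer per thread
    std::thread even(filterPartition, std::cref(input), 0, 2, std::ref(outEven));
    std::thread odd (filterPartition, std::cref(input), 1, 2, std::ref(outOdd));
    even.join();
    odd.join();

    // The consumer must merge the partitions; here a simple concatenation,
    // which does not preserve the original input order.
    outEven.insert(outEven.end(), outOdd.begin(), outOdd.end());
    return 0;
}
```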
Managing Resources for SMT
• Cache misses are a well-known performance bottleneck for modern database systems:
  • Mainly L2 data cache misses, but also L1 instruction cache misses [Ailamaki et al. 98].
• Goal: use a "helper" thread to avoid cache misses in the "main" thread:
  • Load future memory references into the cache.
  • An explicit load, not a prefetch.
Data Dependency
• Memory references that depend upon a previous memory access exhibit a data dependency.
• E.g., a hash table lookup: hash bucket → overflow cells → tuple (sketched below).
[Figure: hash buckets pointing to chains of overflow cells, which in turn point to tuples]
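A sketch of the pointer-chasing pattern behind that example; the bucket and overflow-cell layout here is hypothetical, not the paper's actual hash table.

```cpp
#include <cstddef>
#include <cstdint>

struct Tuple { std::int64_t key; std::int64_t payload; };

// Overflow chain: each cell points to the next, so the address of the next
// load is not known until the previous load completes.
struct Cell { std::int64_t key; Tuple* tuple; Cell* next; };

struct HashTable {
    Cell** buckets;
    std::size_t mask;          // table size is a power of two
};

// Every dereference below depends on the one before it: bucket array entry,
// then overflow cell, then the next cell, then the tuple.  A cache miss at
// any step stalls the whole chain.
Tuple* probe(const HashTable& ht, std::int64_t key, std::size_t hash) {
    for (Cell* c = ht.buckets[hash & ht.mask]; c != nullptr; c = c->next)
        if (c->key == key)
            return c->tuple;
    return nullptr;
}
```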
Data Dependency (cont.)
• Data dependencies make instruction-level parallelism harder.
• Modern architectures provide prefetch instructions (see the sketch below):
  • Request that data be brought into the cache.
  • Non-blocking.
• Pitfalls:
  • Prefetch instructions are frequently dropped.
  • Difficult to tune.
  • Too much prefetching can pollute the cache.
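For concreteness, a sketch of how such a prefetch hint is typically issued; `__builtin_prefetch` is the GCC/Clang intrinsic (the slides do not name a specific instruction), and the prefetch distance of 8 is an arbitrary tuning assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum the values behind a list of pointers, prefetching a few iterations
// ahead.  The prefetch is only a hint: it never blocks, and the hardware
// is free to drop it.
std::int64_t sumWithPrefetch(const std::vector<const std::int64_t*>& slots) {
    constexpr std::size_t kDistance = 8;          // assumed prefetch distance
    std::int64_t total = 0;
    for (std::size_t i = 0; i < slots.size(); ++i) {
        if (i + kDistance < slots.size())
            __builtin_prefetch(slots[i + kDistance], /*rw=*/0, /*locality=*/1);
        total += *slots[i];                       // the actual (blocking) load
    }
    return total;
}
```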
Staging Computation
[Figure: a hash probe visits A (hash bucket), B (overflow cell), C (tuple); each element is assumed to be a cache line]
• Preload A. (other work) Process A.
• Preload B. (other work) Process B.
• Preload C. (other work) Process C.
• General pattern per element: Preload tuple. (other work) Process tuple.
Staging Computation (cont.)
• By overlapping memory latency with other work, some cache-miss latency can be hidden (see the sketch below).
• Many probes are "in flight" at the same time.
• Algorithms need to be rewritten.
• E.g., Chen et al. [2004], Harizopoulos et al. [2004].
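A sketch of the staged idea in the spirit of those papers, not their exact algorithms: probes are handled in groups, with a preload stage issued for the whole group before the process stage, so the memory latencies overlap with work on other probes. The cell layout and group interface are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Cell { std::int64_t key; Cell* next; };   // simplified chain cell

// Stage 1: preload the head cell of every bucket touched by a group of
// probes.  Stage 2: walk the chains.  While stage 2 runs, the loads issued
// in stage 1 for later probes are still completing, so their latency is
// overlapped with useful work.
std::int64_t probeGroup(Cell* const* buckets, std::size_t mask,
                        const std::int64_t* keys, const std::size_t* hashes,
                        std::size_t groupSize) {
    std::vector<Cell*> heads(groupSize);
    for (std::size_t i = 0; i < groupSize; ++i) {     // stage 1: preload
        heads[i] = buckets[hashes[i] & mask];
        __builtin_prefetch(heads[i]);
    }
    std::int64_t matches = 0;
    for (std::size_t i = 0; i < groupSize; ++i)       // stage 2: process
        for (Cell* c = heads[i]; c != nullptr; c = c->next)
            if (c->key == keys[i])
                ++matches;
    return matches;
}
```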
Work-Ahead Set: Main Thread
• Writes a memory address + computation state to the work-ahead set.
• Retrieves a previous address + state (sketched after the figure below).
• Hope: the helper thread preloads the data before the main thread retrieves it.
• Correct whether or not the helper thread succeeds at preloading the data:
  • The helper thread is read-only.
[Figure: the work-ahead set is a fixed-size array of (state, address) entries; the main thread posts its next work item into a slot and takes back the entry that was stored there previously]
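A sketch of the work-ahead set and the main thread's post-and-retrieve step. Only the 128-entry size comes from the slides; the circular-array layout with relaxed atomics is an illustrative stand-in for the paper's lock-free design.

```cpp
#include <atomic>
#include <cstddef>

// The work-ahead set: a fixed-size circular array of (state, address)
// entries.  "address" is memory that the next step of a work item will
// dereference; "state" records which step that is.
constexpr std::size_t kSetSize = 128;        // size used in the experiments
std::atomic<int>   slotState[kSetSize];
std::atomic<void*> slotAddress[kSetSize];

// Main thread: post the next work item into slot i and take back whatever
// was posted there kSetSize iterations earlier.  By then the helper thread
// has (hopefully) touched that address, so processing it hits the cache.
// Correctness never depends on the helper: if the preload did not happen,
// the main thread simply absorbs the cache miss itself.
void postAndRetrieve(std::size_t i, int state, void* addr,
                     int& prevState, void*& prevAddr) {
    prevState = slotState[i].load(std::memory_order_relaxed);
    prevAddr  = slotAddress[i].load(std::memory_order_relaxed);
    slotState[i].store(state, std::memory_order_relaxed);
    slotAddress[i].store(addr, std::memory_order_relaxed);
}
```

The main thread would call postAndRetrieve once per step, advancing i = (i + 1) mod kSetSize, and then resume the retrieved work item at prevState using prevAddr.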
Work-Ahead Set: Helper Thread
• Reads memory addresses from the work-ahead set and loads their contents:
  • The data becomes cache resident.
• Tries to preload the data before the main thread cycles around:
  • If successful, the main thread experiences cache hits.
[Figure: the helper thread scans the (state, address) entries and touches each address with an explicit load, e.g. "temp += *slot[i]"]
Iterate Backwards!
[Figure: the helper thread walks the work-ahead set in the opposite direction to the main thread, setting i = (i - 1) mod size each step]
• Why? See the paper. (A sketch of the helper loop follows.)
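A matching sketch of the helper thread loop, mirroring the slide's `temp += *slot[i]` snippet. The `slotAddress` array repeats the declarations from the previous sketch so this block stands alone, the `done` flag is hypothetical, and the volatile accesses are just one way to keep the compiler from eliding the loads.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kSetSize = 128;
std::atomic<void*> slotAddress[kSetSize];    // written by the main thread
std::atomic<bool>  done{false};              // set by the main thread when it finishes

// Helper thread: walk the set in the opposite direction to the main thread
// and touch each posted address, so the data is cache resident before the
// main thread cycles back around to that slot.
void helperThread() {
    volatile char sink = 0;                  // keeps the loads from being elided
    std::size_t i = 0;
    while (!done.load(std::memory_order_relaxed)) {
        i = (i + kSetSize - 1) % kSetSize;   // i = (i - 1) mod size
        void* addr = slotAddress[i].load(std::memory_order_relaxed);
        if (addr != nullptr)
            sink += *static_cast<volatile char*>(addr);   // explicit load, not a prefetch
    }
}
```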
Helper Thread Speed
• If the helper thread is faster than the main thread:
  • There is more computation than memory latency.
  • The helper thread should not preload an entry twice (wasted CPU cycles).
  • See the paper for how to stop redundant loads.
• If the helper thread is slower:
  • No special tuning is necessary.
  • The main thread will absorb some cache misses.
Work-Ahead Set Size
• Too large: cache pollution.
  • Preloaded data evicts other preloaded data before it can be used.
• Too small: thread contention.
  • Many MOMC events, because the work-ahead set spans few cache lines.
• Just right: experimentally determined.
  • Use the smallest size within the acceptable range (performance plateaus), so that cache space is available for other purposes (for us, 128 entries).
  • The data structure itself is much smaller than the L2 cache.
Experimental Workload
• Two operators:
  • Probe phase of hash join
  • CSB+-tree index join
• Operators run in isolation and in parallel.
• Intel VTune is used to measure hardware events.
Experimental Outline
• Hash join
• Index lookup
• Mixed: hash join and index lookup
Parallel Operator Performance
[Chart: performance improvements of 20%, 52%, and 55%]
Parallel Operator Performance
[Chart: performance improvements of 26% and 29%]