
Improving Database Performance on Simultaneous Multithreading Processors


Presentation Transcript


  1. Improving Database Performance on Simultaneous Multithreading Processors Jingren Zhou Microsoft Research jrzhou@microsoft.com John Cieslewicz Columbia University johnc@cs.columbia.edu Kenneth A. Ross Columbia University kar@cs.columbia.edu Mihir Shah Columbia University ms2604@columbia.edu

  2. Simultaneous Multithreading (SMT) • Available on modern CPUs: • “Hyperthreading” on Pentium 4 and Xeon • IBM POWER5 • Sun UltraSparc IV • Challenge: Design software to efficiently utilize SMT. • This talk: Database software [Image: Intel Pentium 4 with Hyperthreading]

  3. Superscalar Processor (no SMT) [Figure: one instruction stream executing over time on a superscalar pipeline (up to 2 instructions/cycle); CPI = 3/4] • Improved instruction level parallelism

  4. SMT Processor [Figure: two interleaved instruction streams executing over time; CPI = 5/8] • Improved thread level parallelism • More opportunities to keep the processor busy • But sometimes SMT does not work so well

  5. Stalls [Figure: two instruction streams over time; one stalls while the other continues to make progress; CPI = 3/4] • Progress despite a stalled thread • Stalls due to cache misses (200-300 cycles for an L2 miss), branch mispredictions (20-30 cycles), etc.

  6. Memory Consistency [Figure: two instruction streams; the processor detects conflicting accesses to a common cache line, flushes the pipeline, and syncs the cache with RAM] • “MOMC Event” on Pentium 4 (300-350 cycles)

  7. SMT Processor • Exposes multiple “logical” CPUs (one per instruction stream) • One physical CPU (~5% extra silicon to duplicate thread state information) • Better than single threading: • Increased thread-level parallelism • Improved processor utilization when one thread blocks • Not as good as two physical CPUs: • CPU resources are shared, not replicated

  8. SMT Challenges • Resource Competition • Shared Execution Units • Shared Cache • Thread Coordination • Locking, etc. has high overhead • False Sharing • MOMC Events
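To make the false-sharing hazard concrete, here is a minimal C++ sketch (illustrative only, not from the paper; struct and variable names are hypothetical). Two threads update independent counters: if the counters sit on the same cache line, every write forces a coherence event of the kind the Pentium 4 reports as an MOMC event, while padding each counter to its own line removes the conflict.

```cpp
#include <cstdint>
#include <thread>

// Counters that share one 64-byte cache line: each thread's writes
// invalidate the other thread's copy of the line (false sharing).
struct SharedLine {
    std::uint64_t a;   // written by thread 1
    std::uint64_t b;   // written by thread 2, same cache line as 'a'
};

// Each counter aligned to its own cache line: no common line is written.
struct PaddedLines {
    alignas(64) std::uint64_t a;
    alignas(64) std::uint64_t b;
};

template <typename Counters>
void bump_in_parallel(Counters& c, std::uint64_t iters) {
    std::thread t1([&] { for (std::uint64_t i = 0; i < iters; ++i) c.a += 1; });
    std::thread t2([&] { for (std::uint64_t i = 0; i < iters; ++i) c.b += 1; });
    t1.join();
    t2.join();
}

int main() {
    SharedLine  shared{};   // slow: repeated cache-line conflicts
    PaddedLines padded{};   // fast: threads never write a common line
    bump_in_parallel(shared, 100000000ULL);
    bump_in_parallel(padded, 100000000ULL);
    return 0;
}
```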

  9. Approaches to using SMT • Ignore it, and write single threaded code. • Naïve parallelism • Pretend the logical CPUs are physical CPUs • SMT-aware parallelism • Parallel threads designed to avoid SMT-related interference • Use one thread for the algorithm, and another to manage resources • E.g., to avoid stalls for cache misses

  10. Naïve Parallelism • Treat the SMT processor as if it were multiple physical processors • Databases are already designed to utilize multiple processors - no code modification needed • Uses shared processor resources inefficiently: • Cache Pollution / Interference • Competition for execution units

  11. SMT-Aware Parallelism • Exploit intra-operator parallelism • Divide input and use a separate thread to process each part • E.g., one thread for even tuples, one for odd tuples. • Explicit partitioning step not required. • Sharing input involves multiple readers • No MOMC events, because two reads don’t conflict

  12. SMT-Aware Parallelism (cont.) • Sharing output is challenging • Thread coordination for output • read/write and write/write conflicts on common cache lines (MOMC Events) • “Solution:” Partition the output • Each thread writes to separate memory buffer to avoid memory conflicts • Need an extra merge step in the consumer of the output stream • Difficult to maintain input order in the output
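As a rough sketch of this SMT-aware scheme (illustrative C++, not the authors' implementation; `Tuple` and `qualifies` are stand-ins): both threads read the shared input, one takes the even-numbered tuples and the other the odd ones, and each appends to its own output buffer so no output cache line is written by both threads. The consumer performs the extra merge step.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Tuple { int key; int payload; };

// Hypothetical predicate standing in for the operator's real work.
static bool qualifies(const Tuple& t) { return t.key % 10 == 0; }

// One SMT thread: scan the shared, read-only input, handling only tuples
// whose index matches this thread's parity, and write results to a private
// buffer (no write/write or read/write conflicts on common cache lines).
static void scan_partition(const std::vector<Tuple>& input, std::size_t parity,
                           std::vector<Tuple>& out) {
    for (std::size_t i = parity; i < input.size(); i += 2)
        if (qualifies(input[i])) out.push_back(input[i]);
}

std::vector<Tuple> smt_aware_scan(const std::vector<Tuple>& input) {
    std::vector<Tuple> out_even, out_odd;   // per-thread output buffers
    std::thread even(scan_partition, std::cref(input), 0, std::ref(out_even));
    std::thread odd (scan_partition, std::cref(input), 1, std::ref(out_odd));
    even.join();
    odd.join();

    // Extra merge step in the consumer; input order is not preserved.
    out_even.insert(out_even.end(), out_odd.begin(), out_odd.end());
    return out_even;
}
```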

  13. Managing Resources for SMT • Cache misses are a well-known performance bottleneck for modern database systems • Mainly L2 data cache misses, but also L1 instruction cache misses [Ailamaki et al 98]. • Goal: Use a “helper” thread to avoid cache misses in the “main” thread • load future memory references into the cache • explicit load, not a prefetch

  14. Data Dependency • Memory references that depend upon a previous memory access exhibit a data dependency • E.g., lookup in a hash table [Figure: a tuple is reached by following hash buckets and overflow cells; each access depends on the previous one]

  15. Data Dependency (cont.) • Data dependencies make instruction level parallelism harder • Modern architectures provide prefetch instructions. • Request that data be brought into the cache • Non-blocking • Pitfalls: • Prefetch instructions are frequently dropped • Difficult to tune • Too much prefetching can pollute the cache
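For comparison with the explicit loads used later, a prefetch-based probe might look like the hedged sketch below (x86 `_mm_prefetch` intrinsic; `Cell`, `probe_batch`, and the distance `PF_DIST` are made-up names and tuning values, not from the paper). The prefetch is only a hint: the hardware may drop it, and an overly aggressive distance pollutes the cache.

```cpp
#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0
#include <cstddef>

struct Cell { int key; Cell* next; };   // hash bucket header / overflow cell

constexpr std::size_t PF_DIST = 8;      // illustrative prefetch distance

// Probe a batch of bucket chains, prefetching the chain head PF_DIST
// iterations ahead of the one currently being processed.
long probe_batch(Cell* const* heads, const int* keys, std::size_t n) {
    long matches = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + PF_DIST < n)
            _mm_prefetch(reinterpret_cast<const char*>(heads[i + PF_DIST]),
                         _MM_HINT_T0);                  // non-blocking hint
        for (const Cell* c = heads[i]; c != nullptr; c = c->next)
            matches += (c->key == keys[i]);             // walk the chain
    }
    return matches;
}
```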

  16. Staging Computation [Figure: hash buckets, overflow cells, and a tuple labeled A, B, C; assumes each element is a cache line] • Preload A. • (other work) • Process A. • Preload B. • (other work) • Process B. • Preload C. • (other work) • Process C. • Preload Tuple. • (other work) • Process Tuple.

  17. Staging Computation (cont.) • By overlapping memory latency with other work, some cache miss latency can be hidden. • Many probes “in flight” at the same time. • Algorithms need to be rewritten. • E.g., Chen et al. [2004] and Harizopoulos et al. [2004].
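A minimal sketch of group-based staging under simplifying assumptions (illustrative only; `Cell`, `GROUP`, and the explicit-read trick are stand-ins, not the paper's code): the probe loop is rewritten so that the first cells of a whole group of probes are loaded before any of them is processed, letting their cache misses overlap with each other and with the processing work.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Cell { int key; Cell* next; };   // hash bucket header / overflow cell

constexpr std::size_t GROUP = 16;       // illustrative group size

long staged_probe(const std::vector<Cell*>& first_cell,
                  const std::vector<int>& probe_key) {
    long matches = 0;
    volatile int sink = 0;              // keeps the preloading reads alive

    for (std::size_t base = 0; base < first_cell.size(); base += GROUP) {
        const std::size_t end = std::min(base + GROUP, first_cell.size());

        // Stage 1: preload, i.e. explicitly read the first cell of every
        // probe in the group so their misses are outstanding together.
        for (std::size_t i = base; i < end; ++i)
            if (first_cell[i] != nullptr) sink = first_cell[i]->key;

        // Stage 2: process; the first cells should now be cache resident.
        // Deeper overflow cells could be staged the same way.
        for (std::size_t i = base; i < end; ++i)
            for (const Cell* c = first_cell[i]; c != nullptr; c = c->next)
                matches += (c->key == probe_key[i]);
    }
    return matches;
}
```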

  18. Work-Ahead Set: Main Thread • Writes memory address + computation state to the work-ahead set • Retrieves a previous address + state • Hope that helper thread can preload data before retrieval by the main thread • Correct whether or not helper thread succeeds at preloading data • helper thread is read-only
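One way to picture the main thread's side of this (a hypothetical sketch, assuming the work-ahead set is a fixed-size circular array of (state, address) slots; the real structure and the encoding of computation state are in the paper):

```cpp
#include <cstddef>

// One work-ahead set slot: the stage a suspended probe has reached, plus the
// memory address that stage will dereference when it resumes.
struct Slot {
    int   state;     // e.g. 1 = about to read a bucket header, 2 = an overflow cell
    void* address;   // data the main thread will need at that point
};

constexpr std::size_t WA_SIZE = 128;   // size the talk reports using
Slot work_ahead[WA_SIZE];

// The main thread parks its current (state, address) in a slot and resumes the
// computation that was parked there one lap earlier. If the helper thread has
// preloaded *resumed.address by now, the resume hits in cache; if not, the
// result is still correct and the main thread simply absorbs the miss.
Slot post_and_resume(std::size_t i, Slot current) {
    Slot resumed = work_ahead[i];
    work_ahead[i] = current;
    return resumed;
}
```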

  19. Work-ahead Set Data Structure [Figure: an array of (state, address) slots, (1, A) through (1, F), filled by the Main Thread]

  20. Work-ahead Set Data Structure [Figure: as the Main Thread cycles through the slots, it swaps in new entries (addresses G through L, states 1 or 2) in place of the old ones]

  21. Work-Ahead Set: Helper Thread • Reads memory addresses from the work-ahead set, and loads their contents • Data becomes cache resident • Tries to preload data before main thread cycles around • If successful, main thread experiences cache hits

  22. Work-ahead Set Data Structure [Figure: the Helper Thread walks the (state, address) slots and issues an explicit load for each posted address: “temp += *slot[i]”]

  23. Iterate Backwards! [Figure: the Helper Thread steps through the work-ahead set in the opposite direction of the Main Thread, using i = (i-1) mod size] Why? See paper.
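The helper thread's loop might then look like the sketch below (illustrative code reusing the hypothetical slot layout above; the explicit read mirrors the slide's `temp += *slot[i]`, and the index update is the backward iteration of slide 23, whose rationale is in the paper):

```cpp
#include <atomic>
#include <cstddef>

struct Slot { int state; void* address; };   // same layout as the sketch above

constexpr std::size_t WA_SIZE = 128;
Slot work_ahead[WA_SIZE];                    // written only by the main thread
std::atomic<bool> done{false};               // set by the main thread when finished

// Helper thread: walk the work-ahead set backwards and issue an explicit,
// blocking read of each posted address so the data becomes cache resident.
// The helper never writes shared state, so even racy or late reads cannot
// affect correctness; at worst the main thread takes the cache miss itself.
void helper_thread() {
    volatile long temp = 0;                  // sink that keeps the loads alive
    std::size_t i = 0;
    while (!done.load(std::memory_order_relaxed)) {
        i = (i + WA_SIZE - 1) % WA_SIZE;     // iterate backwards, wrapping around
        void* p = work_ahead[i].address;
        if (p != nullptr)
            temp += *static_cast<const volatile char*>(p);   // explicit load, not a prefetch
    }
}
```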

  24. Helper Thread Speed • If the helper thread is faster than the main thread: • More computation than memory latency • Helper thread should not preload twice (wasted CPU cycles) • See paper for how to stop redundant loads • If the helper thread is slower: • No special tuning necessary • Main thread will absorb some cache misses

  25. Work-Ahead Set Size • Too Large: Cache Pollution • Preloaded data evicts other preloaded data before it can be used • Too Small: Thread Contention • Many MOMC events because work-ahead set spans few cache lines • Just Right: Experimentally determined • But use the smallest size within the acceptable range (performance plateaus), so that cache space is available for other purposes (for us, 128 entries) • Data structure itself much smaller than L2 cache

  26. Experimental Workload • Two Operators: • Probe phase of Hash Join • CSB+ Tree Index Join • Operators run in isolation and in parallel • Intel VTune used to measure hardware events

  27. Experimental Outline • Hash join • Index lookup • Mixed: Hash join and index lookup

  28. Hash Join: Comparative Performance

  29. Hash Join: L2 Cache Misses Per Tuple

  30. CSB+ Tree Index Join: Comparative Performance

  31. CSB+ Tree Index Join: L2 Cache Misses Per Tuple

  32. Parallel Operator Performance [chart; annotations: 20%, 52%, 55%]

  33. Parallel Operator Performance [chart; annotations: 26%, 29%]

  34. Conclusion
