1 / 23

Motivation & Related Work

Asynchronous Parallel Disk Sorting Roman Dementiev and Peter Sanders Max-Planck-Institut f ü r Informatik, Saarbr ücken, Germany Thanks to: A. Crauser, D. Hutchinson, L. Kettner, S. Mitra, A. Morton, N. Rajput, and J. Vitter. Motivation & Related Work. Sorting is important for EM problems

Download Presentation

Motivation & Related Work

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Asynchronous Parallel Disk SortingRoman Dementiev and Peter SandersMax-Planck-Institut für Informatik,Saarbrücken, GermanyThanks to: A. Crauser, D. Hutchinson, L. Kettner, S. Mitra, A. Morton, N. Rajput, and J. Vitter R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  2. Motivation & Related Work • Sorting is important for EM problems • Disks are cheaper than computers • Prefetch buffers for load disk balancing and overlapping have been studied but no results which guarantee • overlapping during merging of runs • asymptotically optimal I/O volume • Here:algorithm & implementation • I/O cost lower bound • guarantees almost perfect overlapping between I/O and computation R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  3. Motivation & Related Work • [ChaudhryCormen’01’02] impl. of distributed memory, parallel disk • Theory: • Implementations: Differences: • Optimal prefetching algorithm • Overlapping of I/O and computation • Completely asynchronous implementation • Part of <stxxl> library, compatible with STL. Use as simple as: stxxl::vector<my_type> v; long long int i=vec_size; // ‘long long int’ is a 64-bit data type while(i--) v.push_back(f(i)); // fill vector with some values stxxl::sort(v.begin(), v.end()); // sort // check order using STL predicate assert(std::is_sorted(v.begin(), v.end())); [BarvGrovVit’97] [SandEgnKorst’00] [HutchVitt’01] [HutchSandVitt’02] [BarveVitter’99] Here [DementievSanders’03] R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  4. elements O(M) 1 D-head disk 2 input: chunk 0 chunk 1 chunk 2 chunk 3 chunk 4 chunk 5 chunk 6 chunk 7 chunk 8 … 3 D-head model k=O(M/B) k 4 D prefetch buffers D write buffers merge merge buffers ….. run 0 run 1 run 2 run 3 run 4 run 5 run 6 run 7 k D blocks merging buffers buffers B B B B B B B B k-merger k-merger B B run 0123 run 4567 … External Multi-way Merge Sort N – input size M – main memory size B – block size D – number of disks These algorithms do not guarantee overlapping R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  5. Multi-way Merge Sort with Overlapped I/Os Theorem 1: If I/O and computation can be overlapped, N elements can be sorted in time where is the time needed for run formation, is the time needed for merging k=(M/B) : # sequences, k’ = O(N/M): total # runs. R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  6. Overlapping Run Formation • Easy: Corollary 2: Input of size N => sorted runs of size ~ M/2 in time 2M in the paper run numbers time 3 1 2 4 k-1 k ... read sort write 2 3 4 k-1 k I/O bound case ... 1 1 2 3 4 k-1 k ... 1 2 3 4 k-1 k read sort write compute bound case 1 2 3 4 k-1 k 1 2 3 4 k-1 k R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  7. sort pairs <trigger,block_id> …. prediction sequence  : …. k-merger Simple External Merging • Smallest element of a block is a trigger merger k sorted runs …. R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  8. Overlapping I/O and Merging • Prediction of delay between issuing two reads is not easy: 1B-12 3B-14 5B-16 … merger 1B-12 3B-14 5B-16 … k … 1B-12 3B-14 5B-16 … R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  9. elements 1 D-head disk 2 3 4 D prefetch buffers D write buffers merge merge buffers ….. k D blocks merging elements 1 D-head disk 2 3 k+3D overlap buffers 4 D prefetch buffers 2D write buffers merge merge buffers ….. k D blocks merging Overlapping I/O and Merging • Prediction of delay between issuing two reads is not easy: • Solution: 2 3B-14 5B-16 … merger 2 3B-14 5B-16 … k … 2 3B-14 5B-16 … R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  10. Overlapping I/O and Merging [contd.] • Theorem 4:Merging k sequences with a total of N’ elements takes time • ℓ: time to produce one element of output • L: time to input or output D arbitrary blocks • Our I/O thread strategy: •  DB elements in write buffer => output step • < DB elements in write buffer AND D overlap buffers avail. => input step • To prove Theorem 4 we distinguish two cases • I/O bound case: 2L  DBℓ • Compute bound case: 2L< DBℓ R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  11. Compute Bound Case r kB+3DB • Lemma 6:In compute bound case after k/D+1 steps, the merging thread never blocks until all elements are merged. • Notation: •  # elem. in write buffer • r # of elem. in the overlap and merge buffers • Our I/O thread strategy: •  DB elements in write buffer => output step • < DB elements in write buffer AND D overlap buffers avail. => input step I/O blocking kB+2DB output fetch kB+DB kB+2 L/ℓ kB+y L/ℓ merge thread blocked kB  0 DB 2DB- L/ℓ 2DB R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  12. I/O Bound Case • Lemma 7:In I/O bound case the I/O thread never blocks until all input blocks are fetched. • Our I/O thread strategy (the same): •  DB elements in write buffer => output step • < DB elements in write buffer AND D overlap buffers avail. => input step • Proof: similar to compute bound case r kB+3DB kB+2DB+ L/ℓ blocking kB+2DB output kB+DB+ L/ℓ fetch kB+DB  kB+ L/ℓ 0 DB- L/ℓ DB 2DB- L/ℓ 2DB R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  13. elements … 1 … read buffers 2 … 3 k+O(D) overlap buffers … 4 O(D) prefetch buffers 2D write buffers merge ….. … merge buffers k D blocks merging -ping disk scheduling overlap- Multi-head Model  Multi-disk Model • Randomized Cyclic Allocation [VitHut’01] makes accesses in  balanced • Optimal prefetching from [HutchSanVitt’01,SanEgnKor’00] • Corollary 8:For any  > 0, prefetch buffer of size m=(D/) merging k sequences with a total N’ elements can be implemented to run in time R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  14. Applications Sorting, Priority queues, Search trees STL + more Caching, Mem. management Disk scheduling POSIX threads, File system OS Implementation Issues • Implementation (part of <stxxl> library) • Forming runs: Key sorting, efficient two passes MSD radix sort • Multi-way merging: [Sanders’00] • prefetch buffer + overlap buffer = read buffer • Asynchronous I/O (POSIX threads) • No superfluous copying • User blocks are transferred by DMA (O_DIRECT flag) • Buffers are passed by pointer between pools <stxxl> structure R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  15. Hardware 3000 Euro, 375 MByte/s, July 2002 R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  16. Single Disk Performance • LEDA-SM, TPIE • 2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 –O6 • I/O bandwidth 45.4 Mbyte/s = 95% of peak bandwidth of the disk R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  17. Multiple Disk Performance • 2 GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 –O6 • TPIE, LEDA-SM • Linux Soft-RAID 8X128KByte stripes • Bandwidth of 315 MByte/s R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  18. Element Size • 16GB volume, 32-bit keys, runs of size 256 MB, g++ 3.2 –O6 •  64  merging is I/O bound •  128  run formation is I/O bound • Small elements require special treatment R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  19. Optimal Prefetching 1 read buffers 2 3 k+O(D) overlap buffers 4 2D write buffers merge ….. … merge buffers k D blocks merging 1 read buffers 2 3 k+O(D) overlap buffers 4 O(D) prefetch buffers 2D write buffers merge ….. … merge buffers k D blocks merging R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  20. Block Size • Block sizes of several MB => good performance • Leave room for read and write buffers • Naive striping implies factor 8 block size reduction  Dramatic run time increase R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  21. Large Inputs • One-pass, curves go up because of • Slower zones • Seek times R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  22. Summary • Parallel disk sorting with high performance on state of art hardware with theoretical performance guarantees • Bandwidth is no longer a limiting factor for external memory algorithms • Price performance ratio can improve by adding disks R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

  23. ? R. Dementiev and P. Sanders: Asynchronous Parallel Disk Sorting

More Related