1 / 33

CSE 326: Data Structures: Sorting

CSE 326: Data Structures: Sorting. Lecture 17: Wednesday, Feb 19, 2003. Today. Bucket sort (review) Radix sort Merge sort for external memory. Lower Bound. Recall: Any sorting algorithm based on comparisons requires (n log n) time

erwinm
Download Presentation

CSE 326: Data Structures: Sorting

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CSE 326: Data Structures: Sorting Lecture 17: Wednesday, Feb 19, 2003

  2. Today • Bucket sort (review) • Radix sort • Merge sort for external memory

  3. Lower Bound Recall: • Any sorting algorithm based on comparisons requires (n log n) time • More precisely: there exists a “bad input” for that algorithm, on which it takes (n log n) • The theorem can be extended: the average running time is (n log n) • The next two algorithms (Bucket, Radix) apparently break this theorem !

  4. Bucket Sort • Now let’s sort in O(N) • Assume:A[0], A[1], …, A[N-1] {0, 1, …, M-1}M = not too big • Example: sort 1,000,000 person records on the first character of their last names: • Hence M = 128 (in practice: M = 27)

  5. Bucket Sort int bucketSort(Array A, int N) { for k = 0 to M-1 Q[k] = new Queue; for j = 0 to N-1 Q[A[j]].enqueue(A[j]); Result = new Queue; for k = 0 to M-1 Result = Result.append(Q[k]); return Result; } Stablesorting !

  6. Bucket Sort • Running time: O(M+N) • Space: O(M+N) • Recall that M << N, hence time = O(N) • What about the Theorem that says sorting takes (N log N) ?? This is not based on key comparisons, insteadexploits the fact that keysare small

  7. Radix Sort • I still want to sort in time O(N): non-trivial keys • A[0], A[1], …, A[N-1] are strings • Very common in practice • Each string is:cd-1cd-2…c1c0, where c0, c1, …, cd-1{0, 1, …, M-1}M = 128 • Other example: decimal numbers

  8. RadixSort • Radix = “The base of a number system” (Webster’s dictionary) • alternate terminology: radix is number of bits needed to represent 0 to base-1; can say “base 8” or “radix 3” • Used in 1890 U.S. census by Hollerith • Idea: BucketSort on each digit, bottom up.

  9. The Magic of RadixSort • Input list: 126, 328, 636, 341, 416, 131, 328 • BucketSort on lower digit:341, 131, 126, 636, 416, 328, 328 • BucketSort result on next-higher digit:416, 126, 328, 328, 131, 636, 341 • BucketSort that result on highest digit:126, 131, 328, 328, 341, 416, 636

  10. Inductive Proof that RadixSort Works • Keys: d-digit numbers, base B • (that wasn’t hard!) • Claim: after ith BucketSort, least significant i digits are sorted. • Base case: i=0. 0 digits are sorted. • Inductive step: Assume for i, prove for i+1. Consider two numbers: X, Y. Say Xi is ith digit of X: • Xi+1< Yi+1 then i+1th BucketSort will put them in order • Xi+1> Yi+1 , same thing • Xi+1= Yi+1 , order depends on last i digits. Induction hypothesis says already sorted for these digits because BucketSort is stable

  11. Radix Sort int radixSort(Array A, int N) { for k = 0 to d-1 A = bucketSort(A, on position k) } Running time: T = O(d(M+N)) = O(dN) = O(Size)

  12. Radix Sort A= A= A=

  13. Running time of Radixsort • N items, d digit keys of max value M • How many passes? • How much work per pass? • Total time?

  14. Running time of Radixsort • N items, d digit keys of max value M • How many passes? d • How much work per pass? N + M • just in case M>N, need to account for time to empty out buckets between passes • Total time? O( d(N+M) )

  15. Radix Sort • What is the size of the input ? Size = dN • Radix sort takes time O(Size) !!

  16. Radix Sort • Variable length strings: • Can adapt Radix Sort to sort in time O(Size) ! • What about our Theorem ??

  17. Radix Sort • Suppose we want to sort N distinct numbers • Represent them in decimal: • Need d=log N digits • Hence RadixSort takes time O(Size) = O(dN) = O(N log N) • The total Size of N keys is O(N log N) ! • No conflict with theory 

  18. Sorting HUGE Data Sets • US Telephone Directory: • 300,000,000 records • 64-bytes per record • Name: 32 characters • Address: 54 characters • Telephone number: 10 characters • About 2 gigabytes of data • Sort this on a machine with 128 MB RAM… • Other examples?

  19. Merge Sort Good for Something! • Basis for most external sorting routines • Can sort any number of records using a tiny amount of main memory • in extreme case, only need to keep 2 records in memory at any one time!

  20. External MergeSort • Split input into two “tapes” (or areas of disk) • Merge tapes so that each group of 2 records is sorted • Split again • Merge tapes so that each group of 4 records is sorted • Repeat until data entirely sorted log N passes

  21. Sorting • Illustrates the difference in algorithm design when your data is not in main memory: • Problem: sort 8Gb of data with 8Mb of RAM • We know we can do it in O(n log n) time, but let’s see the number of disk I/O’s

  22. 2-Way Merge-sort:Requires 3 Buffers • Pass 1: Read a page, sort it, write it. • only one buffer page is used • Pass 2, 3, …, etc.: • three buffer pages used. Buffer size = 8Kb (typically) INPUT 1 OUTPUT INPUT 2 Main memory buffers Disk Disk

  23. 2-Way Merge-sort • A run = a sequence of sorted elements • Main property of 2-way merge: • If the minimum run length is L, then after merge the minimum run length is 2L • Initially: minimum run length = 1 (why ?) • After one pass = 2 • After two passes = 4 • . . .

  24. Two-Way External Merge Sort 6,2 2 Input file 9,4 8,7 5,6 3,1 3,4 PASS 0 1,3 2 1-page runs 2,6 4,9 7,8 5,6 3,4 • Each pass we read + write each page in file. • N pages in the file => the number of passes • So total cost is: • Improvement: start with larger runs • Sort 1GB with 1MB memory in 10 passes PASS 1 4,7 1,3 2,3 2-page runs 8,9 5,6 2 4,6 PASS 2 2,3 4,4 1,2 4-page runs 6,7 3,5 6 8,9 PASS 3 1,2 2,3 3,4 8-page runs 4,5 6,6 7,8 9

  25. 2-Way Merge-sort • Hence we need exactly log N passes through the entire data • How much is N ? N  106 (why ?) • Hence we need to read and write the entire 8GB data log(106) = 20 times ! • It takes about 1minute to read 1GB of data • even more • It takes at least160 minutes to sort = 3 hours

  26. 2-Way Merge-sort: less dumb • Use the 8Mb of main memory better ! • Initial step: Run formation • Read 8Mb of data in main memory • Sort (what algorithm would you use ?) • Write to disk • Now the runs are 8Mb after one pass ! • After subsequent passes the runs are • 2  8Mb, 4  8Mb, . . . • Need log(8Gb/8Mb) = log(103) = 10 passes • 1.5h instead of 3h have time to see a movie

  27. Can We Do Better ? • We have more main memory • Used it during all passes • Multiway merge: • Given M sorted sequence • Merge them

  28. Multiway Merge m At each step:select the smallestvalue among the M,store in the output What datastructure shouldwe use here ?

  29. Multiway Merge-Sort • Phase one: load M bytes in memory, sort • Result: runs of length M/B blocks M/B blocks . . . . . . Disk Disk M bytes of main memory B = size of one block = 8Kb typically

  30. Pass Two • Merge m = M/B runs into a new run • Runs have M/B (M/B – 1)  (M/B)2 blocks Input 1 . . . . . . Input 2 Output . . . . Input M/B Disk Disk M bytes of main memory

  31. Pass Three • Merge M/B – 1 runs into a new run • Runs have now M/R (M/B – 1)2 (M/B)3 blocks Input 1 . . . . . . Input 2 Output . . . . Input M/B Disk Disk M bytes of main memory

  32. Multiway Merge-Sort • Input file has N bytes • Need logM/B (N/B) = (log(N/B)) / (log(M/B)) complete passes over the file • N/B = 8Gb / 8Kb = 106 • M/B = 8Mb / 8Kb = 103 • Hence need log(106)/log(103) = 2 passes ! • 2 minutes ! Time for two movies

  33. Multiway Merge-Sort • With today’s main memories, we can sort almost any file in two passes • The file can have (M/B)2 blocks • XML Toolkit: • xsort sorts using multiway merge, in two passes

More Related