1 / 23

Smarter Sorts

Smarter Sorts. Beyond the Obvious. Jerry Le Breton ( Softscape Solutions) & Doug Lean (DHS). Sorting –The Obvious First. Why Sort ? “Data and information is almost always presented in a sorted or structured way”. Sorting - The Obvious First. proc sort data =claims;

Download Presentation

Smarter Sorts

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Smarter Sorts Beyond the Obvious Jerry Le Breton (Softscape Solutions) & Doug Lean (DHS) SAUSAG 69 – 20 Feb 2014

  2. Sorting –The Obvious First • Why Sort ? • “Data and information is almost always presented in a sorted or structured way”

  3. Sorting - The Obvious First procsortdata=claims; by claim client; Sort puts your records in order - BY the values of the variables you list. • Its important to know your data • How many variables • How many distinct data values for each SAUSAG 69 – 20 Feb 2014

  4. Sorting – Do You Need To? procsortdata=claims; by claim; Proc tabulate ...; classclaim; ... • Some PROCS do their own sorting: • TABULATE • MEANS • REPORT • SQL • (which can run out of memory for really big data sets) An unnecessary SORT SAUSAG 69 – 20 Feb 2014

  5. Sorting – Do You Need To? • Only use PROC SORT before REPORT, TABULATE, MEANS if there’s another reason later. • For PROC MEANS substitute BY with CLASS e.g. PROC MEANS NWAY; CLASS x y z; Is similar to PROC SORT; BY x y z; PROC MEANS; BY x y z; And saves significant time by avoiding the SORT SAUSAG 69 – 20 Feb 2014

  6. Sort Only What You Need Sort just the rows you want… procsortdata=claims out=Sorted_claims; where client =: 'A'; by claim; … and just the columns you want… procsortdata=claims(keep = c:) out=Sorted_claims; by claim; Leaving out unwanted rows and columns can produce dramatic performance improvements. SAUSAG 69 – 20 Feb 2014

  7. Sorting – Proc Sort vsProc SQL /* SORT Procedure */ procsortdata=claims; byclient claim; run; /* SQL Procedure */ procsql; create table claims as select * from claims order by client claim; quit; • Both will sort your data. • No significant performance difference. • Choose according to clarity, functional requirement and efficiency. • Make it as clear and simple as possible! SAUSAG 69 – 20 Feb 2014

  8. Sorted Status of a Data Set procsortdata=claims; by claim client; Sort status is saved as part of a SAS data set. Sort Information Sortedby CLAIM CLIENT Validated YES Character Set ANSI So SAS won’t waste time re-sorting if it’s already in the required order. SAUSAG 69 – 20 Feb 2014

  9. Setting Sorted Status of a Data Set dataclient_claims (sortedby = client ); merge clients claims; by client ; If you know a data set is sorted, say so with the SORTEDBY= option!. Sort Information Sortedby CLIENT Validated NO Character Set ANSI So SAS won’t waste time re-sorting later. SAUSAG 69 – 20 Feb 2014

  10. Presorted or Notsorted procsortdata=claims out=sorted presorted; by claim; • PRESORTED option for when data probably sorted! • SAS will check and only sort if necessary. procprintdata=grouped_claims; by claim NOTSORTED; • No need to sort if data is grouped BY the required variable – it doesn’t matter its NOTSORTED • (you just have to say so). SAUSAG 69 – 20 Feb 2014

  11. Sorting and Maintaining Order procsortdata=claims; by claim ; • By default, SAS maintains the original order of records within a BY group. procsortdata=claims noequals; by claim ; • Using the NOEQUALS option means SAS won’t necessarily retain the original ordering. • More efficient but, directly affects the results of using NODUPKEY SAUSAG 69 – 20 Feb 2014

  12. Sorting Duplicates procsortdata=claims out=no_duplicates nodupkey; by claim; NODUPKEY effectively keeps the first record of any duplicates. procsortdata=claims out=no_duplicates dupout=dups nodupkey; by claim; DUPOUT= puts the duplicates to a separate table. SAUSAG 69 – 20 Feb 2014

  13. Separating Unique & Duplicate Rows procsortdata=claimsout=sorted ; by claim; run; dataunique_claims dup_claims; set sorted; by claim; iffirst.claim and last.claimthen outputunique_claims; else outputdup_claims; run; It works, but needs an extra pass of the data. SAUSAG 69 – 20 Feb 2014

  14. Separating Unique & Duplicate Rows- the smarter way NOUNIQUEKEY ensures no records with a unique key are written to the OUT= table. procsortdata=claims out=duplicates uniqueout=uniques nouniquekey; by claim; run; …and the UNIQUEOUT= option directs the unique records to a separate table SAUSAG 69 – 20 Feb 2014

  15. Sorting – Case Insensitive procsortdata=names out=simply_sorted; by name; Upper case letters are before lower case in the ASCII collating sequence. data names2; set names; upcase_name = upcase(name); procsortdata=names2 out=upcase_sorted(keep=name); byupcase_name; Creating an upper (or lower) case copy of the variable is the old solution. SAUSAG 69 – 20 Feb 2014

  16. Sorting – Case Insensitive - Smarter SORTSEQ option specifies the collating sequence (ASCII/EBCDIC/other languages) or, LINGUISTIC option modifies the current collating sequence. procsortdata=names out=linguistic_sorted sortseq=linguistic; by name; The affect is to make the sort case insensitive. SAUSAG 69 – 20 Feb 2014

  17. Sorting – Case Insensitive – with SQL PROC SQL allows the use of functions in the Order By (and other) clauses. procsql; createtablesql_sortedas select * from names orderbyupcase(name); The result is different from Proc SORT using the sorteq=linguistic. SAUSAG 69 – 20 Feb 2014

  18. Sorting Out Spaces procsortdata=names out=simply_sorted; by name; A standard sort is obviously no use. datanames_temp; set names; temp_name = upcase(compress(name)); run; procsortdata=names_temp out=temp_sorted(keep=name); bytemp_name; Creating another variable for sorting, without spaces, is the old solution.

  19. Sorting Out Spaces procsql; createtablesql_sortedas select * from names orderbyupcase(compress(name)); Proc SQL can do it too. procsortdata=namesout=alt_handling_sorted sortseq = linguistic(alternate_handling = shifted); by name; Proc SORT can too! This sub-option of the LINGUISTIC sortseq option, effectively ignores spaces as well as being case-insensitive. SAUSAG 69 – 20 Feb 2014

  20. Sorting by Numbers Sorting text with numeric prefixes e.g. student id and name … procsortdata=students out=simply_sorted; by student; … results in nothing useful! SAUSAG 69 – 20 Feb 2014

  21. Sorting by Numbers datastudents_temp; set students; student_num = input(scan(student,1), 2.); run; procsortdata=students_temp out=temp_sorted(keep=student); bystudent_num; An extra data step can create a numeric variable to sort with (as can SQL of course) procsql; createtablesql_sortedas select * from students orderby input(scan(student,1), 2.); SAUSAG 69 – 20 Feb 2014

  22. Sorting by Numbers procsortdata=students out=num_collation_sorted sortseq = linguistic (numeric_collation=on); by student; The numeric_collation sub-option of the LINGUISTIC sortseq option, sorts by the numeric values that prefix the variable values. SAUSAG 69 – 20 Feb 2014

  23. Questions? • Did you learn something new from this presentation? SAUSAG 69 – 20 Feb 2014

More Related