A Framework for Debugging Load Imbalance in Multithreaded Execution (LIME)

Jungju Oh (Georgia Tech), Guru Venkataramani (GWU), Christopher Hughes (Intel), Milos Prvulovic (Georgia Tech)

Presentation Transcript


  1. A Framework for Debugging Load Imbalance in Multithreaded Execution (LIME) Jungju Oh (Georgia Tech), Guru Venkataramani (GWU), Christopher Hughes (Intel), Milos Prvulovic (Georgia Tech)

  2. INTRODUCTION The many-core era is NOW! Many parallel applications suffer from performance problems: scalability limiters. Before (the good old days, until the power wall): more performance if you buy a new processor. After: more performance only if you use the new cores, and that is not easy. Scalability limiters inhibit performance gains when adding cores, and load imbalance is perhaps the most common limiter. (The slide includes a processor-trend figure from Wikipedia.)

  3. LOAD IMBALANCE Load imbalance is an uneven distribution of work across cores. Ideally, a parallel application assigns an equal amount of work to each thread. In reality, some threads run more loop iterations, some execute an if-statement block, some take cache misses, some suffer branch mispredictions, and so on. Easy to detect, hard to fix! (The slide illustrates the serial workload split into ideal versus real per-thread portions, the difference being the load imbalance.)

  4. WHY SO DIFFICULT? Fixing imbalance means changing suspect lines one by one, with a huge amount of effort, and re-running to see whether it is fixed. The code is lengthy and the relevant activities are diverse (cache misses, branch prediction, ...). Too many things to consider! The slide shows the radix rank-merging code from SPLASH-2, one trial modification of it, and a long excerpt from barnes.

The original radix code (referenced again on later slides):
    534 BARRIER(...);
    535 if (MyNum != (...)) {
    540   while ((offset & 0x1) != 0) { ... }
    549   while ((offset & 0x1) != 0) { ... }
    557   for (i = 0; i < radix; i++) { ... }
    560 } else { ... }
    566 while ((offset & 0x1) != 0) { ... }
    575 for (i = 0; i < radix; i++) { ... }
    578 while (offset != 0) {
    579   if ((offset & 0x1) != 0) {
    582     for (i = 0; i < radix; i++) { ... }
    585   }
    589 }
    590 for (i = 1; i < radix; i++) { ... }
    594 if ((MyNum == 0) || (stats)) { ... }
    598 BARRIER(...);

A trial modification, with conditions and loop bounds changed one by one:
    535 if (MyNum == (...)) {
    540   while ((offset & 0x2) != 0) { ... }
    549   while ((offset & 0x3) != 0) { ... }
    557   for (i = 0; i < radix2; i++) { ... }
    566   while ((offset & 0x1) == 0) { ... }
    575   for (i = 0; i < radix3; i++) { ... }
    578   while (offset == 0) {
    579     if ((offset & 0x2) == 0) {
    582       for (i = 0; i < radix4; i++) { ... }
    585     }
    589   }
    590   for (i = 1; i < radix5; i++) { ... }
    594 if ((MyNum != 0) || (stats2)) { ... }

(The slide also shows a much longer excerpt from SPLASH-2 barnes, code.C lines 638-787: the barrier-delimited tree build, partitioning, ComputeForces(), body advancement with the min/max reduction under CountLock, and find_my_initial_bodies().)

  5. MOTIVATION Why should a human do this cumbersome inspection? Goal: let a framework automate it. Approach: estimate the relationship between each suspect and the load imbalance; the suspect with the strongest relationship may be the culprit.

  6. MOTIVATION The slide color-codes the radix excerpt into lines related to the imbalance, lines not related to it, and overlapped lines:
    534 BARRIER(...);
    535 if (MyNum != (...)) {
    540   while ((offset & 0x1) != 0) { ... }
    549   while ((offset & 0x1) != 0) { ... }
    557   for (i = 0; i < radix; i++) { ... }
    560 } else { ... }
    566 while ((offset & 0x1) != 0) { ... }
    575 for (i = 0; i < radix; i++) { ... }
    578 while (offset != 0) {
    579   if ((offset & 0x1) != 0) {
    582     for (i = 0; i < radix; i++) { ... }
    585   }
    589 }
    590 for (i = 1; i < radix; i++) { ... }
    594 if ((MyNum == 0) || (stats)) { ... }
    598 BARRIER(...);
The three groups called out on the slide are lines 535 and 579; lines 566, 575, and 578; and lines 540, 549, 557, and 582.

  7. LIME? LIME is A Framework for Debugging Load Imbalance in Multithreaded Execution: it does this automatically and quantitatively. Ingredients: imbalance information and event counts. The job: identify which EVENTS are related to the IMBALANCE.

  8. LIME INGREDIENTS Two inputs. (1) Imbalance information: in LIME, per-thread execution time is used to quantify the imbalance. (2) Events: control-flow decision points (how many decisions does each thread make?), variables with per-thread counts, and machine-interaction events (how does the code interact with the machine?). The slide illustrates these with the radix excerpt: the branches and loops at lines 535, 578, 579, 557, and 566, and a cache miss at line 583 (rank_ff_mynum[i] += l->ranks[i]).
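To make "variables with per-thread counts" concrete, here is a minimal C sketch of counting decision-point events per thread. The thread count, event names, and do_work() workload are invented for illustration; this is not LIME's instrumentation code.

    /* Per-thread event counting at control-flow decision points (sketch). */
    #include <stdio.h>

    #define NUM_THREADS 4
    enum { EV_BRANCH_535, EV_LOOP_557, NUM_EVENTS };

    /* One counter per (thread, event) pair: the "per-thread counts". */
    static long event_count[NUM_THREADS][NUM_EVENTS];

    static void do_work(int tid, int my_num, int radix) {
        if (my_num != 0)                         /* decision point (cf. line 535) */
            event_count[tid][EV_BRANCH_535]++;
        for (int i = 0; i < radix; i++)          /* decision point (cf. line 557) */
            event_count[tid][EV_LOOP_557]++;     /* ... per-iteration work ... */
    }

    int main(void) {
        /* Emulate threads with different workloads to create an imbalance. */
        for (int tid = 0; tid < NUM_THREADS; tid++)
            do_work(tid, tid, tid == 0 ? 10 : 100);
        for (int tid = 0; tid < NUM_THREADS; tid++)
            printf("thread %d: branch@535=%ld loop@557=%ld\n", tid,
                   event_count[tid][EV_BRANCH_535], event_count[tid][EV_LOOP_557]);
        return 0;
    }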

  9. LIME RECIPE (pipeline steps, shown out of order on the slides: Leader Finding, Clustering, Scoring, Regression; caution: out of order!) Multiple regression: find a linear relationship between execution time and event counts. The weights tell us which events are important. Large weight: variation in the event is related to variation in execution time (imbalance), so the event is important. Small weight: the variation in the event is not related, so it can be safely ignored.
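As a rough illustration of this regression step, the sketch below fits per-thread execution times against per-thread event counts with ordinary least squares via the normal equations. The data, the two-event set, and the plain Gaussian-elimination solver are illustrative assumptions, not LIME's implementation.

    /* Sketch: fit time[i] ~ sum_j w[j] * count[i][j] via (X^T X) w = X^T t. */
    #include <stdio.h>

    #define N_THREADS 4   /* observations: one per thread */
    #define N_EVENTS  2   /* predictors: event (or cluster) counts */

    int main(void) {
        /* Per-thread event counts (X) and execution times (t), fabricated. */
        double X[N_THREADS][N_EVENTS] = { {10, 5}, {20, 5}, {30, 6}, {40, 5} };
        double t[N_THREADS] = { 12.0, 22.0, 33.0, 42.0 };

        /* Build the normal equations A w = b, where A = X^T X and b = X^T t. */
        double A[N_EVENTS][N_EVENTS] = {{0}}, b[N_EVENTS] = {0}, w[N_EVENTS];
        for (int i = 0; i < N_THREADS; i++)
            for (int j = 0; j < N_EVENTS; j++) {
                b[j] += X[i][j] * t[i];
                for (int k = 0; k < N_EVENTS; k++)
                    A[j][k] += X[i][j] * X[i][k];
            }

        /* Plain Gaussian elimination (no pivoting; fine for this tiny system). */
        for (int p = 0; p < N_EVENTS; p++)
            for (int r = p + 1; r < N_EVENTS; r++) {
                double f = A[r][p] / A[p][p];
                for (int c = p; c < N_EVENTS; c++) A[r][c] -= f * A[p][c];
                b[r] -= f * b[p];
            }
        for (int p = N_EVENTS - 1; p >= 0; p--) {
            double s = b[p];
            for (int c = p + 1; c < N_EVENTS; c++) s -= A[p][c] * w[c];
            w[p] = s / A[p][p];
        }

        /* A large |w[j]| means variation in event j tracks the time variation. */
        for (int j = 0; j < N_EVENTS; j++)
            printf("weight of event %d = %.3f\n", j, w[j]);
        return 0;
    }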

  10. LIME RECIPE (step: Clustering) Regression analysis tells us the important events, but many events have similar trends, and regression works poorly with such collinear events. Example from radix (the slide notes these lines execute only in thread 1, so their counts move together):
    535 if (MyNum != (...)) {
    540   while ((offset & 0x1) != 0) { ... }
    549   while ((offset & 0x1) != 0) { ... }
    557   for (i = 0; i < radix; i++) { ... }
    560 }
Hierarchical clustering, applied before the regression analysis, merges collinear events (which enables the regression analysis) and reduces the number of events (which makes the regression analysis faster).
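One way to picture this clustering step: merge events whose per-thread count vectors are nearly collinear. The sketch below uses Pearson correlation with a made-up 0.95 threshold and a union-find merge; LIME's actual proximity measure and hierarchical merge order may differ. Compile with -lm.

    /* Sketch: merge events whose per-thread counts are (nearly) collinear. */
    #include <math.h>
    #include <stdio.h>

    #define N_THREADS 4
    #define N_EVENTS  4

    static double counts[N_EVENTS][N_THREADS] = {
        {10, 20, 30, 40},   /* event 0 */
        {11, 21, 29, 41},   /* event 1: almost the same trend as event 0 */
        { 5,  5,  6,  5},   /* event 2: different trend */
        { 9, 19, 31, 39},   /* event 3: also tracks event 0 */
    };

    /* Pearson correlation of two per-thread count vectors. */
    static double correlation(const double *a, const double *b) {
        double ma = 0, mb = 0, cov = 0, va = 0, vb = 0;
        for (int i = 0; i < N_THREADS; i++) { ma += a[i]; mb += b[i]; }
        ma /= N_THREADS; mb /= N_THREADS;
        for (int i = 0; i < N_THREADS; i++) {
            cov += (a[i] - ma) * (b[i] - mb);
            va  += (a[i] - ma) * (a[i] - ma);
            vb  += (b[i] - mb) * (b[i] - mb);
        }
        return cov / sqrt(va * vb);
    }

    /* Union-find keeps track of which events ended up in the same cluster. */
    static int parent[N_EVENTS];
    static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }

    int main(void) {
        for (int i = 0; i < N_EVENTS; i++) parent[i] = i;
        for (int i = 0; i < N_EVENTS; i++)
            for (int j = i + 1; j < N_EVENTS; j++)
                if (fabs(correlation(counts[i], counts[j])) > 0.95)
                    parent[find(i)] = find(j);   /* merge collinear events */
        for (int i = 0; i < N_EVENTS; i++)
            printf("event %d -> cluster %d\n", i, find(i));
        return 0;
    }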

  11. LIME RECIPE Clustering + regression is still not enough. Clustering yields a set of events per cluster, and that set mixes two kinds of members: inter-related events that were led into the same cluster by some steering event, e.g. the radix block
    535 if (MyNum != (...)) {
    540   while ((offset & 0x1) != 0) { ... }
    549   while ((offset & 0x1) != 0) { ... }
    557   for (i = 0; i < radix; i++) { ... }
    560 }
and independent events that merely happen to be in the same cluster, e.g. checks that behave the same for all threads:
    131 if (MyNum != (...)) { ... }
    ...
    245 if (MyNum != (...)) { ... }
    ...
    677 if (MyNum != (...)) { ... }
How can we find that steering event?

  12. LIME RECIPE (step: Leader Finding) Finding that event: the event leading the cluster is the Leader Node. A leader node steers other events into the cluster. In the control-flow graph (CFG) annotated with clusters, it sits not inside the cluster but on the cluster border: after the event, a new cluster begins. Formally, {clusters of incoming edges} ∩ {clusters of outgoing edges} = Ø. (The slide shows a CFG of if/for/while nodes, with conditional branches and loops as the candidate leaders.)
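The border test on this slide can be illustrated directly: label each CFG edge with the cluster of the events on it, and flag a node as a leader candidate when no cluster appears on both an incoming and an outgoing edge. The tiny CFG below is invented for illustration; it is not LIME's graph representation.

    /* Sketch of the leader-node test: incoming and outgoing edge clusters
     * of a leader candidate do not overlap. */
    #include <stdio.h>

    #define N_NODES 4
    #define N_EDGES 5

    struct edge { int from, to, cluster; };

    static struct edge cfg[N_EDGES] = {
        {0, 1, 0},   /* straight-line code, cluster 0 */
        {1, 2, 1},   /* branch taken: work in cluster 1 */
        {1, 3, 2},   /* branch not taken: work in cluster 2 */
        {2, 3, 1},
        {3, 0, 0},
    };

    static int is_leader(int node) {
        for (int i = 0; i < N_EDGES; i++) {
            if (cfg[i].to != node) continue;
            for (int o = 0; o < N_EDGES; o++)
                if (cfg[o].from == node && cfg[o].cluster == cfg[i].cluster)
                    return 0;   /* same cluster enters and leaves: not a leader */
        }
        return 1;               /* incoming and outgoing cluster sets are disjoint */
    }

    int main(void) {
        for (int n = 0; n < N_NODES; n++)
            printf("node %d: %s\n", n,
                   is_leader(n) ? "leader candidate" : "inside a cluster");
        return 0;
    }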

  13. WHY LEADER NODE? The leader node gives more valuable information about the reason for the imbalance. Clustering + regression tells us WHAT events are executed by the threads that are late: not bad, but still too many. Clustering + regression + leader node finding tells us WHY the thread is late, by identifying the steering event. This is what programmers want.

  14. SERVING LIME (step: Scoring) LIME reports the leader nodes of the clusters related to the imbalance, each with a score adjusted by the importance of its cluster (the weight from the regression analysis). LIME is a framework best served with scores.
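The slides do not give the exact score formula, only that it reflects the cluster's importance (its regression weight). As a stand-in, the toy sketch below ranks leader nodes by their cluster weights normalized to the largest one; the leaders and weights are made up and this is not LIME's actual scoring.

    /* Toy scoring sketch: report leaders with a [0, 1] score derived from
     * made-up cluster weights. */
    #include <stdio.h>

    #define N_CLUSTERS 3

    int main(void) {
        const char *leader[N_CLUSTERS] = { "if @ 535", "while @ 578", "for @ 590" };
        double weight[N_CLUSTERS] = { 0.92, 0.31, 0.04 };   /* fabricated weights */

        double max = weight[0];
        for (int i = 1; i < N_CLUSTERS; i++)
            if (weight[i] > max) max = weight[i];

        /* Report each leader with its weight normalized by the largest one. */
        for (int i = 0; i < N_CLUSTERS; i++)
            printf("%-12s score %.2f\n", leader[i], weight[i] / max);
        return 0;
    }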

  15. LIME EVALUATION 15 applications from publicly available benchmark suites (PARSEC and SPLASH-2). Event profiling was done in two ways. Detailed simulation of program executions with 8–64 cores: pro, we can count any event we want (control flow, cache misses, etc.) without affecting the execution; con, slow. Pin instrumentation on 8 cores: fast, but less accurate and limited to fewer event types (only control flow).

  16. LIME OVERALL RESULT Small number of reported events: on average 1.4 control-flow events and 2.1 cache-miss events (reports with score > 0.1 are listed). High scores for the reported events: on average 0.71 for control-flow events and 0.96 for cache-miss events. Insufficient input leads to inaccurate output: when the imbalance is caused by events not in LIME's inputs (off-chip bandwidth, bus contention), which are related to cache misses, LIME finds the "closest" reason (cache misses) but with less confidence.

  17. LIME RESULT – barnes
LIME REPORT:
    Event        Score   Location
    1 0x405470   0.893   grav.C:116 (walksub) -> grav.C:112 (walksub)
    2 0x40548c   0.030   grav.C:113 (walksub)
    3 0x4055cc   0.024   grav.C:136 (walksub) -> grav.C:114 (walksub)
grav.C (recursive tree searching):
    105 void walksub(nodeptr n, real dsq, long ProcessId)
    106 {
    107   nodeptr* nn;
    112   if (subdivp(n, dsq, ProcessId)) {   /* First branch in walksub */
    113     if (Type(n) == CELL) {
    114       for (nn = Subp(n); nn < Subp(n) + NSUB; nn++) {
    115         if (*nn != NULL) {
    116           walksub(*nn, dsq / 4.0, ProcessId);
    117         }
    118       }
The load imbalance is caused by an imbalanced number of recursions in the depth-first tree search; LIME identified it accurately.

  18. LIME RESULT – LU Some threads spend 90% of their running time waiting: serious performance degradation due to load imbalance. The workload distribution function was ill-designed; a better distribution removes the imbalance, for almost a 2x speedup (1.9x).
LIME REPORT:
    Event        Score   Location
    1 0x4018b4   0.880   lu.C:668 (lu)
lu.C (distributes the workload):
    668 if (BlockOwner(I, J) == MyNum) {   /* parcel out blocks */
    669   B = a[K+J*nblocks];
    670   C = a[I+J*nblocks];
    671   bmod(A, B, C, strI, strJ, strK, strI, strK, strI);
    672 }
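To illustrate why the distribution function matters, the hypothetical sketch below compares a skewed BlockOwner-style mapping with a simple cyclic one and counts how many blocks each thread receives. This is not the SPLASH-2 code and not the actual fix behind the 1.9x result; the owner functions and sizes are made up.

    /* Hypothetical owner functions: a skewed mapping vs. a cyclic one. */
    #include <stdio.h>

    #define NBLOCKS 8
    #define NPROC   4

    /* Skewed owner: all blocks go to threads 0 and 1 only. */
    static int owner_skewed(int I, int J) { (void)I; return J % 2 ? 0 : 1; }

    /* Cyclic owner: deal blocks out round-robin over all threads. */
    static int owner_cyclic(int I, int J) { return (I + J) % NPROC; }

    static void count_blocks(int (*owner)(int, int), const char *name) {
        int per_thread[NPROC] = {0};
        for (int I = 0; I < NBLOCKS; I++)
            for (int J = 0; J < NBLOCKS; J++)
                per_thread[owner(I, J)]++;
        printf("%s:", name);
        for (int p = 0; p < NPROC; p++) printf(" T%d=%d", p, per_thread[p]);
        printf("\n");
    }

    int main(void) {
        count_blocks(owner_skewed, "skewed");
        count_blocks(owner_cyclic, "cyclic");
        return 0;
    }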

  19. LIME RESULT – blackscholes
LIME REPORT:
    Event            Score   Location
    1 0x401c18 L1    0.917   blackscholes.cpp:323
    2 0x401c10 L1    0.917   blackscholes.cpp:323
    3 0x401c0c L1    0.917   blackscholes.cpp:323
    4 0x401bfc L1    0.917   blackscholes.cpp:323
    ...
blackscholes.cpp:
        /* Calling main function to calculate option value */
    321 price = BlkSchlsEqEuroNoDiv( sptprice[i], strike[i],
    322                              rate[i], volatility[i], otime[i],
    323                              otype[i], 0);
Cache misses when accessing the structure-of-arrays data cause the imbalance. Refactoring to an array of structures improves cache performance; 83.3% of the imbalance is gone.
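The layout change described above can be sketched as follows. The field names follow the snippet on the slide; the types, the structure names, and the tiny driver are assumptions for illustration, not the actual PARSEC refactoring.

    /* Structure-of-arrays vs. array-of-structures layouts (sketch). */
    #include <stdio.h>
    #include <stdlib.h>

    /* Structure of arrays (original layout): each field in its own array,
     * so one option's fields land on several different cache lines. */
    struct options_soa {
        float *sptprice, *strike, *rate, *volatility, *otime;
        int   *otype;
    };

    /* Array of structures (refactored layout): all fields of option i are
     * adjacent, so one cache line serves one pricing call. */
    struct option_aos {
        float sptprice, strike, rate, volatility, otime;
        int   otype;
    };

    int main(void) {
        enum { N = 4 };
        struct option_aos *opt = calloc(N, sizeof *opt);
        for (int i = 0; i < N; i++) {
            opt[i].sptprice = 100.0f + i;
            opt[i].otype = i % 2;
            /* With this layout the call site would read, in the same argument
             * order as the snippet above:
             *   price = BlkSchlsEqEuroNoDiv(opt[i].sptprice, opt[i].strike,
             *                               opt[i].rate, opt[i].volatility,
             *                               opt[i].otime, opt[i].otype, 0);
             */
        }
        printf("option 0 spot price: %.1f\n", opt[0].sptprice);
        free(opt);
        return 0;
    }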

  20. CONCLUSIONS Load imbalance is a major roadblock in the many-core era: a scalability limiter that inhibits performance scaling, easy to detect but hard to fix. LIME is a framework for debugging load imbalance in multithreaded execution, based on clustering and regression analysis; leader nodes pinpoint the performance bug points, and scores quantify their importance for programmers.

  21. QUESTIONS? Jungju@gatech.edu for more information.

  22. LIME PERFORMANCE So, how fast is it? Clustering requires cubic computation, so it may not scale well for a large code base. Optimizations: cache the proximity matrix and merge multiple events per step, bringing the cost down toward quadratic; in practice, LIME's complexity is between linear and quadratic.
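As a sketch of the proximity-matrix caching idea: keep the pairwise proximities in a matrix and, after a merge, update only the merged cluster's row and column instead of recomputing every pair from scratch. The single-link update rule and the data below are illustrative assumptions, not LIME's exact algorithm.

    /* Proximity-matrix caching sketch: O(n) update per merge instead of
     * recomputing all pairwise proximities. */
    #include <stdio.h>

    #define N 5

    static double prox[N][N];   /* cached pairwise proximity */
    static int    alive[N];     /* 1 while the cluster still exists */

    /* Merge cluster b into cluster a; refresh only a's row and column. */
    static void merge(int a, int b) {
        alive[b] = 0;
        for (int k = 0; k < N; k++) {
            if (!alive[k] || k == a) continue;
            double d = prox[a][k] < prox[b][k] ? prox[a][k] : prox[b][k];
            prox[a][k] = prox[k][a] = d;   /* single-link style update */
        }
    }

    int main(void) {
        /* Fabricated symmetric proximities, just to exercise the update. */
        for (int i = 0; i < N; i++) {
            alive[i] = 1;
            for (int j = 0; j < N; j++)
                prox[i][j] = (i == j) ? 0.0 : (double)(i + j);
        }
        merge(0, 1);
        for (int k = 2; k < N; k++)
            printf("prox(cluster0, %d) = %.1f\n", k, prox[0][k]);
        return 0;
    }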
