
Galois Performance

Galois Performance. Mario Mendez-Lojo, Donald Nguyen. Overview: the Galois system is a test bed for exploring optimizations. It is safe but not fast out of the box. Important optimizations: select the least transactional overhead, the right scheduling, and an appropriate data structure.


Presentation Transcript


  1. Galois Performance • Mario Mendez-Lojo, Donald Nguyen

  2. Overview • The Galois system is a test bed for exploring optimizations • Safe but not fast out of the box • Important optimizations • Select the least transactional overhead • Select the right scheduling • Select an appropriate data structure • Quantify the optimizations on applications

  3. Algorithms • 1. Barnes-Hut • 2. Delaunay Mesh Refinement • 3. Preflow-push • Classification of irregular algorithms (diagram) • Topology: general graph, grid, tree • Operator: morph, local computation, reader • Ordering: unordered, ordered

  4. Methodology • Time vs. threads (chart), broken down into Serial, Idle, GC, Compute • Abort ratio: aborted iterations / total iterations • GC options • UseParallelGC • UseParallelOldGC • NewRatio=1

  5. Terms • Base: default scheduling, default graph • Serial: Galois classes replaced by classes without concurrency control • Speedup: measured against the best mean performance of a serial variant • Throughput: number of serial iterations / time
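The abort ratio, speedup, and throughput definitions above are simple ratios, and a minimal sketch may make them concrete. The class and method names below are hypothetical helpers for illustration and are not part of the Galois API; the sample times in `main` are the Barnes-Hut figures reported later in the deck.

```java
import java.util.Locale;

/** Hypothetical helper for the metrics defined in the slides; not part of Galois. */
final class PerfMetrics {
    /** Abort ratio = aborted iterations / total iterations. */
    static double abortRatio(long abortedIterations, long totalIterations) {
        if (totalIterations <= 0) {
            throw new IllegalArgumentException("totalIterations must be positive");
        }
        return (double) abortedIterations / totalIterations;
    }

    /** Speedup = time of the best serial variant / parallel time. */
    static double speedup(double bestSerialMs, double parallelMs) {
        return bestSerialMs / parallelMs;
    }

    /** Throughput = number of serial iterations / time. */
    static double throughput(long serialIterations, double timeMs) {
        return serialIterations / timeMs;
    }

    public static void main(String[] args) {
        // Barnes-Hut times from the results slide: 10271 ms serial, 1553 ms parallel.
        double s = speedup(10271, 1553);
        System.out.println(String.format(Locale.ROOT, "speedup = %.1fX", s));
    }
}
```

Whether "total iterations" counts committed plus aborted iterations is an assumption here; the slides do not spell it out.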

  6. Numbers • Runtime • Last of 5 runs in same VM • Ignore time to read and construct initial graph • Other statistics • Last of 5 runs
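The measurement protocol above (report the last of 5 runs in the same VM, and exclude the time to read and construct the input) can be sketched as a small harness. This is an illustrative sketch, not the authors' harness; `LastOfNRuns` and its parameters are hypothetical names, and it uses lambdas, which require a newer Java than the 1.6 release listed in the test environment.

```java
import java.util.Arrays;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical timing harness: last of N runs in the same VM, setup excluded. */
final class LastOfNRuns {
    /**
     * Runs setup (untimed) and body (timed) `runs` times in this VM and
     * returns the elapsed milliseconds of the last run only.
     */
    static <T> double lastRunMillis(int runs, Supplier<T> setup, Consumer<T> body) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        double lastMs = 0.0;
        for (int i = 0; i < runs; i++) {
            T input = setup.get();            // e.g. read and build the graph; not timed
            long start = System.nanoTime();
            body.accept(input);               // the computation being measured
            lastMs = (System.nanoTime() - start) / 1e6;
        }
        return lastMs;                        // earlier runs only warm up the VM
    }

    public static void main(String[] args) {
        double ms = LastOfNRuns.<long[]>lastRunMillis(
                5,
                () -> new long[1_000_000],
                a -> Arrays.fill(a, 1L));
        System.out.println("last run: " + ms + " ms");
    }
}
```

Keeping all runs in one VM lets JIT compilation and class loading settle before the run that is actually reported.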

  7. Test Environment • 2 x Xeon X5570 (4 core, 2.93 GHz) • Java 1.6.0_0-b11 • Linux 2.6.24-27 x86_64 • 20GB heap size

  8. Barnes-Hut • Image: Most Distant Galaxy Candidates in the Hubble Ultra Deep Field

  9. Barnes-Hut • N-body algorithm • Oct-tree acceleration structure • Serial • Tree build, center of mass, particle update • Parallel • Force computation • Structure • Reader on tree • Variants • Splash2, Reader Galois

  10. Reader Optimization • Before: child = octree.getNeighbor(nn, 1); • After: child = octree.getNeighbor(nn, 1, MethodFlag.NONE);

  11. ParaMeter Profile

  12. Barnes-Hut Results • Best serial: base • Serial time: 10271 ms • Best parallel (//) time: 1553 ms • Best speedup: 6.6X • Input: 100,000 points, 1 time step

  13. Barnes-Hut Results • Best serial: base • Serial time: 10271 ms • Best parallel (//) time: 1553 ms • Best speedup: 6.6X • Input: 100,000 points, 1 time step

  14. Barnes-Hut Scalability

  15. Delaunay Mesh Refinement

  16. Delaunay Mesh Refinement • Refine “bad” triangles • Maintained in worklist • Structure • Cautious operator on graph • Variants • Flag optimized, locallifo • base: Priority.defaultOrder() • local lifo: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)

  17. Cautious Optimization • No need to save undo info • Only check conflicts up to first write • Before: mesh.contains(item); ... mesh.remove(preNodes.get(i)); ... mesh.add(node); • After: mesh.contains(item, MethodFlag.CHECK_CONFLICT); ... mesh.remove(preNodes.get(i), MethodFlag.NONE); ... mesh.add(node, MethodFlag.NONE);
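Why a cautious operator needs no undo information can be illustrated with a toy, single-threaded model: an iteration first claims every element it will touch, and only writes after all claims succeed, so an abort always happens before any state has changed. The sketch below is an assumption-laden illustration, not the Galois runtime; `CautiousSketch`, its ownership map, and its methods are hypothetical stand-ins for the conflict checks that `MethodFlag.CHECK_CONFLICT` requests and that `MethodFlag.NONE` skips.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of a cautious operator; hypothetical, not the Galois implementation. */
final class CautiousSketch {
    /** element -> iteration that currently owns it (stand-in for conflict marks). */
    private final Map<Integer, Integer> owner = new HashMap<>();
    /** The shared structure being modified. */
    final Set<Integer> mesh = new HashSet<>();

    /** Phase 1: conflict-checked reads. Returns false (abort) on the first conflict. */
    boolean tryAcquireAll(int iteration, List<Integer> neighborhood) {
        for (int element : neighborhood) {
            Integer current = owner.putIfAbsent(element, iteration);
            if (current != null && current != iteration) {
                release(iteration);   // nothing was written, so there is nothing to undo
                return false;
            }
        }
        return true;
    }

    /** Phase 2: writes with no conflict checks and no undo log, then release. */
    void commit(int iteration, List<Integer> neighborhood, int newElement) {
        mesh.removeAll(neighborhood);
        mesh.add(newElement);
        release(iteration);
    }

    /** Drops every ownership mark held by the given iteration. */
    void release(int iteration) {
        owner.values().removeIf(o -> o == iteration);
    }

    public static void main(String[] args) {
        CautiousSketch s = new CautiousSketch();
        s.mesh.addAll(Arrays.asList(1, 2, 3, 4, 5));
        System.out.println(s.tryAcquireAll(1, Arrays.asList(1, 2, 3))); // iteration 1 claims its cavity
        System.out.println(s.tryAcquireAll(2, Arrays.asList(4, 3)));    // iteration 2 overlaps on 3
        s.commit(1, Arrays.asList(1, 2, 3), 10);
        System.out.println(s.mesh);
    }
}
```

In this model the aborted iteration only has to drop its ownership marks, which mirrors the slide's two claims: conflict checks are needed only up to the first write, and no undo information has to be saved.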

  18. LIFO Optimization • Before: GaloisRuntime.foreach(..., Priority.defaultOrder()); • After: GaloisRuntime.foreach(..., Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
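One plausible reading of a chunked-FIFO worklist with local LIFO order is sketched below: chunks of initial work are handed out in FIFO order, while work created during processing goes onto a local stack and is handled before the next chunk is fetched. This is a single-threaded illustration under that assumption, not the Galois scheduler; `LocalLifoSketch`, `run`, and `spawn` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

/** Toy chunked-FIFO worklist with local LIFO order; hypothetical, not Galois code. */
final class LocalLifoSketch {
    /** Returns the order in which items are processed. */
    static <T> List<T> run(Collection<T> initial, int chunkSize, Function<T, List<T>> operator) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be at least 1");
        }
        // Split the initial work into chunks kept in a FIFO queue.
        Deque<List<T>> chunks = new ArrayDeque<>();
        List<T> current = new ArrayList<>();
        for (T item : initial) {
            current.add(item);
            if (current.size() == chunkSize) {
                chunks.addLast(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            chunks.addLast(current);
        }

        List<T> order = new ArrayList<>();
        Deque<T> local = new ArrayDeque<>();   // local stack (LIFO)
        while (!chunks.isEmpty()) {
            List<T> chunk = chunks.pollFirst();
            // Push in reverse so the chunk's first item is on top of the stack.
            for (int i = chunk.size() - 1; i >= 0; i--) {
                local.push(chunk.get(i));
            }
            while (!local.isEmpty()) {
                T item = local.pop();
                order.add(item);
                for (T created : operator.apply(item)) {
                    local.push(created);       // new work is processed next
                }
            }
        }
        return order;
    }

    /** Demo operator: items below 10 create one follow-up item. */
    static List<Integer> spawn(Integer x) {
        return x < 10 ? Arrays.asList(x * 10) : Collections.<Integer>emptyList();
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(1, 2, 3, 4), 2, LocalLifoSketch::spawn));
    }
}
```

A likely motivation for the local LIFO order in mesh refinement is locality: newly created bad triangles sit next to the triangle that was just refined, so processing them immediately touches data that was accessed a moment ago.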

  19. ParaMeter Profile

  20. DMR Results • Best serial: locallifo.flagopt • Serial time: 17002 ms • Best parallel (//) time: 3745 ms • Best speedup: 4.5X • Input: 0.5M triangles, 0.25M bad triangles

  21. Preflow-Push

  22. Preflow-push • Max-flow algorithm • Nodes push flow downhill • Structure • Cautious, local computation • Variants • Flag optimized, local computation graph • base (discharge): Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class) • base (relabel): Priority.first(ChunkedFIFO.class, 8)

  23. Local Computation Optimization • Before: graph = ... • After: graph = ...; b = new LocalComputationGraph.ObjectGraphBuilder(); graph = b.from(graph).create();

  24. ParaMeter Profile

  25. Preflow-push Results • C: 11450 ms • Java: 30234 ms • Best serial: lc.flagopt • Serial time: 57121 ms • Best parallel (//) time: 18242 ms • Best speedup: 3.1X • Input: from challenge problem (genmf-wide), 14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges • http://avglab.com/andrew/CATS/maxflow_synthetic.htm

  26. Preflow-push Scalability

  27. What performance did we expect? • Time vs. threads (chart), broken down into Serial, GC, // Compute, Error • Error is measured indirectly • Idle • Misspeculation • Synchronization, …

  28. What performance did we expect? • Naïve: $r(x) = t_1 / x$ • Amdahl: $r(x) = t_p / x + t_s$, where $t_1 = t_p + t_s$ and $t_s = t_{idle} + t_{gc} + t_{serial}$ • Simple: $r(x) = \left(t_p \, (i_x / i_1)\right) / x + t_s$
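The three expected-runtime models above (naïve: t1 / x; Amdahl: tp / x + ts; simple: tp scaled by ix / i1, divided by x, plus ts) can be written as a minimal sketch. The slide does not define ix and i1; the code assumes they are the iteration counts observed with x threads and with one thread, so the ratio accounts for extra work such as aborted iterations. `ExpectedRuntime` and the numbers in `main` are hypothetical and are not measurements from the deck.

```java
/** Hypothetical helper for the three expected-runtime models; not part of Galois. */
final class ExpectedRuntime {
    /** Naive model: r(x) = t1 / x, i.e. the whole run parallelizes perfectly. */
    static double naive(double t1, int threads) {
        return t1 / threads;
    }

    /** Amdahl model: r(x) = tp / x + ts, where t1 = tp + ts. */
    static double amdahl(double tp, double ts, int threads) {
        return tp / threads + ts;
    }

    /**
     * Simple model: r(x) = (tp * (ix / i1)) / x + ts.
     * ASSUMPTION: ix and i1 are iteration counts with x threads and with 1 thread.
     */
    static double simple(double tp, double ts, double ix, double i1, int threads) {
        return tp * (ix / i1) / threads + ts;
    }

    public static void main(String[] args) {
        // Illustrative values only: 7000 ms parallelizable, 1000 ms serial part.
        double tp = 7000, ts = 1000;
        int x = 7;
        System.out.println("naive:  " + naive(tp + ts, x));
        System.out.println("amdahl: " + amdahl(tp, ts, x));
        System.out.println("simple: " + simple(tp, ts, 1200, 1000, x));
    }
}
```

With ix equal to i1 the simple model reduces to the Amdahl model, so the two differ only by the extra iterations executed in the parallel run.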

  29. Barnes-Hut

  30. Delaunay Mesh Refinement

  31. Preflow-push

  32. Summary • Many profitable optimizations • Selecting among method flags, worklists, graph variants • Open topics • Automation • Static, dynamic and performance analysis • Efficient ordered algorithms
