
Profile-Driven Selective Program Loading

This research aims to reduce memory footprint by selectively loading parts of shared libraries, targeting Unix/Linux systems and parallel programs running on multiple nodes.


Presentation Transcript


  1. Profile-Driven Selective Program Loading • Tugrul Ince (tugrul@cs.umd.edu) • Jeff Hollingsworth • Department of Computer Science, University of Maryland, College Park, MD 20742

  2. Motivation • Programs are getting larger! • Many frameworks and libraries • Many supercomputers lack demand-paging • Example: Cray XT and BlueGene series • Available memory is scarce • Observation: Most programs do not use every available function! • Frameworks and libraries are too general • Code that handles errors or special cases • Why not remove functions that are not used in the common case?

  3. Aim Reduce memory footprint by selectively loading parts of shared libraries

  4. Target Platforms and Applications • Unix/Linux systems that support ELF • Modifies ELF program headers • Applications with many libraries • Most current reasonable applications • Parallel programs running on multiple nodes • MPI etc. • Platforms without demand-paging • Cray XT and BlueGene series

  5. Architecture Overview • Application is profiled. • It is rewritten with • Modified Shared Libraries • A Signal Handler • Application is executed as usual.

  6. Profiler • Need a list of never-called functions in each shared library • Profile the application several times • May not be perfect • DynInst-based profiler • Write small program (~ 70 LOC) • Rewrite shared libraries • Profile as many times as necessary
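The merging step behind "profile as many times as necessary" can be pictured with a small sketch. This is not the DynInst-based profiler from the slides; the function names and the `never_called` helper are hypothetical. The idea is only that a function counts as unused if no profiling run ever called it.

```python
def never_called(all_functions, runs):
    """Return the functions that no profiling run observed being called."""
    called = set()
    for run in runs:
        called |= set(run)  # union of everything seen in any run
    return sorted(set(all_functions) - called)

# Hypothetical library contents and two hypothetical profiling runs.
library = ["solve", "norm", "handle_error", "debug_dump"]
runs = [{"solve"}, {"solve", "norm"}]

unused = never_called(library, runs)
print(unused)
```

Adding more runs can only shrink the unused list, which is why extra profiling makes the result safer but never guarantees it is complete.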

  7. Rewriting • Do not load unused functions • Modify ELF program headers • Example: libpetsc.so
     Original program headers:
       Type      Offset   VirtAddr   PhysAddr   FileSiz  MemSiz   Flg Align
       LOAD      0x000000 0x00000000 0x00000000 0x124584 0x124584 R E 0x1000
       LOAD      0x124584 0x00125584 0x00125584 0x013f8  0x0a434  RW  0x1000
       DYNAMIC   0x12459c 0x0012559c 0x0012559c 0x00130  0x00130  RW  0x4
       GNU_STACK 0x000000 0x00000000 0x00000000 0x00000  0x00000  RW  0x4
     The first (R E) entry is replaced by two LOAD entries:
       Type      Offset   VirtAddr   PhysAddr   FileSiz  MemSiz   Flg Align
       LOAD      0x000000 0x00000000 0x00000000 0x090000 0x090000 R E 0x1000
       LOAD      0x112000 0x00112000 0x00112000 0x012584 0x012584 R E 0x1000
     • First loadable segment: .text, .init, .fini, .plt • Second loadable segment: .dynamic, .got, .got.plt, .data, .bss

  8. Rewriting • Do not load unused functions • Modify ELF program headers • Example: libpetsc.so
     Modified program headers:
       Type      Offset   VirtAddr   PhysAddr   FileSiz  MemSiz   Flg Align
       LOAD      0x000000 0x00000000 0x00000000 0x090000 0x090000 R E 0x1000
       LOAD      0x112000 0x00112000 0x00112000 0x012584 0x012584 R E 0x1000
       LOAD      0x124584 0x00125584 0x00125584 0x013f8  0x0a434  RW  0x1000
       DYNAMIC   0x12459c 0x0012559c 0x0012559c 0x00130  0x00130  RW  0x4
       GNU_STACK 0x000000 0x00000000 0x00000000 0x00000  0x00000  RW  0x4
     • First loadable segment: .text, .init, .fini, .plt • Second loadable segment: .dynamic, .got, .got.plt, .data, .bss
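To make the effect of the split concrete, the sketch below works out the hole left between the two executable LOAD entries in the libpetsc.so example. The numbers come from the program headers on the slide; the tuple layout and the `hole_between` helper are illustrative assumptions.

```python
PAGE = 0x1000  # page size, matching the Align column

# (offset, vaddr, filesz, memsz) of the two R E LOAD entries after rewriting.
text_low = (0x000000, 0x00000000, 0x090000, 0x090000)
text_high = (0x112000, 0x00112000, 0x012584, 0x012584)
original_memsz = 0x124584  # MemSiz of the single R E LOAD entry before rewriting

def hole_between(low, high):
    """Return (start, end) of the unmapped gap between two LOAD segments."""
    start = low[1] + low[3]  # end of the lower segment in memory
    end = high[1]            # start of the higher segment
    return start, end

start, end = hole_between(text_low, text_high)
hole = end - start
loaded = text_low[3] + text_high[3]

print(hex(start), hex(end), hole // PAGE, "pages not loaded")
print(f"{hole / original_memsz:.1%} of the original executable segment is skipped")
```

The two remaining segments plus the hole add up exactly to the original segment size, so nothing is moved in this example; a range of pages is simply left unmapped.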

  9. Rewriting • Rewriter based on DynInst • Profile data is used to create lists of Used and Unused functions • Access / Modify symbols • Defragment functions to maximize space savings • Requires moving functions inside shared libraries

  10. Function Defragmentation [figure; legend: Used, Unused]
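A toy model shows why defragmentation matters for page-granular loading. It is a sketch under simple assumptions (hypothetical function names and sizes, used functions packed first) and does not reproduce the rewriter's actual placement policy.

```python
PAGE = 0x1000  # 4 KiB pages, matching the 0x1000 alignment in the headers

def lay_out(functions):
    """Assign consecutive offsets to (name, size) pairs in the given order."""
    layout, offset = {}, 0
    for name, size in functions:
        layout[name] = offset
        offset += size
    return layout

def defragment(functions, used):
    """Place used functions first so unused code forms one contiguous run."""
    kept = [f for f in functions if f[0] in used]
    dropped = [f for f in functions if f[0] not in used]
    return lay_out(kept + dropped)

def pages_touched(layout, sizes, used):
    """Count the pages that contain at least one byte of a used function."""
    pages = set()
    for name in used:
        first = layout[name] // PAGE
        last = (layout[name] + sizes[name] - 1) // PAGE
        pages.update(range(first, last + 1))
    return len(pages)

# Hypothetical library: used functions a, b, c interleaved with unused x, y.
functions = [("a", 0x800), ("x", 0x1000), ("b", 0x800), ("y", 0x1000), ("c", 0x800)]
used = {"a", "b", "c"}
sizes = dict(functions)

before = pages_touched(lay_out(functions), sizes, used)
after = pages_touched(defragment(functions, used), sizes, used)
print(before, after)
```

In this toy layout the used functions occupy three pages when interleaved with unused code and two pages once packed together, so one more page can be left unloaded.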

  11. Challenges: Relative Calls • Common way of calling functions in PIC. • If either callee or caller is moved, their relative positioning changes. • Offsets in such relative call instructions need to be updated [figure: `call d` reaches foo at distance d before the move; `call d'` reaches foo at distance d' after the move]
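The displacement update can be sketched for one concrete encoding. The example assumes the 5-byte x86 `call rel32` instruction (opcode `0xE8` followed by a signed 32-bit displacement measured from the end of the instruction); the addresses and the `encode_call` helper are hypothetical.

```python
import struct

CALL_LEN = 5  # x86 `call rel32`: 1 opcode byte + 4 displacement bytes

def encode_call(call_addr, target_addr):
    """Encode `call rel32` placed at call_addr so that it reaches target_addr."""
    disp = target_addr - (call_addr + CALL_LEN)  # relative to the next instruction
    return b"\xe8" + struct.pack("<i", disp)     # little-endian signed 32-bit

call_site = 0x1000                          # hypothetical address of the call
before = encode_call(call_site, 0x2000)     # foo at its original address
after = encode_call(call_site, 0x0800)      # foo after being moved lower

print(before.hex(), after.hex())
```

Only the four displacement bytes change. The same recomputation is needed when the caller moves instead, because `call_addr` enters the formula as well.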

  12. Challenges: Symbols • Runtime linker uses symbols to resolve cross-library calls. • Uses procedure linkage tables (plt) • If a function is moved, its associated symbol has to be updated. [figure: `call foo@plt` resolves through the symbol foo, whose value changes from 0xdeadbeef to 0xbeefdead when foo is moved]
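Updating a symbol amounts to rewriting the address field of its symbol-table entry. The sketch below assumes a 32-bit little-endian ELF file, where an `Elf32_Sym` entry is 16 bytes laid out as `st_name`, `st_value`, `st_size`, `st_info`, `st_other`, `st_shndx`; the sample entry and the `move_symbol` helper are made up for illustration.

```python
import struct

# Elf32_Sym: st_name, st_value, st_size (32-bit each), st_info, st_other (bytes), st_shndx (16-bit)
ELF32_SYM = struct.Struct("<IIIBBH")

def move_symbol(entry, new_value):
    """Return a copy of a packed Elf32_Sym with st_value set to new_value."""
    name, _old_value, size, info, other, shndx = ELF32_SYM.unpack(entry)
    return ELF32_SYM.pack(name, new_value, size, info, other, shndx)

# Hypothetical entry for foo: global function symbol (st_info = 0x12), 64 bytes long.
entry = ELF32_SYM.pack(1, 0xDEADBEEF, 64, 0x12, 0, 11)
patched = move_symbol(entry, 0xBEEFDEAD)

print(hex(ELF32_SYM.unpack(patched)[1]))
```

Because the runtime linker resolves `foo@plt` through this entry, callers in other libraries need no changes once the value is correct.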

  13. Challenges: Jump Tables • Used to represent n-way branches at machine level • Targets are read from jump table • Entries are offsets of targets from the GOT address • Becomes invalid if the function referenced in a jump table is moved • DynInst reads jump tables to generate CFGs • We update entries so that they can be used to point to new location of targets
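Since each entry is an offset from the GOT address, moving a function means recomputing the entries whose targets lie inside it. This is a minimal sketch with hypothetical addresses and a hypothetical `update_jump_table` helper; it assumes the GOT itself stays in place.

```python
def update_jump_table(entries, got_addr, old_start, size, new_start):
    """Rewrite GOT-relative entries whose targets lie inside the moved function."""
    delta = new_start - old_start
    updated = []
    for entry in entries:
        target = got_addr + entry              # absolute address the entry points to
        if old_start <= target < old_start + size:
            target += delta                    # target moved along with its function
        updated.append(target - got_addr)      # back to a GOT-relative offset
    return updated

got = 0x125000                                 # hypothetical GOT address
old_targets = [0x100010, 0x100040]             # branch targets inside the function
entries = [t - got for t in old_targets]       # negative: the code sits below the GOT

# The function occupying [0x100000, 0x100200) is moved to 0x080000.
updated = update_jump_table(entries, got, 0x100000, 0x200, 0x080000)
print([hex(got + e) for e in updated])
```

Entries that point outside the moved function are left untouched, so the same routine can be applied once per moved function.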

  14. Unexpectedly Called Function • Execution is not always predictable • Unexpected function calls • Rewrite original executable with a Signal Handler • Load the function upon an unexpected call • Signal Handler picks up page faults (SIGSEGV) • Loads requested page on-demand • Execution resumes • User-level: No OS modifications
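The address arithmetic a handler of this kind would need can be modelled without touching real signals. The sketch assumes the handler knows the library's load base and the bounds of the hole, and that file offset equals virtual address inside the executable segment (as it does in the libpetsc.so headers shown earlier). The load base and the `page_to_load` helper are hypothetical, and actually mapping the page is outside this model.

```python
PAGE = 0x1000

def page_to_load(fault_addr, load_base, hole_start, hole_end):
    """Map a faulting address to (address to map at, file offset to read from).

    Returns None when the fault lies outside the hole, i.e. a genuine
    segmentation fault that should not be handled by loading a page.
    """
    vaddr = fault_addr - load_base
    if not (hole_start <= vaddr < hole_end):
        return None
    page_vaddr = vaddr & ~(PAGE - 1)           # round down to the page boundary
    return load_base + page_vaddr, page_vaddr  # offset == vaddr by assumption

load_base = 0x40000000                         # hypothetical library load address
hit = page_to_load(0x400A1234, load_base, 0x090000, 0x112000)
miss = page_to_load(0x40000010, load_base, 0x090000, 0x112000)
print(hit, miss)
```

Distinguishing the two cases matters because the same SIGSEGV handler also receives ordinary invalid-pointer faults.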

  15. Experiments • Tested on • PETSc ex5 in snes package • PETSc ex2 in ksp package • GS2 • Compiled with debug flag and no optimization • Used Open MPI • Tested on 64-node cluster at UMD • Dual-core x86 processors • Unmodified Linux kernel • Space savings of about 82% on average

  16. PETSc – snes (ex5)

  17. PETSc – snes (ex5)

  18. PETSc – ksp (ex2)

  19. GS2

  20. Running Times • GS2 takes 5 seconds less on average • (36m 38s vs. 36m 33s) • Overhead on PETSc examples • ex2 runs for 2.7 secs, ex5 runs for 1.05 secs.

  21. Running Times • Results suggest no overhead for reasonably-long running programs • Initial cost for signal handler registration • Better instruction cache and TLB performance

  22. Summary • Our tool reduces memory footprint of shared libraries • Rewrite shared libraries with holes • Defragment functions to maximize space savings • On-demand page loading if a not-yet-loaded function is called • About 82% memory space savings for shared libraries • Might improve instruction cache and TLB performance
