
Cross-Architectural Performance Portability of a Java Virtual Machine Implementation


Presentation Transcript


  1. Cross-Architectural Performance Portability of a Java Virtual Machine Implementation • Matthias Jacob, Princeton University • Keith Randall, Google, Inc.

  2. JVM architecture • [Diagram: Java Bytecode → Interpreter / JIT → Native Code → CPU, all inside the JVM]

  3. JVM architecture • [Diagram repeated: Java Bytecode → Interpreter / JIT → Native Code → CPU]

  4. Compaq FastVM • State-of-the-art implementation of the JVM on Alpha • True 64-bit implementation • Efficient optimization mechanisms • Not feedback-based (unlike HotSpot) • Can we port the code generator to x86 and preserve the performance?

  5. Differences: Alpha vs. x86 • Reduced number of registers • 8 registers on x86 versus 31 on Alpha • Instructions contain multiple operations • A single x86 instruction can correspond to several Alpha instructions • Different addressing modes • Arithmetic x86 instructions can operate on memory directly • Non-orthogonality of the instruction set • Different registers require different instructions • Source registers get overwritten • On x86, an operand register is also used to store the result

  6. Outline • Modified Optimizations for x86 • Register Allocation • Instruction Selection • Instruction Patching • Method Inlining • New Optimizations for x86 • Calling Convention • Floating-Point Modes • Results • Conclusion

  7. Register Allocation for JIT • Traditional optimal register allocation too expensive • Graph coloring • Use heuristics • LMAP structure

  8. Register Allocation • Java entities: Local variables Lx and Java stack locations S(y) • Assign every Java entity home location H • Temporary location T for intermediate results

  9. Register Allocation • Limited amount of registers • Flexible partitioning H- / T-registers • No dedicated registers • Thread-local pointer in segment register

  10. Register Allocation • Instructions limited to certain registers • Allocate only subset of registers

  11. Register Allocation • Memory locations as arguments • Pick different addressing mode instead of allocating register

  12. Register Allocation Speedup

  13. Instruction Selection • Alpha/RISC: • ALU operations • Memory operations • Control operations • x86/CISC: • Instructions can combine ALU, memory, and control operations • Different addressing modes • Limited set of registers per instruction • 64-bit operations must be emulated • Floating-point stack

  14. Instruction Patching • Patching instructions • Class initializers • Fixing up branches • Copying registers • Method inlining • Needs to be atomic because of concurrency • Alpha: every instruction is 4 bytes • A single write is sufficient

  15. Instruction Patching on x86 • Different instruction lengths • Patch instructions atomically using Compare-and-Exchange • Pad with NOPs • Difficult to walk back in code for renaming registers (as on Alpha) • Input registers are often output registers • Renaming output registers alone is not sufficient • Retargeting by forward-looking heuristic • Look for nearest future use of a preferred register

  16. Method Inlining Speedup

  17. Outline • Modified Optimizations for x86 • Register Allocation • Instruction Selection • Instruction Patching • Method inlining • New Optimizations for x86 • Calling Convention • Floating-Point Modes • Results • Conclusion

  18. Optimizations for x86 • Calling convention on x86 • Argument passing on the stack instead of in registers • Allocate registers for argument passing • Two registers for stack management: frame pointer and stack pointer • Constant stack frame size • Detection of stack overflow is difficult • Check at the bottom of the stack frame in the method prolog • 8-byte stack operations may be unaligned • Align stack frames to 8-byte boundaries

  19. Optimized stack frame layout • [Frame diagram, high to low addresses: Input arguments → Return address → Callee-save space (4 bytes) → Local variables → Output stack arguments, with %esp at the bottom of the frame] • Method prolog: subl $24, %esp / movl %ebx, (%esp) • Method epilog: movl (%esp), %ebx / addl $24, %esp / ret

  20. Floating-Point Modes • Alpha: • Floating-point precision is encoded in instruction • x86: • Toggle floating-point precision explicitly • Heuristically find default setting • Reduce number of toggles

  21. Floating-Point Speedup

  22. Overall speedup

  23. Results Average scenario

  24. Results Best-case scenario

  25. Conclusion • The FastVM port to x86 is competitive: • Fastest JVM implementation on javac and jack • Minimal optimization effort • Pitfalls, but also advantages • Instruction selection on x86 • It is generally easier to generate efficient code for RISC • More architecture-neutral optimizations are possible • Register allocation
