
Cross-Architectural Performance Portability of a Java Virtual Machine Implementation


Presentation Transcript


  1. Cross-Architectural Performance Portability of a Java Virtual Machine Implementation • Matthias Jacob, Princeton University • Keith Randall, Google, Inc.

  2. JVM architecture • [Diagram: Java Bytecode → Interpreter / JIT → Native Code → CPU, all inside the JVM]

  3. JVM architecture • [Diagram repeated: Java Bytecode → Interpreter / JIT → Native Code → CPU]

  4. Compaq FastVM • State-of-the-art implementation of the JVM on Alpha • True 64-bit implementation • Efficient optimization mechanisms • Not feedback-based (unlike HotSpot) • Can we port the code generator to x86 and preserve the performance?

  5. Differences: Alpha vs. x86 • Reduced number of registers • 8 registers on x86 versus 31 on Alpha • Instructions contain multiple operations • A single x86 instruction can correspond to several Alpha instructions • Different addressing modes • Arithmetic x86 instructions can operate on memory directly • Non-orthogonality of the instruction set • Different registers require different instructions • Source registers get overwritten • On x86, an operand register is also used to store the result

  6. Outline • Modified Optimizations for x86 • Register Allocation • Instruction Selection • Instruction Patching • Method Inlining • New Optimizations for x86 • Calling Convention • Floating-Point Modes • Results • Conclusion

  7. Register Allocation for JIT • Traditional optimal register allocation too expensive • Graph coloring • Use heuristics • LMAP structure

  8. Register Allocation • Java entities: Local variables Lx and Java stack locations S(y) • Assign every Java entity home location H • Temporary location T for intermediate results

  9. Register Allocation • Limited amount of registers • Flexible partitioning H- / T-registers • No dedicated registers • Thread-local pointer in segment register

  10. Register Allocation • Instructions limited to certain registers • Allocate only subset of registers

  11. Register Allocation • Memory locations as arguments • Pick different addressing mode instead of allocating register

  12. Register Allocation Speedup

  13. Instruction Selection • Alpha/RISC: • ALU operations • Memory operations • Control operations • x86/CISC: • Instructions can combine ALU, memory, and control operations • Different addressing modes • Limited set of registers per instruction • 64-bit operations must be emulated • Floating-point stack

  14. Instruction Patching • Patching instructions • Class initializers • Fixing up branches • Copying registers • Method inlining • Needs to be atomic because of concurrency • Alpha: every instruction is 4 bytes • A single write is sufficient

  15. Instruction Patching on x86 • Different instruction lengths • Patch instructions atomically using Compare-and-Exchange • Pad with NOPs • Difficult to walk back in code for renaming registers (as on Alpha) • Input registers are often output registers • Renaming output registers alone is not sufficient • Retargeting by forward-looking heuristic • Look for nearest future use of a preferred register

  16. Method Inlining Speedup

  17. Outline • Modified Optimizations for x86 • Register Allocation • Instruction Selection • Instruction Patching • Method inlining • New Optimizations for x86 • Calling Convention • Floating-Point Modes • Results • Conclusion

  18. Optimizations for x86 • Calling convention on x86 • Argument passing on the stack instead of in registers • Allocate registers for argument passing • Two registers for stack management: frame pointer and stack pointer • Constant stack frame size • Detection of stack overflow is difficult • Check at the bottom of the stack frame in the method prolog • 8-byte stack operations may be unaligned • Align stack frames to 8-byte boundaries

  19. Optimized stack frame layout • [Frame diagram, high to low addresses: Input arguments → Return address → Callee-save space (4 bytes) → Local variables → Output stack arguments, with %esp at the bottom of the frame] • Method prolog: subl $24, %esp / movl %ebx, (%esp) • Method epilog: movl (%esp), %ebx / addl $24, %esp / ret

  20. Floating-Point Modes • Alpha: • Floating-point precision is encoded in instruction • x86: • Toggle floating-point precision explicitly • Heuristically find default setting • Reduce number of toggles

  21. Floating-Point Speedup

  22. Overall speedup

  23. Results Average scenario

  24. Results Best-case scenario

  25. Conclusion • The FastVM port to x86 is competitive: • Fastest JVM implementation on javac and jack • Minimal optimization effort • Pitfalls, but also advantages • Instruction selection on x86 • It is generally easier to generate efficient code for RISC • More architecture-neutral optimizations are possible • Register allocation
