
CPE 631 Project Presentation

CPE 631 Project Presentation. Reconfiguration of architectural parameters to maximize performance and using software techniques to reduce cache miss rate. Hussein Alzoubi and Rami Alnamneh.



  1. CPE 631 Project Presentation
     Reconfiguration of architectural parameters to maximize performance and using software techniques to reduce cache miss rate
     Hussein Alzoubi and Rami Alnamneh

  2. Topics to Be Covered
     • Part I: Using PAPI
       • Finding the best blocking factor to reduce cache miss rate
       • Getting a complete picture of the system hardware
     • Part II: Using SimpleScalar to find the best branch predictor size
     • Part III: Using SimpleScalar to find the best TLB configuration as well

  3. What is PAPI?
     • Performance Application Programming Interface
     • Developed at the University of Tennessee’s Innovative Computing Laboratory
     • Provides access to the hardware performance counters found on most modern microprocessors
     • Easy to use, well documented, and freely available

  4. Events
     • Events are occurrences of specific signals related to a processor’s function
     • Hardware performance counters are a small set of registers that count events while the program executes on the processor, such as:
       • Cache misses
       • Floating point operations

  5. C calling interface
     • Function calls are declared in the header file “papi.h”
     • Each call has the following form: return_type PAPI_function_name(arg1, arg2, …)
     • The return value can be a pointer to a structure or a plain value

  6. PAPI timers
     • Can be used to obtain both real and virtual time
     • The real-time clock runs all the time (like a wall clock), while the virtual-time clock runs only when the processor is running in user mode
     • Real time can be obtained in clock cycles and in microseconds by calling the following low-level functions, respectively:
       • PAPI_get_real_cyc()
       • PAPI_get_real_usec()

  7. System information
     • Executable information: PAPI_get_executable_info()
       Information about the executable’s address space:
       • The beginning of the user program
       • The end of the user program
     • Hardware information: PAPI_get_hardware_info()
       Information about the system hardware:
       • Cycle time of the processor
       • Number of processors in the system

  8. Finding the best blocking factor on Bragg and getting system information
     • Use PAPI to find the best block size (using matrix multiplication)
     • Measure the number of clock cycles for each block size
     • Choose the block size with the minimum number of clock cycles
     • Report system hardware information such as the processor clock rate and the number of processors in the system
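
The search described on this slide can be sketched in C as follows. This is a minimal illustration, not the authors' program: the matrix size `N`, the function names, and the loop order are assumptions, and the standard `clock()` call stands in for the PAPI cycle counter so that the sketch compiles without the PAPI library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define N 128  /* matrix dimension; an assumed size, not taken from the slides */

static double A[N][N], B[N][N], C[N][N];

/* Fill A and B with deterministic integer-valued entries. */
void init_matrices(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)(i + j);
            B[i][j] = (double)(i - j);
        }
}

/* C = A * B, computed one bf-by-bf tile of B at a time so that the tile
 * can stay in the cache while it is reused for every row i. */
void blocked_matmul(int bf) {
    assert(bf > 0);
    memset(C, 0, sizeof C);
    for (int jj = 0; jj < N; jj += bf)
        for (int kk = 0; kk < N; kk += bf)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + bf && j < N; j++) {
                    double sum = 0.0;
                    for (int k = kk; k < kk + bf && k < N; k++)
                        sum += A[i][k] * B[k][j];
                    C[i][j] += sum;  /* accumulate the partial dot product */
                }
}

/* Cost of one run in clock() ticks (a stand-in for a cycle count). */
clock_t time_blocked_matmul(int bf) {
    clock_t start = clock();
    blocked_matmul(bf);
    return clock() - start;
}

/* Return the candidate blocking factor with the smallest measured cost. */
int best_blocking_factor(const int *candidates, size_t count) {
    assert(count > 0);
    int best = candidates[0];
    clock_t best_cost = time_blocked_matmul(best);
    for (size_t i = 1; i < count; i++) {
        clock_t cost = time_blocked_matmul(candidates[i]);
        if (cost < best_cost) {
            best_cost = cost;
            best = candidates[i];
        }
    }
    return best;
}
```

A caller would run `init_matrices()` once and then pass the candidate factors, for example `{8, 16, 32, 64}`, to `best_blocking_factor()`. The winning factor depends on the machine's cache sizes, so the result on another system need not match the one reported for Bragg.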

  9. Results on Bragg system

     Available hardware information.
     -------------------------------------------------------------
     Vendor string and code   : SUN unknown (-1)
     Model string and code    : UltraSPARC I&II (1000)
     CPU revision             : 9.000000
     CPU Megahertz            : 248.000000
     CPU's in an SMP node     : 8
     Nodes in the system      : 1
     Total CPU's in the system: 8
     -------------------------------------------------------------

     Best block size: 8

     bfactor:  8   clock cycles 201801712
     bfactor: 16   clock cycles 208085422
     bfactor: 32   clock cycles 217125792
     bfactor: 64   clock cycles 215792624
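
To put the cycle counts in perspective, the short C sketch below converts them to seconds and to a percentage slowdown relative to the fastest run. The struct and function names are illustrative, and the conversion assumes the counts were taken at the reported 248 MHz clock rate, which the slides do not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* One result row: blocking factor and measured clock cycles. */
struct measurement {
    int bfactor;
    long long cycles;
};

/* Values copied from the Bragg results. */
static const struct measurement bragg_results[] = {
    { 8, 201801712LL},
    {16, 208085422LL},
    {32, 217125792LL},
    {64, 215792624LL},
};

/* Index of the row with the fewest cycles. */
size_t fastest_index(const struct measurement *rows, size_t count) {
    assert(count > 0);
    size_t best = 0;
    for (size_t i = 1; i < count; i++)
        if (rows[i].cycles < rows[best].cycles)
            best = i;
    return best;
}

/* Convert a cycle count to seconds for a clock rate given in MHz. */
double cycles_to_seconds(long long cycles, double mhz) {
    assert(mhz > 0.0);
    return (double)cycles / (mhz * 1e6);
}

/* Extra cycles relative to a baseline, in percent. */
double percent_slower(long long cycles, long long baseline) {
    assert(baseline > 0);
    return 100.0 * (double)(cycles - baseline) / (double)baseline;
}
```

Under that assumption, the best run (blocking factor 8) takes roughly 0.81 s, and factors 16, 32, and 64 cost about 3.1%, 7.6%, and 6.9% more cycles respectively.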

  10. Part II: branch predictor
     • Modify the SimpleScalar parameters for the L1-I cache, L1-D cache, branch predictor, and branch target buffer
     • Obtain 16 different configurations
     • Run four integer and four floating point SPEC2000 benchmarks with these configurations
     • Calculate the CPI for each benchmark and every configuration and plot the results
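
The CPI comparison can be sketched in C as below. CPI is the simulated cycle count divided by the committed instruction count; the helper names and the row-major table layout are assumptions made for illustration, and no measured values from the plots are reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles per instruction for one benchmark run. */
double cpi(unsigned long long cycles, unsigned long long instructions) {
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}

/* Arithmetic mean of `count` CPI values. */
double average_cpi(const double *cpis, size_t count) {
    assert(count > 0);
    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += cpis[i];
    return sum / (double)count;
}

/* `table` holds num_configs rows of num_benchmarks CPI values (row-major).
 * Returns the zero-based row index with the lowest average CPI. */
size_t best_config(const double *table, size_t num_configs,
                   size_t num_benchmarks) {
    assert(num_configs > 0);
    size_t best = 0;
    double best_avg = average_cpi(table, num_benchmarks);
    for (size_t c = 1; c < num_configs; c++) {
        double avg = average_cpi(table + c * num_benchmarks, num_benchmarks);
        if (avg < best_avg) {
            best_avg = avg;
            best = c;
        }
    }
    return best;
}
```

For the experiment on this slide, the table would have 16 rows and 8 columns (four integer plus four floating point benchmarks). The returned index is zero-based, so it would need an offset of one if the configurations are numbered from 1.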

  11. CPI for integer benchmarks

  12. CPI for floating point benchmarks

  13. Average CPI for the integer and floating point benchmarks
     • Config. #14: branch predictor 16 KB, branch target buffer 4 KB, L1 instruction cache 32 KB, and L1 data cache 8 KB

  14. Part III: TLB
     • Vary the instruction TLB from 512 to 1024 entries and the data TLB from 512 to 1024 entries; the L1I and L1D cache sizes are also varied
     • Obtain 16 different configurations
     • Run one integer and one floating point SPEC2000 benchmark for each of these configurations
     • Find the number of clock cycles for each benchmark and every configuration and plot the results
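
As a rough sketch of how a TLB size might be passed to the simulator, the C helper below builds a specification string in the `<name>:<nsets>:<bsize>:<assoc>:<repl>` form accepted by the `-tlb:itlb` and `-tlb:dtlb` options of SimpleScalar's sim-outorder. The helper name is hypothetical, and the 4-way associativity, 4096-byte page size, and LRU replacement used in the usage note are sim-outorder's defaults, assumed here because the slides do not say which values the authors used.

```c
#include <assert.h>
#include <stdio.h>

/* Write "<name>:<nsets>:<page_bytes>:<assoc>:l" into buf, where
 * nsets = entries / assoc and the trailing 'l' selects LRU replacement.
 * Returns the number of characters snprintf would have written. */
int format_tlb_spec(char *buf, size_t len, const char *name,
                    unsigned entries, unsigned assoc, unsigned page_bytes) {
    assert(assoc > 0 && entries % assoc == 0);
    unsigned nsets = entries / assoc;
    return snprintf(buf, len, "%s:%u:%u:%u:l", name, nsets, page_bytes, assoc);
}
```

With these assumed defaults, a 1024-entry instruction TLB would be written as `itlb:256:4096:4:l` and a 512-entry data TLB as `dtlb:128:4096:4:l`.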

  15. Number of clock cycles for the integer benchmark

  16. Number of clock cycles for the floating point benchmark

  17. Average number of clock cycles of the integer and floating point benchmarks
     • 16 KB L1 instruction cache, 16 KB L1 data cache, 1024-entry instruction TLB, and 512-entry data TLB

  18. Questions? Thank you…
