BeBOP: Automatic Performance Tuning of Sparse Matrix Kernels
Richard Vuduc, James Demmel, Katherine Yelick, Yozo Hida, Michael deLorimier, Shoaib Kamil, Rajesh Nishtala, Benjamin Lee
Berkeley Benchmarking and OPtimization Group
www.cs.berkeley.edu/~richie/bebop

Context
The performance of many applications is dominated by a few computational kernels.
• Applications need fast kernels
  • Scientific computing; information retrieval: dense and sparse linear algebra
  • Multimedia; audio and image processing: fast transforms
  • Databases: sorting
  • Security: "cryptokernels" (e.g., modular exponentiation)
• Hardware complexity is increasing
  • Microprocessor performance is difficult to model
  • Widening processor-memory gap; deep memory hierarchies

An Approach to Automatic Tuning
For each kernel, identify and generate a space of implementations, and search for the best one.
• Implementation space
  • Conceptually, the set of "interesting" implementations
  • Depends on the kernel and its input
  • Implementations may vary in:
    • Instruction mix and order
    • Memory access patterns
    • Data structures and precisions
    • Mathematical formulation
• Searching
  • How? Exhaustively, or heuristically, guided by models
  • When? Once per kernel and platform, at compile time, at run time, or via hybrid approaches

Tuning Sparse Matrix-Vector Multiply
The SPARSITY system (Im & Yelick, 1999) applies this methodology to y = Ax, where A is sparse.
• Sparse matrix data structures
  • Store only the non-zeros
  • Cost: data structure overhead plus irregular memory access
• Implementation space
  • Register blocking: exploit existing dense blocks (see the sketch below)
  • Cache blocking: create better locality in x and y
  • Multiple vectors: reuse each element of A across several vectors (see the sketch below)
• Searching example: selecting a register block size
  • Off-line: a one-time characterization of performance on a dense matrix stored in sparse format, for all r and c
  • At run time: estimate the r x c fill and choose r, c to maximize Mflops_dense(r,c) / Fill(r,c) (the selection rule is sketched below)
Figures:
• Register blocking example: a portion of a sparse matrix with a 4x3 register-block grid superimposed.
• Register blocking profile: a one-time characterization of the machine (Mflop/s).
• Why do these four profiles look so different? We hope to answer this question and understand the implications for current and future architectures and applications.
• SPARSITY: performance improvement after run-time block-size selection.
• Cache blocking: performance at various cache block sizes on a latent semantic indexing matrix.
• Multiple vectors: significant speedups are possible when multiplying by several vectors.

Extensions to New Kernels
Preliminary results for symmetric A, A^T A, and triangular solve (detailed below).
Figures:
• Exploiting new structures: a symmetric matrix from a fluid flow modeling problem (left); a triangular matrix from LU factorization (right).
• Exploiting symmetry: when A = A^T, only half of the matrix need be stored, and each stored element is used twice.
• A^T A times a vector: the matrix A is brought through the memory hierarchy only once.
• Sparse triangular solve: the implementation design space includes SSE2 instructions and "switch-to-dense."

Future Work
Integrating with applications, new kernels, further automation, and understanding architectural implications (detailed below).

Observations and Experience
Performance tuning is tedious and time-consuming work. This approach has been applied successfully in dense linear algebra (PHiPAC '97; ATLAS '98) and signal processing (FFTW '98; SPIRAL '00), among others.
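To make the register-blocking optimization concrete, the following is a minimal sketch in C of a register-blocked sparse matrix-vector multiply using a block compressed sparse row (BCSR) layout with a fixed 2x2 block size. The array names (bptr, bind, val) and the fixed block size are illustrative assumptions, not the code SPARSITY generates; the generated kernels cover a range of r x c block sizes, and the search described above picks among them.

/* Sketch: y += A*x for A in 2x2 block compressed sparse row (BCSR) format.
 *   bptr[i] .. bptr[i+1]-1 : block entries belonging to block row i
 *   bind[k]                : block-column index of block k
 *   val[4*k .. 4*k+3]      : the 2x2 block of values, stored row-major
 * Block size and array names are assumptions for this illustration.
 */
void bcsr_spmv_2x2(int n_block_rows, const int *bptr, const int *bind,
                   const double *val, const double *x, double *y)
{
    for (int i = 0; i < n_block_rows; i++) {
        /* Keep the two destination values in registers across the row. */
        double y0 = y[2*i], y1 = y[2*i + 1];
        for (int k = bptr[i]; k < bptr[i + 1]; k++) {
            const double *a = &val[4*k];
            double x0 = x[2*bind[k]], x1 = x[2*bind[k] + 1];
            y0 += a[0]*x0 + a[1]*x1;
            y1 += a[2]*x0 + a[3]*x1;
        }
        y[2*i] = y0;
        y[2*i + 1] = y1;
    }
}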
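The run-time block-size selection rule, maximize Mflops_dense(r,c) / Fill(r,c), amounts to a small search over the off-line machine profile. In this sketch the profile and the estimated fill ratios are assumed to be given as inputs (in practice the fill would be estimated by sampling the matrix); it is a sketch of the selection rule, not the SPARSITY implementation itself.

/* Sketch of run-time register block-size selection: choose (r, c) that
 * maximizes Mflops_dense(r,c) / Fill(r,c).  mflops_dense and fill are
 * rmax-by-cmax tables stored row-major; both are assumed inputs here.
 */
void choose_block_size(int rmax, int cmax,
                       const double *mflops_dense, const double *fill,
                       int *r_best, int *c_best)
{
    double best = 0.0;
    *r_best = 1;
    *c_best = 1;
    for (int r = 1; r <= rmax; r++) {
        for (int c = 1; c <= cmax; c++) {
            double score = mflops_dense[(r-1)*cmax + (c-1)]
                         / fill[(r-1)*cmax + (c-1)];
            if (score > best) {
                best = score;
                *r_best = r;
                *c_best = c;
            }
        }
    }
}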
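The multiple-vectors optimization reuses each loaded element of A across k right-hand sides. A minimal CSR sketch follows; the array names and the row-major layout of the vector blocks X and Y are assumptions for illustration (the tuned kernels also combine this with register blocking).

/* Sketch: Y = A*X for CSR A (m rows) and k vectors stored row-major,
 * i.e., X is n-by-k and Y is m-by-k.  Each element of A is loaded once
 * and reused k times.  Array names are assumptions for illustration.
 */
void csr_spmm(int m, int k, const int *ptr, const int *ind,
              const double *val, const double *X, double *Y)
{
    for (int i = 0; i < m; i++) {
        for (int v = 0; v < k; v++)
            Y[i*k + v] = 0.0;
        for (int p = ptr[i]; p < ptr[i + 1]; p++) {
            double a = val[p];                  /* loaded once ...    */
            const double *xrow = &X[ind[p] * k];
            for (int v = 0; v < k; v++)         /* ... reused k times */
                Y[i*k + v] += a * xrow[v];
        }
    }
}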
Extensions to New Kernels
• Symmetric sparse matrix-vector multiply
  • Store only half of the non-zeros
  • Reuse each stored element twice (sketched below)
• Sparse triangular solve
  • Compute T^-1 x, where T is a sparse triangular matrix
  • Exploit the naturally occurring dense structure when T comes from certain applications, such as LU factorization (a baseline solver is sketched below)
• Multiplying A^T A by a vector
  • A can be brought through the memory hierarchy only once (sketched below)
  • Arises in linear programming problems, among others
(The SPARSITY system optimizations also apply to these kernels!)

Future Work
• Integration with applications and existing software libraries
• Creating (via reordering) or exploiting other matrix structures
• New sparse kernels (e.g., matrix powers A^k, the triple product A^T M A)
• Further automation: generating implementation generators
• Understanding performance in terms of the underlying architectures

Figures:
• Platform variability: distribution of performance over dense matrix multiply implementations on 8 different platforms (architecture + compiler). Performance tuning done for any one platform must be redone for each new platform.
• Needle in a haystack: a planar slice of a large space of mathematically equivalent dense matrix multiply implementations. Each square is an implementation, color-coded by its performance (Mflop/s) on a 333 MHz Sun Ultra-IIi based workstation. It is not obvious how to model this space.
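A minimal sketch of the symmetric kernel: with only the upper triangle (including the diagonal) stored in CSR format, each stored off-diagonal element is loaded once and used to update both y[i] and y[j]. The storage convention and array names are assumptions for illustration.

/* Sketch: y = A*x for symmetric A, storing only the upper triangle
 * (diagonal included) in CSR format.  Each stored off-diagonal element
 * is used twice per load.  Array names are assumptions for illustration.
 */
void sym_spmv_upper_csr(int n, const int *ptr, const int *ind,
                        const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int i = 0; i < n; i++) {
        for (int k = ptr[i]; k < ptr[i + 1]; k++) {
            int j = ind[k];
            y[i] += val[k] * x[j];
            if (j != i)                /* off-diagonal: symmetric update */
                y[j] += val[k] * x[i];
        }
    }
}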
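A baseline sparse triangular solve is forward substitution over the rows of T. The sketch below assumes a lower triangular T in CSR format with the diagonal entry stored last in each row; the tuned implementations explore the design space mentioned above, for example switching to a dense kernel on the dense trailing block that arises in LU factors.

/* Sketch: solve T*x = b for lower triangular T in CSR format, assuming
 * each row stores its strictly-lower entries first and its diagonal
 * entry last.  This is the reference algorithm only; the storage
 * convention and array names are assumptions for illustration.
 */
void sparse_lower_trsv_csr(int n, const int *ptr, const int *ind,
                           const double *val, const double *b, double *x)
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        int k = ptr[i];
        for (; k < ptr[i + 1] - 1; k++)   /* subtract the strictly-lower part */
            s -= val[k] * x[ind[k]];
        x[i] = s / val[k];                /* divide by the diagonal entry     */
    }
}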
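The A^T A kernel can be computed in a single pass over A: each row a_i is read once and used both for the dot product t = a_i . x and for the update y += t * a_i, so A moves through the memory hierarchy only once. CSR array names are assumptions for illustration; the SPARSITY optimizations, such as register blocking, also apply to this kernel.

/* Sketch: y = A^T * (A * x) in one pass over an m-by-n CSR matrix A.
 * Each row of A is read once and used twice.  Array names are
 * assumptions for illustration.
 */
void atax_csr(int m, int n, const int *ptr, const int *ind,
              const double *val, const double *x, double *y)
{
    for (int j = 0; j < n; j++)
        y[j] = 0.0;
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i + 1]; k++)   /* t = a_i . x  */
            t += val[k] * x[ind[k]];
        for (int k = ptr[i]; k < ptr[i + 1]; k++)   /* y += t * a_i */
            y[ind[k]] += t * val[k];
    }
}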