150 likes | 238 Views
Improving Parallelism in Structural Data Mining. Min Cai , Istvan Jonyer, Marcin Paprzycki Computer Science Department, Oklahoma State University, Stillwater, Oklahoma 74078, U.S.A. Who am I?. Min Cai: cmin@cs.okstate.edu
E N D
Improving Parallelism in Structural Data Mining Min Cai, Istvan Jonyer, Marcin Paprzycki Computer Science Department, Oklahoma State University, Stillwater, Oklahoma 74078, U.S.A.
Who am I? • Min Cai: cmin@cs.okstate.edu • Ph.D. Student of Computer Science Department at Oklahoma State University • Research Interests: • Parallel and distributed computing • Data Mining
Introduction • Data warehouses of increasing size • Data mining technique for discovering interesting properties in data • structural data mining • data represented as a graph • aim substructure discovery finding “interesting” and recurring subgraphs in a labeled graph
SUBDUE (1) • Discovers substructures utilizing minimum description length (MDL) principle • Cook, D.J., Holder, L.B., G alal, G., Maglothin, R.: Approaches to Parallel Graph-Based Knowledge Discovery. Journal of Parallel and Distributed Computing, 61(3) (2001) 427-446 • Data objects graph vertices • Relationships graph edges • Substructure connected subgraph • NOTE graph algorithms are notorious for long execution times
SUBDUE (2) • Algorithm two basic steps • substructure discovery • apply minimal description length (MDL) principle to find the “best” / “most important” structure in the graph • possibly stop here this is the answer • substructure replacement • replace the substructure found in the first step by a single vertex and repeat the process • results • single substructure • hierarchy of substructures
Parallel SUBDUE • Data-parallel approach • Graph divided into subgraphs and send to separate processor • Processors find their best structure and communicate with the rest • The best overall substructure is found • Hierarchical process can be repeated
MPI-SUBDUE • Graph divided into subgraphs using METIS • point-to-point communications (MPI_Send and MPI_Recv) used to communicate between processors • NOTE best structure in data set “7” may be dreadful when confronted with data set “18” • Galal, G.M., Cook, D.J., Holder, L.B.: Improving Scalability in a Knowledge Discovery System by Exploiting Parallelism In the Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (1997) 171-174
NEW-MPI-SUBDUE • Improvements • use PARMETIS to divide the initial graph • use global communication (MPI_Allgatherv) • use binary summation • YES, these changes do not look like much
NEW-MPI-SUBDUE Spawn P(0), P(1), P(2), ... , P(n) Apply PARMETIS to partition G into n partitions for all P(i) where 1 ≤ i ≤ n do discover the best substructure in partition broadcast best substructure to all other processors evaluate best substructure and broadcast results parallel-binary summation of results to find the best overall partition P(0) finds the best overall structure
EXPERIMENTAL SETUP • Mutagenesis data from OxUni • datasets collected in order to predict mutagenicity of aromatic and heteroaromatic nitro compounds • Graph 1 • 2844 vertices and 2883 edges • Graph 2 • 2896 vertices and 2934 edges • Graph 3 • 22268 vertices and 22823 edges • 16 node cluster (32 processors) • two AMD Athlon MP 1800+ (1.6GHz) CPUs, 2 GB of DDR SDRAM, full-backplane Gigabit Ethernet switch • RedHat Linux 9.0, MPICH, Portland Group C compiler 5.0-2
COMMENTS • Graphs 1 and 2 that were large in 2000 are small and “useless” today • Graph 3 gives realistic performance picture gains about 33% • Speedup over original SUBDUE 268 on 32 processors for Graph 3 • this IS “cheating” as some information may be lost due to graph partitioning but… • Graph partitioning and balancing matter
THE END THANK YOU!