High Performance Large Scale Datasets Clustering Based on Gustafson-Kessel Fuzzy Technique

Lycom Dmitri (1), Trzasala Milena (2)
Students, Gr. 8BM23 (1), AO-184 (2)
(1) Tomsk Polytechnic University, Russia
(2) Wroclaw University of Technology, Poland
Scientific advisor: Dr. Sergey V. Axyonov

Gaia project
European Space Agency (ESA) mission
• Launch date: 2013
• Mission end: after 5 years (2018)
• Launch vehicle: Soyuz-Fregat
• Orbit: Lissajous-type orbit around L2
Objectives: Gaia is an ambitious mission to chart a three-dimensional map of our Galaxy, the Milky Way, in the process revealing the composition, formation and evolution of the Galaxy.
Mission: Produce a stereoscopic and kinematic census of about one billion stars in our Galaxy and throughout the Local Group. Detection and orbital classification of tens of thousands of extra-solar planetary systems.
Parallel-Distributed DBSCAN Clustering Technique for High Performance ESA Dataset Processing

Datasets & Attributes
• Analyzing Stars Features
• Testing Datasets

Objective & Tasks
Motivation
• Presence of large-scale datasets from the Gaia space project
• Need for fast and accurate analysis of the received information
• The technique can be used to process other huge industrial databases
Objective
• High-performance vector clustering techniques for rapid and reliable ESA dataset analysis
Tasks
• Design a technique that provides effective data distribution and supercomputer processing.
• Create a software implementation of the suggested technique that can run on a computational cluster.
• Test and estimate the performance of the developed software with ESA test datasets.

DBSCAN Clustering technique
• Finds clusters (groups of points) of any shape.
• Based on point density.
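To make the density idea concrete, here is a minimal single-process sketch of textbook DBSCAN in Python. It is not the authors' C# implementation; the names `dbscan`, `region_query`, `eps` and `min_pts` are illustrative assumptions, and the neighbor search is a brute-force scan rather than an index.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep growing
    return labels


# Two dense 2x2 squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With these inputs the two squares receive labels 0 and 1 and the isolated point is marked as noise. The cluster shape is never assumed; it emerges from chaining neighboring core points.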

Data Distribution Problem
• Each process, linked to a computational node, gets just a part of the data (the result of MPI_Scatter in the Message Passing Interface).
• Processes can't detect clusters because neighboring points are located in different memory spaces.
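The distribution problem can be illustrated without an MPI runtime. The sketch below only mimics the effect of a scatter by slicing a list into contiguous chunks; `scatter`, `neighbor_count` and the sample data are hypothetical names chosen for illustration, not part of MPI or of the authors' program.

```python
from math import dist


def scatter(points, n_procs):
    """Split points into n_procs contiguous, near-equal chunks
    (a stand-in for what each process would receive after a scatter)."""
    size, rest = divmod(len(points), n_procs)
    chunks, start = [], 0
    for rank in range(n_procs):
        end = start + size + (1 if rank < rest else 0)
        chunks.append(points[start:end])
        start = end
    return chunks


def neighbor_count(p, points, eps):
    """Number of points within eps of p (p itself included if present)."""
    return sum(1 for q in points if dist(p, q) <= eps)


# One dense line of eight points, one unit apart.
points = [(float(x), 0.0) for x in range(8)]
chunks = scatter(points, 2)

p = points[3]  # last point held by process 0
global_count = neighbor_count(p, points, 1.0)    # sees both neighbors
local_count = neighbor_count(p, chunks[0], 1.0)  # right neighbor is in process 1
```

Here the point at x = 3 has three points in its neighborhood when the whole dataset is visible, but only two inside its own chunk. With a density threshold of three it would be a core point globally and fail the test locally, so a single cluster is cut at the chunk border.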

Gustafson-Kessel Fuzzy Clustering
• Generates clusters with hyperellipsoidal shape ⇒ more natural clustering.
• Computes a fuzzy matrix that reflects distances between vectors and centers based on point density ⇒ arrangement of points for indexing and boundary point detection.
• The base computations of the clustering are matrix operations ⇒ easy to parallelize.
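For reference, the standard textbook form of one Gustafson-Kessel iteration is given below. The slides do not state which variant the authors use, so the notation is an assumption: N vectors x_k of dimension n, c clusters with centers v_i, fuzzy membership matrix U = [u_ik], fuzzifier m > 1, and cluster volumes ρ_i (commonly set to 1).

```latex
\begin{align}
  v_i &= \frac{\sum_{k=1}^{N} u_{ik}^{m}\, x_k}{\sum_{k=1}^{N} u_{ik}^{m}}
      && \text{(cluster center)} \\
  F_i &= \frac{\sum_{k=1}^{N} u_{ik}^{m}\,(x_k - v_i)(x_k - v_i)^{\mathsf{T}}}
              {\sum_{k=1}^{N} u_{ik}^{m}}
      && \text{(fuzzy covariance matrix)} \\
  A_i &= \bigl[\rho_i \det(F_i)\bigr]^{1/n} F_i^{-1}
      && \text{(norm-inducing matrix)} \\
  d_{ik}^{2} &= (x_k - v_i)^{\mathsf{T}} A_i\, (x_k - v_i)
      && \text{(cluster-specific distance)} \\
  u_{ik} &= \frac{1}{\sum_{j=1}^{c} \bigl(d_{ik}/d_{jk}\bigr)^{2/(m-1)}}
      && \text{(membership update)}
\end{align}
```

Because each cluster carries its own matrix A_i, the level sets of the distance are hyperellipsoids rather than spheres. Every step is a sum of outer products, a determinant, an inverse or a quadratic form, which is why the computation maps naturally onto parallel matrix operations.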

Preclustering
• Cluster only a part of the data ⇒ faster processing.
• Use several computational nodes with different data ⇒ reliability of analysis.

The Most General Clustering Search
• Detection of the most effective splitting in terms of density and distances between vectors. This distribution reflects the nature of the data.

Clusters join
• Each point has a fuzzy vector that reflects distances between the point and the centers of clusters.
• Perform the “Join” operation if distances between boundary points in two processes are less than the radius of the neighborhood.
• Only a very small part of the clustered points, those located at the borders between clusters, is needed ⇒ speedup.
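The join rule can be sketched as a union-find pass over boundary points only. This is an illustrative assumption about how such a merge might be organized, not the authors' code; `join_clusters`, the `(rank, local_cluster_id)` keys and the sample coordinates are hypothetical.

```python
from math import dist


def find(parent, x):
    """Return the representative of x, shortening the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def join_clusters(boundaries, eps):
    """Merge local clusters from different processes.

    boundaries maps (rank, local_cluster_id) to that cluster's boundary points.
    Two clusters are joined when some pair of their boundary points lies
    closer than eps, the neighborhood radius.
    Returns a map from each key to the key representing its global cluster.
    """
    keys = sorted(boundaries)
    parent = {k: k for k in keys}
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if a[0] == b[0]:
                continue  # same process: already separated by local clustering
            if any(dist(p, q) < eps for p in boundaries[a] for q in boundaries[b]):
                ra, rb = find(parent, a), find(parent, b)
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)
    return {k: find(parent, k) for k in keys}


# Process 0 holds one cluster; process 1 holds two.
boundaries = {
    (0, 0): [(3.0, 0.0)],
    (1, 0): [(3.5, 0.0)],   # closer than eps to the cluster on process 0
    (1, 1): [(20.0, 0.0)],  # far from everything else
}
merged = join_clusters(boundaries, eps=1.0)
```

In this example the cluster from process 0 and the first cluster from process 1 end up with the same representative, while the distant cluster stays separate. Only boundary points are compared, so the cost depends on the size of the borders and not on the full dataset.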

Implementation
• The program is developed with MS Visual C#
• Libraries of Parallel Computing .NET 4 (multi-core processing)
  • Matrix operations in GK clustering
  • Cluster equivalence (The Most General Clustering search)
• Libraries of Distributed Programming MPI.NET (multi-node processing)
  • Data Scattering (for Preclustering)
  • Data Distribution (for DBSCAN Procedure)
  • Data Broadcasting (The Most General Clustering sending)
  • Clusters join (Point-to-Point Communication)
• The program was tested on the SKIF-TPU computational cluster (20 computational nodes, 80 processors used)

Performance
• Performance of the suggested technique when analyzing the Synth-105 dataset in a distributed manner.
⇒ The approach increases the performance of the algorithm.

Results
• Designed a new original approach to parallel-distributed large-scale dataset clustering.
• Created a C# program that implements the suggested model based on MPI.NET and the parallel libraries of .NET 4.
• Tested the software with the Gaia test datasets to estimate the performance of the suggested approach.
• The suggested technique increases DBSCAN performance: parallel-distributed processing is much more effective than the standard DBSCAN method.

Thank you for your attention
