A Web service for Distributed Covariance Computation on Astronomy Catalogs

Presented by Haimonti Dutta, CMSC 691D

ROADMAP • Background information • Interesting astronomy data mining problems • What has and has not been done (literature review) • My project objectives • The problem of alignment in astronomy catalogs • The Fundamental Plane • A case study: recreating the Fundamental Plane from astronomy catalogs • Experimental results • Efforts towards building web services

Background Information • Next-generation astronomy catalogs will contain data for most of the sky • Existing astronomy sky surveys: SDSS, 2MASS, FIRST, etc. • Terabytes and petabytes of data • Data avalanche in astronomy • Getting useful information is like looking for a needle in a haystack • The National Virtual Observatory (NVO) has been set up to facilitate scientific discovery • Obvious need for distributed data mining

What kinds of data mining activities are astronomers interested in? • Detection of transient objects such as supernovae (online transient object detection in real time) • Obtaining statistics of variable and moving objects (model variability, refine existing models, fit models to irregularly sampled data) • Parameterizing shapes of objects using rotationally invariant quantities • Efficient cluster and outlier detection • Supervised data mining problems (match objects detected in multiple bands, derive photometric redshifts)

What has and has not been done • A lot of effort in centralized data mining (NVO, FMass, Class X, FIRST, etc.) • Some grid mining (notably the GRIST project) • Very few distributed data mining efforts, all still in their preliminary stages (http://www.cs.queensu.ca/home/mcconell/DDMAstro.html)

Objectives of this project • Alignment of catalogs (the Fundamental Plane problem) • Implementation of algorithms for distributed data mining on astronomy catalogs • Development of web services for the catalogs, and investigation into what needs to be done to integrate them into the NVO

Alignment of Astronomy Catalogs • Cross matching is a nontrivial problem in itself • We assume cross matching happens offline and that an indexing scheme exists by which the catalogs know the exact cross-matched tuples

Some interesting numbers • Size of the current SDSS catalogs: 3.0 TB, containing about 180 million objects (as of Data Release 4) • 2MASS has already observed 99% of the sky and reports 470,992,970 point sources and 1,647,599 extended sources • (Figure: portion of the sky observed by SDSS)

Problems • Cross matching is an inherently difficult problem for astronomy catalogs • We assume the data sets are cross matched and that this computation is done offline • This is a strong assumption and may often be unacceptable to astronomers

A real-life cross matching exercise: problems encountered • Which catalogs to use? We tried several: SDSS, 2MASS, HyperLeda, and the CfA Redshift Catalog • Catalogs have different indexing schemes: more recent ones use HTM (Hierarchical Triangular Mesh), others use (ra, dec) or even the names of objects • Some attributes are simply not available (SDSS has -9999 for most of its redshift values) • Different catalogs observe different portions of the sky (SDSS covers only about 16% of the sky in its latest release, while 2MASS covers the entire sky), so the subsets to cross match must be selected wisely
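To make two of the problems above concrete, here is a minimal sketch of a positional match on (ra, dec) that also skips rows carrying the -9999 sentinel. It is an illustration only, not the procedure used in this project: the row layout, the field names (`id`, `ra`, `dec`, `z`), the 2 arcsecond tolerance, and the toy coordinates are all assumptions, and the brute-force scan stands in for the spatial indexing (such as HTM) that real catalogs rely on.

```python
import math

MISSING = -9999  # sentinel the slides report for unavailable SDSS redshift values

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_match(cat_a, cat_b, tol_arcsec=2.0):
    """Pair each usable row of cat_a with its nearest cat_b row within the tolerance."""
    tol_deg = tol_arcsec / 3600.0
    pairs = []
    for a in cat_a:
        if a["z"] == MISSING:          # skip rows whose redshift is unavailable
            continue
        best, best_sep = None, tol_deg
        for b in cat_b:                # brute force: O(len(cat_a) * len(cat_b))
            sep = angular_sep_deg(a["ra"], a["dec"], b["ra"], b["dec"])
            if sep <= best_sep:
                best, best_sep = b, sep
        if best is not None:
            pairs.append((a["id"], best["id"]))
    return pairs

# Hypothetical toy rows, not real catalog entries.
sdss = [
    {"id": "s1", "ra": 150.1000, "dec": 2.2000, "z": 0.031},
    {"id": "s2", "ra": 180.5000, "dec": 10.0000, "z": MISSING},
    {"id": "s3", "ra": 199.0000, "dec": 14.5000, "z": 0.052},
]
twomass = [
    {"id": "t1", "ra": 150.1002, "dec": 2.2001},
    {"id": "t2", "ra": 180.5000, "dec": 10.0000},
    {"id": "t3", "ra": 120.0000, "dec": 40.0000},
]
matches = cross_match(sdss, twomass)
print(matches)
```

The quadratic scan is only workable for toy inputs; at the catalog sizes quoted earlier, an index over the sky is needed before any pairwise comparison.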

The successful cross matching • Chose a region of the sky with declination between 0 and 15 degrees and right ascension between 150 and 200 degrees, observed by both SDSS and 2MASS • Used a web interface provided by SDSS to do the cross matching • Selected the K-band for obtaining redshift and surface brightness (for its astronomical significance)

Case Study • Centralized database of 1249 cross-matched objects • Attributes are size, surface brightness, and velocity dispersion • This does not really make a case for a distributed data mining scenario. Solution: try a larger subset of the data from both catalogs

The Fundamental Plane • An interesting problem in astronomy: identify correlations in high-dimensional spaces • For the class of elliptical and spiral galaxies • Observed features: radius, mean surface brightness, and central velocity dispersion • A two-dimensional plane exists in the observed 3D parameter space, called THE FUNDAMENTAL PLANE

An illustration of the Fundamental Plane

Experimental Results • The first PC captured 69.4193% of the variance • The second PC captured 12.1333% of the variance, so the two together captured 81.5526% • The astronomy literature suggests the first and second PCs together should capture about 88% of the variance • A reasonably close recreation of the Fundamental Plane from two cross-matched data sets in the centralized setting
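As a rough illustration of how such variance fractions are obtained, the sketch below runs PCA on three attributes by taking the eigenvalues of their sample covariance matrix. The data are synthetic points scattered near a plane, generated purely for illustration; they are not the 1249 cross-matched objects, and the helper names and the Jacobi eigenvalue routine are choices made for this sketch, not the project's implementation.

```python
import math
import random

def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (r[i] - means[i]) * (r[j] - means[j])
    return [[c / (n - 1) for c in row] for row in cov]

def symmetric_eigenvalues(m, sweeps=50, tol=1e-12):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in m]
    d = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
        if off < tol:
            break
        for p in range(d - 1):
            for q in range(p + 1, d):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the (p, q) entry.
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(d):      # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(d):      # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(d)), reverse=True)

# Synthetic 3D points lying close to a plane (illustrative only).
random.seed(0)
rows = []
for _ in range(500):
    u, v = random.gauss(0, 2.0), random.gauss(0, 1.0)   # in-plane coordinates
    w = random.gauss(0, 0.2)                            # small scatter off the plane
    rows.append([u + 0.5 * v, v - 0.3 * u + w, 0.7 * u + 0.2 * v - w])

cov = covariance_matrix(rows)
eig = symmetric_eigenvalues(cov)
fractions = [e / sum(eig) for e in eig]                 # variance share per PC
print([round(100 * f, 2) for f in fractions])
```

When the points lie near a plane, the first two fractions account for almost all of the variance, which is the signature the experiment above looks for.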

Algorithm for Distributed Covariance Computation (A and B denote both the two sites and the data matrices they hold) • A central coordination site S sends A and B a random number generation seed • A and B each generate the same n × l random matrix R, where l << n • A and B send S the projections R^T A and R^T B • S computes (R^T A)^T (R^T B) / n
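The steps above can be sketched in a few lines. This is a minimal single-process simulation under stated assumptions, not the project's code: the entries of R are drawn i.i.d. from N(0, 1/l) so that R R^T is the identity in expectation, the columns at each site are mean-centered beforehand so that A^T B / n is a covariance, and the function names, the seed, the synthetic data, and the choice l = 300 are all hypothetical.

```python
import random

def random_matrix(seed, n, l):
    """n x l matrix with i.i.d. N(0, 1/l) entries, reproducible from the shared seed."""
    rng = random.Random(seed)
    sd = (1.0 / l) ** 0.5
    return [[rng.gauss(0.0, sd) for _ in range(l)] for _ in range(n)]

def center(X):
    """Subtract each column's mean (done locally at each site)."""
    means = [sum(col) / len(X) for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def project(R, X):
    """Return R^T X: l rows instead of n, which is what a site sends to S."""
    n, l, p = len(R), len(R[0]), len(X[0])
    out = [[0.0] * p for _ in range(l)]
    for i in range(n):
        for k in range(l):
            r = R[i][k]
            for j in range(p):
                out[k][j] += r * X[i][j]
    return out

def combine(PA, PB, n):
    """Site S: (R^T A)^T (R^T B) / n, an estimate of the cross-covariance."""
    l, p, q = len(PA), len(PA[0]), len(PB[0])
    return [[sum(PA[k][i] * PB[k][j] for k in range(l)) / n for j in range(q)]
            for i in range(p)]

# Synthetic aligned data: site A holds 2 attributes, site B holds 1.
rng = random.Random(1)
n, l, seed = 1249, 300, 42
A, B = [], []
for _ in range(n):
    t = rng.gauss(0, 1)
    A.append([t + rng.gauss(0, 0.3), rng.gauss(0, 1)])
    B.append([t + rng.gauss(0, 0.3)])
A, B = center(A), center(B)

PA = project(random_matrix(seed, n, l), A)   # computed at site A
PB = project(random_matrix(seed, n, l), B)   # computed at site B, same seed
estimate = combine(PA, PB, n)                # computed at site S
exact = [[sum(a[i] * b[0] for a, b in zip(A, B)) / n] for i in range(2)]
print(estimate, exact)
```

Each site transmits l rows rather than n, so the communication saving grows as l shrinks, at the price of a noisier estimate.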

Experimental Results: Distributed Setting • Case study: 1249 cross-matched tuples at sites A and B • 2 attributes at site A and 1 attribute at site B

More results

Development of a Web Service • Architecture of the proposed system • (Diagram: Site A, Site B, and a client exchange SOAP messages with the web service for distributed covariance computation)

Current Implementation • Uses Apache Axis (a SOAP engine: a framework for building SOAP processors such as clients and servers) • Tomcat version 4.1 • SOAP version 1.2 • Short demo • Further system development issues (use of SOAP with attachments)
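To give a sense of what a site might send to the service, the sketch below wraps a small projected matrix in a SOAP 1.2 envelope and parses it back. It is written in Python for brevity and does not reproduce the Axis-based implementation: only the SOAP 1.2 envelope namespace is standard, while the service namespace, the `submitProjection` operation, the `site` attribute, and the `row` encoding are hypothetical.

```python
import xml.etree.ElementTree as ET

SOAP12_NS = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
SVC_NS = "urn:example:distributed-covariance"          # hypothetical service namespace

def build_request(site, matrix):
    """Wrap a projected matrix in a SOAP 1.2 envelope (hypothetical operation)."""
    ET.register_namespace("env", SOAP12_NS)
    envelope = ET.Element(f"{{{SOAP12_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP12_NS}}}Body")
    op = ET.SubElement(body, f"{{{SVC_NS}}}submitProjection", {"site": site})
    for row in matrix:
        # One element per matrix row, values separated by spaces.
        ET.SubElement(op, f"{{{SVC_NS}}}row").text = " ".join(repr(x) for x in row)
    return ET.tostring(envelope, encoding="unicode")

def parse_request(xml_text):
    """Recover the site label and the matrix from the envelope."""
    root = ET.fromstring(xml_text)
    op = root.find(f"{{{SOAP12_NS}}}Body/{{{SVC_NS}}}submitProjection")
    rows = [[float(x) for x in r.text.split()]
            for r in op.findall(f"{{{SVC_NS}}}row")]
    return op.get("site"), rows

message = build_request("A", [[0.5, -1.25], [2.0, 3.0]])
site, rows = parse_request(message)
print(message)
```

Encoding the matrix as text inside the body is the simplest option; for large projections, the SOAP with attachments route mentioned above would avoid inflating the XML payload.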

QUESTIONS ?
