A Web Service for Distributed Covariance Computation on Astronomy Catalogs. Presented by Haimonti Dutta, CMSC 691D
ROADMAP • Background Information • Interesting Astronomy Data Mining Problems • What has and has not been done (Literature review) • My project objectives • The problem of alignment in astronomy catalogs • The Fundamental Plane • A case study for recreating the Fundamental Plane from astronomy catalogs • Experimental Results • Efforts towards building Web services
Background Information • Next-generation astronomy catalogs will contain data for most of the sky • Existing astronomy sky surveys: SDSS, 2MASS, FIRST, etc. • Terabytes and petabytes of data • Data avalanche in astronomy • Getting useful information is like looking for a needle in a haystack • The National Virtual Observatory (NVO) has been set up to facilitate scientific discovery • Obvious need for Distributed Data Mining
What kinds of Data Mining activities are astronomers interested in? • Detection of transient objects such as supernovae (online transient object detection in real time) • Obtaining statistics of variable and moving objects (model variability, refine existing models, fit models to irregularly sampled data) • Parameterizing shapes of objects using rotationally invariant quantities • Efficient cluster and outlier detection • Supervised Data Mining problems (match objects detected in multiple bands, derive photometric redshifts)
What has and has not been done • Many efforts in centralized data mining (NVO, FMass, Class X, FIRST, etc.) • Some grid mining (notably the GRIST project) • Very few distributed data mining efforts, all in their preliminary stages (http://www.cs.queensu.ca/home/mcconell/DDMAstro.html)
Objectives of this project • Alignment of catalogs (the Fundamental Plane problem) • Implementation of algorithms for Distributed Data Mining on astronomy catalogs • Development of web services for the catalogs and investigation into what needs to be done to integrate them into the NVO
Alignment of Astronomy Catalogs • Cross matching is a nontrivial problem in itself. • We assume cross matching happens offline and that there exists an indexing scheme by which the catalogs know the exact cross-matched tuples.
Some interesting numbers • The current SDSS catalogs are 3.0 TB in size and contain about 180 million objects (as per Data Release 4) • 2MASS has already observed 99% of the sky and reports 470,992,970 point sources and 1,647,599 extended sources [Figure: portion of the sky observed by SDSS]
Problems • Cross matching is an inherently difficult problem for astronomy catalogs • We assume the data sets are cross matched and that this computation is done offline • This is a strong assumption and may often not be acceptable to astronomers
A real-life cross matching exercise: problems encountered • Which catalogs to use? We tried several: SDSS, 2MASS, HyperLeda, CfA Redshift Catalog • Catalogs have different indexing schemes: more recent ones use HTM (Hierarchical Triangular Mesh), others use (ra, dec) or even the names of objects • Some attributes are simply not available (SDSS has -9999 for most of its redshift values) • Different catalogs observe different portions of the sky (SDSS covers only about 16% of the sky in the latest release, while 2MASS covers the entire sky), so the subsets to cross match must be selected wisely
The successful cross matching • Chose a region of the sky with declination between 0 and 15 degrees and right ascension between 150 and 200 degrees, observed by both SDSS and 2MASS • Used a web interface provided by SDSS to do the cross matching • Selected the K-band for obtaining redshift and surface brightness (astronomical significance)
Case Study • Centralized database of 1249 cross-matched objects • Attributes are size, surface brightness, and velocity dispersion • A data set this small does not really make a case for a distributed data mining scenario • Solution: try a larger subset of the data from both catalogs
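The region selection and the -9999 placeholder for missing redshifts described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the function name `select_region`, the argument layout, and the sample values are hypothetical and are not taken from SDSS, 2MASS, or the presentation.

```python
import numpy as np

MISSING = -9999.0  # placeholder the slides report for unavailable SDSS redshifts

def select_region(ra, dec, z, ra_range=(150.0, 200.0), dec_range=(0.0, 15.0)):
    """Return indices of rows inside the sky box that have a usable redshift.

    ra, dec are in degrees; z is the redshift column (hypothetical layout).
    """
    ra, dec, z = (np.asarray(x, dtype=float) for x in (ra, dec, z))
    in_box = (
        (ra >= ra_range[0]) & (ra <= ra_range[1])
        & (dec >= dec_range[0]) & (dec <= dec_range[1])
    )
    has_redshift = z != MISSING
    return np.flatnonzero(in_box & has_redshift)

# Made-up rows: only the first one is inside the box and has a redshift.
sample_ra = [160.0, 100.0, 170.0, 180.0]
sample_dec = [5.0, 5.0, 20.0, 10.0]
sample_z = [0.05, 0.10, 0.02, MISSING]
kept = select_region(sample_ra, sample_dec, sample_z)
print(kept)
```

Filtering each catalog to the shared region before cross matching keeps the comparison restricted to sky that both surveys actually observed.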
The Fundamental Plane • An interesting problem in astronomy: identifying correlations in high dimensional spaces • For the class of elliptical and spiral galaxies • Observed features: radius, mean surface brightness, and central velocity dispersion • A two-dimensional plane exists in the three-dimensional space of observed parameters, called THE FUNDAMENTAL PLANE
Experimental Results • The first PC captured 69.4193% of the variance • The second PC captured 12.1333% of the variance (81.5526% together) • The astronomy literature suggests the 1st and 2nd PCs together should capture about 88% of the variance • A reasonably close recreation of the Fundamental Plane from two cross-matched data sets in the centralized setting
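The variance percentages above are the kind of quantity obtained from the eigenvalues of the covariance matrix. The following is a minimal sketch of that calculation on synthetic data, not the code or data used in the presentation. The function name `explained_variance_ratio` and the synthetic three-attribute data set are assumptions, and no attribute scaling is applied because the slides do not say whether any was used.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component.

    X has one row per object and one column per attribute.
    """
    Xc = X - X.mean(axis=0)                   # mean-center each attribute
    cov = Xc.T @ Xc / Xc.shape[0]             # covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues, largest first
    return eigvals / eigvals.sum()

# Synthetic stand-in for three observed attributes lying near a plane.
rng = np.random.default_rng(0)
u = rng.normal(size=1249)
v = rng.normal(size=1249)
X = np.column_stack([u, v, 0.6 * u + 0.4 * v + 0.05 * rng.normal(size=1249)])

ratios = explained_variance_ratio(X)
print("first two PCs capture", ratios[:2].sum())
```

When the points lie close to a plane, the first two ratios sum to nearly 1 and the third is small, which is the signature the Fundamental Plane experiment looks for.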
Algorithm for Distributed Covariance Computation • A central coordination site S sends A and B a random number generation seed • A and B each generate the same $n \times l$ random matrix $R$, where $l \ll n$ • A and B send S the projections $R^{T}A$ and $R^{T}B$ • S computes $(R^{T}A)^{T}(R^{T}B) / n$
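The steps above can be sketched in NumPy as follows. This is an illustrative sketch rather than the presenter's implementation. The slide does not state how the entries of R are distributed, so the code assumes i.i.d. Gaussian entries with mean 0 and variance 1/l (which makes the expected value of R R^T the identity) and assumes the columns of A and B are mean-centered. Under those assumptions the quantity computed at S estimates the cross-covariance A^T B / n. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def project(data, seed, l):
    """Run at a data site: rebuild R from the shared seed, return R^T @ data."""
    n = data.shape[0]
    rng = np.random.default_rng(seed)
    # Assumption: entries are N(0, 1/l), so that E[R @ R.T] is the identity.
    R = rng.normal(0.0, 1.0 / np.sqrt(l), size=(n, l))
    return R.T @ data  # shape (l, number of local attributes)

def estimate_cross_covariance(proj_a, proj_b, n):
    """Run at site S: (R^T A)^T (R^T B) / n."""
    return proj_a.T @ proj_b / n

# Synthetic example: 2 attributes at site A, 1 attribute at site B.
n, l, seed = 1249, 300, 42
rng = np.random.default_rng(7)
shared = rng.normal(size=n)
A = np.column_stack([shared + 0.5 * rng.normal(size=n), rng.normal(size=n)])
B = (shared + 0.5 * rng.normal(size=n))[:, None]
A -= A.mean(axis=0)
B -= B.mean(axis=0)

approx = estimate_cross_covariance(project(A, seed, l), project(B, seed, l), n)
exact = A.T @ B / n
print("estimate:", approx.ravel())
print("exact:   ", exact.ravel())
```

Each site ships only l rows per attribute instead of n, which is where the communication saving comes from. The estimate is noisy, and its error shrinks as l grows.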
Experimental Results: Distributed Setting Case Study • 1249 cross-matched objects at sites A and B • 2 attributes at site A and 1 attribute at site B
Development of a Web Service • Architecture of the Proposed System [Diagram: SITE A, SITE B, and a CLIENT connected by SOAP messages to the WEB SERVICE for Distributed Covariance Computation]
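To make the SOAP messages in the architecture concrete, here is a rough sketch of how a site's projected matrix might be wrapped in a SOAP 1.2 envelope using only the Python standard library. The envelope namespace is the standard SOAP 1.2 one. Everything else is hypothetical: the service namespace `urn:example:covariance`, the operation name `submitProjection`, the `row` element, and the `site` attribute. None of it is taken from the actual service or from Apache Axis.

```python
import xml.etree.ElementTree as ET

SOAP12_NS = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
SVC_NS = "urn:example:covariance"                      # hypothetical service namespace

def build_projection_message(site, projected_rows):
    """Wrap a site's projected matrix in a SOAP 1.2 envelope (illustrative layout)."""
    ET.register_namespace("env", SOAP12_NS)
    ET.register_namespace("cov", SVC_NS)
    envelope = ET.Element(f"{{{SOAP12_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP12_NS}}}Body")
    # Hypothetical operation element carrying the name of the sending site.
    op = ET.SubElement(body, f"{{{SVC_NS}}}submitProjection", {"site": site})
    for row in projected_rows:
        row_el = ET.SubElement(op, f"{{{SVC_NS}}}row")
        row_el.text = " ".join(repr(float(v)) for v in row)
    return ET.tostring(envelope, encoding="unicode")

message = build_projection_message("A", [[1.0, 2.0], [3.5, -0.25]])
print(message)
```

Encoding matrix rows as text inside the body is the simplest option. For large projections a binary payload, such as SOAP with attachments, would likely be more practical.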
Current Implementation • Uses Apache Axis (a SOAP engine: a framework for building SOAP processors such as clients and servers) • Tomcat version 4.1 • SOAP version 1.2 • Short demo • Further system development issues (use of SOAP with attachments)