
Computational AstroStatistics

Presentation Transcript


  1. Computational AstroStatistics • Motivation & Goals • Correlation Functions • Density Estimation Hope to show statisticians the array of interesting problems presently arising in astronomical data analysis. Hope to show physicists efficient methods for computing statistics on massive datasets.

  2. Collaborators Pittsburgh Computational AstroStatistics (PiCA) Group (http://www.picagroup.org) • Chris Miller, Kathy Romer, Andy Connolly, Andrew Hopkins, Ryan Scranton, David Wake, Mariangela Bernardi (Astro) • Larry Wasserman, Chris Genovese, Kinman Au, Pierpaolo Brutti (Statistics) • Andrew Moore, Jeff Schneider, Brigham Anderson, Alex Gray, Dan Pelleg (CS) • Alex Szalay, Gordon Richards, Istvan Szapudi & others (SDSS)

  3. Motivations & Goals: “the data flood” • The last decade was dedicated to building new telescopes and instruments; SDSS produces terabytes of data a night • More to come, e.g. LSST is an SDSS every 5 nights! Petabytes of data by the end of the 00’s • New paradigm: we must build new tools before we can analyze & visualize the data

  4. SDSS

  5. SDSS

  6. SDSS Data: a factor of 12,000,000. “Glass is half full”

  7. Panchromatic Universe Combining these surveys, from X-rays to the CMB, is hard: different sensitivities, resolutions and physics; a mixture of imaging, catalogs and spectra; and the difference between continuum and point processes.

  8. Virtual Observatory • Federate multi-wavelength data sources (interoperability) • Must empower everyone (democratize) • Be fast, distributed and easy • Allow input and output • Facilitate efficient science See Szalay & Gray 2002

  9. Secondary Motivation • Quality of data demands careful analysis • Highly correlated datasets and high dimensionality • Censored, biased and incomplete data sets (redshift-space, flux-limited samples) Different from HEP: our events are all correlated, our background is intrinsically correlated, we deal with continuum processes, and we have just one experiment, run 13.7 billion years ago

  10. Computer Science + Statistics • New statistical tools – e.g. non-parametric analyses – are often computationally intensive • Heavy use of re-sampling and Monte Carlo techniques; need fast and efficient algorithms • Autonomous scientific discovery in large, multi-dimensional, correlated datasets • Distributed computing and fast networks, as the datasets themselves are distributed • Need for new visualization tools A new breed of students with IT skills is needed: a symbiotic relationship (e.g. PiCA)

  11. b) N-point correlation functions The 2-point function ξ(r) has a long history in cosmology (Peebles 1980). It is the excess joint probability dP12 of finding a pair of points in volume elements dV1 and dV2, separated by r, over that expected from a Poisson process: dP12 = n² dV1 dV2 [1 + ξ(r)]. For triplets, dP123 = n³ dV1 dV2 dV3 [1 + ξ(r23) + ξ(r13) + ξ(r12) + ξ123(r12, r13, r23)], where ξ123 is the connected 3-point term.
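
To make the pair-count definition concrete, here is a minimal brute-force sketch of the standard (DD/RR - 1) estimator for ξ(r). The function name, the O(N²) pair loop and the Poisson random catalogue are illustrative assumptions, not the tree-based PiCA code described on the following slides.

```python
import numpy as np

def two_point_xi(data, randoms, r_edges):
    """Brute-force (DD/RR - 1) estimate of the 2-point function xi(r).

    data, randoms : (N, 3) arrays of positions; randoms is a Poisson catalogue
    r_edges       : separation bin edges
    """
    def pair_counts(points):
        # all unique pair separations, histogrammed into the r bins
        diff = points[:, None, :] - points[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))
        iu = np.triu_indices(len(points), k=1)
        return np.histogram(d[iu], bins=r_edges)[0].astype(float)

    dd, rr = pair_counts(data), pair_counts(randoms)
    nd, nr = len(data), len(randoms)
    dd /= nd * (nd - 1) / 2.0        # normalise per available pair
    rr /= nr * (nr - 1) / 2.0
    return dd / rr - 1.0             # xi(r) in each bin

# usage (hypothetical inputs): xi = two_point_xi(galaxies, poisson_randoms,
#                                                np.linspace(1.0, 50.0, 21))
```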

  12. Motivation for the N-point functions: a measure of the topology of large-scale structure in the universe. Same 2-pt, very different 3-pt.

  13. Multi-resolutional KD-trees • Scale to n-dimensions (although for very high dimensions use new tree structures) • Use Cached Representation (store at each node summary sufficient statistics). Compute counts from these statistics • Prune the tree which is stored in memory! (Moore et al. 2001 astro-ph/0012333) • Exact answers as it is all-pairs • Many applications; suite of algorithms!
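
A minimal sketch of the cached-representation idea on this slide, assuming that the "summary sufficient statistics" stored at each node are just the point count and bounding box; the class and field names are mine, not those of the Moore et al. code.

```python
import numpy as np

class KDNode:
    """One node of a multi-resolution kd-tree.  Each node caches summary
    statistics (point count and bounding box) so that whole subtrees can be
    counted or pruned without touching the individual points."""

    def __init__(self, points, leaf_size=32, depth=0):
        self.n = len(points)                     # cached count
        self.lo = points.min(axis=0)             # bounding-box corners
        self.hi = points.max(axis=0)
        self.left = self.right = None
        self.points = None
        if self.n <= leaf_size:
            self.points = points                 # leaves keep the raw points
        else:
            dim = depth % points.shape[1]        # cycle through dimensions
            order = np.argsort(points[:, dim])
            half = self.n // 2
            self.left = KDNode(points[order[:half]], leaf_size, depth + 1)
            self.right = KDNode(points[order[half:]], leaf_size, depth + 1)
```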

  14. (Figure: the kd-tree partition of the data at the top level, 1st level, 2nd level and 5th level)

  15. A correlation count is just a set of range searches: prune cells that lie entirely outside the range, and also prune cells that lie entirely inside it (counting them wholesale), for an even greater saving in time.

  16. Dual Tree Algorithm Pair counts are usually binned into annuli rmin < r < rmax. For each bin, traverse both trees together and compare the bin to the minimum and maximum possible separations, dmin and dmax, between a node holding N1 points and a node holding N2 points: • No count if dmin > rmax or dmax < rmin (the node pair lies entirely outside the annulus) • Add all N1 x N2 pairs if dmin ≥ rmin and dmax ≤ rmax (the node pair lies entirely inside the annulus) Only node pairs that cut the annulus boundaries need to be opened further. The same scheme scales to the n-point functions and can compute all r bins at once.
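
A sketch of the pruning rules just described, assuming kd-tree nodes carrying the cached fields from the earlier sketch (.n, .lo, .hi, .left, .right, and .points at leaves); this illustrates the idea rather than reproducing the Moore et al. implementation.

```python
import numpy as np

def node_dist_bounds(a, b):
    """Min and max possible separation between any point in box a and any
    point in box b (boxes given by the .lo/.hi corners)."""
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    return np.sqrt((gap ** 2).sum()), np.sqrt((span ** 2).sum())

def dual_tree_count(a, b, rmin, rmax):
    """Count cross pairs (one point from subtree a, one from b) with
    separation in [rmin, rmax], pruning node pairs that lie wholly inside
    or wholly outside the annulus."""
    dmin, dmax = node_dist_bounds(a, b)
    if dmin > rmax or dmax < rmin:
        return 0                      # prune: no pair can fall in the bin
    if dmin >= rmin and dmax <= rmax:
        return a.n * b.n              # prune: every pair falls in the bin
    if a.points is not None and b.points is not None:
        d = np.sqrt(((a.points[:, None, :] - b.points[None, :, :]) ** 2).sum(-1))
        return int(((d >= rmin) & (d <= rmax)).sum())
    # otherwise recurse, opening the larger (or only internal) node
    if a.points is None and (b.points is not None or a.n >= b.n):
        return (dual_tree_count(a.left, b, rmin, rmax)
                + dual_tree_count(a.right, b, rmin, rmax))
    return (dual_tree_count(a, b.left, rmin, rmax)
            + dual_tree_count(a, b.right, rmin, rmax))
```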

  17. (Figure: timings for the 2-point function as a function of bin size and survey sample size)

  18. 3pt Correlation Function Results Renewed interest in correlation functions because of the halo model (Neyman & Scott; Takada & Jain 2002). Also constraints on Gaussianity and biasing: n(N|M), ρ(r), perturbation theory.

  19. Errors on the N-point functions? • The formal error on the 2-pt function depends on the higher-order correlation functions (Peebles 1980; Bernstein 1993; Szapudi et al. 2000) • Variance due to finite numbers • Boundary effects • Sample bias People presently use jack-knife errors, but these have not been validated because the codes have been too slow to do so.
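
For reference, the delete-one jack-knife typically used looks something like the sketch below; splitting the survey into sub-regions and the form of the statistic callback are assumptions about how one would apply it, not details from the slide.

```python
import numpy as np

def jackknife_error(regions, statistic):
    """Delete-one jack-knife error on a statistic.

    regions   : list of per-region measurements (e.g. pair counts from
                Nsub spatial sub-regions of the survey)
    statistic : function mapping a list of regions to the estimate
    """
    n = len(regions)
    full = statistic(regions)
    deletions = np.array([statistic(regions[:i] + regions[i + 1:])
                          for i in range(n)])
    # jack-knife variance: (n-1)/n * sum of squared deviations
    var = (n - 1) / n * ((deletions - deletions.mean()) ** 2).sum()
    return full, np.sqrt(var)
```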

  20. Next steps for the N-point functions The projected (on the sky) 3-pt function is insensitive to peculiar velocities (redshift-space distortions).

  21. Faster codes How does one compute the 4-pt function for a billion galaxies? We need to accept a regime of approximate answers. The tree provides a new form of stratification for Monte Carlo variance-reduction techniques: build conditional probability distributions for the counts and return these probabilities as an approximate answer rather than the exact count (Alex Gray 2003). FFT techniques can also be used, but they suffer severe boundary effects.

  22. Cross Correlations: ISW Effect The differential gravitational redshift of photons as they pass through an evolving gravitational potential. Direct probe of “Dark Energy” Courtesy of Wayne Hu

  23. (Figure: SDSS and WMAP sky maps)

  24. Five million pixels: the tree code gives answers in less than 600 seconds • Errors from jack-knife and MC realisations (covariances differed by up to 50%) • Use of the FDR (False Discovery Rate) statistic to test the null hypothesis (ideal for highly correlated datasets) [Chris Genovese’s talk] • “Detection” is only 95% (but depends on your test and errors!)
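
The FDR procedure referred to here is, in its simplest form, the Benjamini-Hochberg step-up rule. A minimal sketch of that rule follows (the plain version for independent tests, ignoring the modifications one might want for strongly correlated pixels):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: return a boolean mask of the
    hypotheses rejected while controlling the false discovery rate at alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    # find the largest k with p_(k) <= (k/m) * alpha
    below = sorted_p <= (np.arange(1, m + 1) / m) * alpha
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True   # reject the k+1 smallest p-values
    return reject
```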

  25. Mark Correlation Functions A powerful statistic for comparing theory and observations; unaffected by edge effects. k(m1, m2) = <m1 m2>(r) / <m>²
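
The formula on this slide translates directly into a brute-force estimator in the spirit of the earlier pair-count sketch; again the function name and binning scheme are illustrative assumptions.

```python
import numpy as np

def mark_correlation(pos, marks, r_edges):
    """Mark correlation k(r) = <m_i m_j>(r) / <m>^2: the mean product of
    marks over pairs at separation r, normalised by the squared mean mark
    (brute force, suitable only for small samples)."""
    diff = pos[:, None, :] - pos[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    prod = marks[:, None] * marks[None, :]
    iu = np.triu_indices(len(pos), k=1)          # unique pairs only
    d, prod = d[iu], prod[iu]
    which = np.digitize(d, r_edges) - 1          # bin index per pair
    k = np.full(len(r_edges) - 1, np.nan)
    for b in range(len(r_edges) - 1):
        in_bin = which == b
        if in_bin.any():
            k[b] = prod[in_bin].mean() / marks.mean() ** 2
    return k
```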

  26. c) Density Estimation This is important as the distribution of dark matter is a fundamental goal in cosmology, and it certainly drives many of the physical phenomena we see today. Under the assumption that “light traces mass”, the density of galaxies is a measure of the local density of dark matter.

  27. How does one measure the local density of galaxies? Especially in a hierarchical universe, and with redshift-space effects.

  28. Kth Nearest Neighbor Distance An estimator with the same signal-to-noise everywhere, yet adaptive. It is really hard to do this in velocity (redshift) space, where each galaxy is smeared into a “sausage” of ~1000 km/s along the line of sight.
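
A minimal sketch of a k-th nearest neighbour density estimate; the choice k = 5 and the use of SciPy's cKDTree are assumptions, not the estimator actually used in the talk.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(points, k=5):
    """k-th nearest neighbour density estimate in d dimensions:
    rho_i ~ k / (N * V_d * r_k^d), where r_k is the distance from point i
    to its k-th neighbour and V_d is the volume of the unit d-ball."""
    points = np.asarray(points)
    n, d = points.shape
    tree = cKDTree(points)
    # query k+1 neighbours: the nearest "neighbour" is the point itself
    dist, _ = tree.query(points, k=k + 1)
    r_k = dist[:, -1]
    v_d = np.pi ** (d / 2) / gamma(d / 2 + 1)
    return k / (n * v_d * r_k ** d)
```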

  29. KDE Use cross-validation to pick the bandwidth. This is really slow for large datasets! Use kd-trees again (see Gray & Moore 2003): able to do the cross-validation in seconds for 100,000 data points.
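
Likelihood cross-validation for the bandwidth can be sketched as below for a 1-D Gaussian kernel. This is the naive O(N²) version that the kd-tree algorithms of Gray & Moore accelerate; the candidate bandwidth grid in the usage line is an assumption.

```python
import numpy as np

def loo_log_likelihood(x, h):
    """Leave-one-out log-likelihood of a 1-D Gaussian KDE with bandwidth h."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    diff = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * diff ** 2) / (h * np.sqrt(2 * np.pi))
    np.fill_diagonal(k, 0.0)                 # leave each point out
    dens = k.sum(axis=1) / (n - 1)
    return np.log(dens).sum()

def pick_bandwidth(x, candidates):
    """Return the candidate bandwidth maximising the leave-one-out likelihood."""
    scores = [loo_log_likelihood(x, h) for h in candidates]
    return candidates[int(np.argmax(scores))]

# usage (hypothetical data): h = pick_bandwidth(velocities, np.logspace(-2, 1, 30))
```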

  30. S5 is good in high-density regions, bad in low-density regions; r1.1 is good in low-density regions, bad in high-density regions. Use a combination of these, or a combination of fixed bandwidths? But how does one choose? An adaptive approach?

  31. Fast Mixture Models • Describe the data in N dimensions as a mixture of, say, Gaussians • The parameters of the model are then N Gaussians, each with a mean and covariance • Iterate, testing with BIC and AIC at each iteration. Fast because of kd-trees (20 mins for 100,000 points on a PC!) • Employ a heuristic splitting algorithm as well • Details in Connolly et al. 2000 (astro-ph/0008187)
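
The iterate-and-score loop described here can be sketched with scikit-learn's GaussianMixture as a modern stand-in for the kd-tree-accelerated EM of Connolly et al.; the component range, BIC-only scoring and the colour/magnitude input in the usage line are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_mixture_bic(data, max_components=10):
    """Fit Gaussian mixtures with 1..max_components components via EM and
    keep the model with the lowest BIC."""
    best, best_bic = None, np.inf
    for n in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=n, covariance_type="full").fit(data)
        bic = gmm.bic(data)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best, best_bic

# usage (hypothetical inputs): model, bic = fit_mixture_bic(
#     np.column_stack([colors, magnitudes]))
```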

  32. EM-Based Gaussian Mixture Clustering

  33. Summary • Era of New Cosmology: massive data sources and a search for subtle features & high-precision measurements • Need new methods that scale into these new regimes; “a virtual universe” (students will need different skills). Perfect synergy between Stats, CS and Physics • Good algorithms are worth as much as faster and more numerous computers! • The “glue” needed to make a “virtual observatory” is hard and complex. Don’t underestimate the job
