Where The Rubber Meets the Sky: Giving Access to Science Data

An overview of eScience: data exploration that unifies theory, experiment, and simulation through data management and statistics. The talk covers the TerraServer, a web service operated by Microsoft that serves USGS imagery, and argues that Hungary could lead in informatics and in putting scientific data online. It discusses the challenges of managing petabyte-scale datasets, the need for efficient query and visualization tools, and the "Smart Data" approach of unifying data access and analysis inside the database.

Presentation Transcript


  1. Where The Rubber Meets the Sky: Giving Access to Science Data • Jim Gray, Microsoft Research, Gray@Microsoft.com, Http://research.Microsoft.com/~Gray • Alex Szalay, Johns Hopkins University, Szalay@JHU.edu

  2. Outline • Want to build a TerraServer for Hungary? • My view of eScience

  3. TerraServer / TerraService: http://terraService.Net/ and http://TerraServer-USA.com/ • USGS photo of US • Online since June 1998 • Operated by Microsoft • 20 TB data source • 10 M web hits/day • A web service • Our laboratory • I recommend you clone it for Hungary • 100x less data (92k km²), very useful • Education, land management, science • Info framework.

  4. TerraServer – Today – LOW TCO • Storage Bricks • “Commodity servers” • 4 TB raw / 2 TB RAID1 SATA storage • Dual 2 GHz + 4 GB RAM • Bunch • 3 Bricks = TerraServer data • Data partitioned • Low Cost Availability: Pair & Spare • RAID1 Mirroring • Mirrored Bunches • Spare Brick • Web Application • Load balances mirrors • Uses surviving database on failure [Diagram label: KVM / IP]
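
The last two bullets (load balancing across mirrors, falling back to the surviving database) can be sketched as follows. This is a minimal illustration, not TerraServer's actual code: the class name `MirroredBunch`, the replica names, and the round-robin policy are all assumptions made for the example.

```python
import itertools


class MirroredBunch:
    """Hypothetical model of one mirrored pair of database replicas."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.healthy = set(self.replicas)
        # Round-robin is an assumed policy; the slide only says "load balances".
        self._cycle = itertools.cycle(self.replicas)

    def mark_down(self, name):
        """Record that a replica has failed."""
        self.healthy.discard(name)

    def mark_up(self, name):
        """Record that a known replica is back in service."""
        if name in self.replicas:
            self.healthy.add(name)

    def pick(self):
        """Return the next healthy replica, skipping failed ones."""
        for _ in range(len(self.replicas)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no surviving replica")
```

With both replicas up, successive `pick()` calls alternate between them; after `mark_down` on one, every call returns the survivor, which is the "pair" half of the pair-and-spare scheme.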

  5. Outline • Want to build a TerraServer for Hungary? • My view of eScience

  6. New Science Paradigms • Thousand years ago: science was empirical, describing natural phenomena • Last few hundred years: a theoretical branch using models, generalizations • Last few decades: a computational branch simulating complex phenomena • Today: data exploration (eScience) unifies theory, experiment, and simulation using data management and statistics • Data captured by instruments or generated by simulator • Processed by software • Scientist analyzes database / files

  7. Information Avalanche and eScience • In science, industry, government, … • Better observational instruments and • Better simulations are producing a data avalanche • New emphasis on informatics: • Capturing, Organizing, Summarizing, Analyzing, Visualizing • Each science is objectifying itself • Defining core concepts • Integrating all data and literature online • Hungary could be a leader in this (you have the Martians – great tech education) [Images: BaBar, Stanford; P&E Gene Sequencer, from http://www.genome.uci.edu/; Space Telescope. Image courtesy C. Meneveau & A. Szalay @ JHU]

  8. The Big Picture [Diagram: Experiments & Instruments, Other Archives, Literature, and Simulations each supply facts; questions go in and answers come out] The Big Problems • Data ingest • Managing a petabyte • Common schema • How to organize it? • How to reorganize it? • How to coexist with others? • Data Query and Visualization tools • Support/training • Performance • Execute queries in a minute • Batch (big) query scheduling

  9. The Virtual Observatory • Premise: most data is (or could be) online • The Internet is the world’s best telescope: • It has data on every part of the sky • In every measured spectral band: optical, x-ray, radio, … • As deep as the best instruments (2 years ago) • It is up when you are up • The “seeing” is always great • It’s a smart telescope: links objects and data to literature • Software is the capital expense • Share, standardize, reuse, …

  10. What X-info Needs from us (CS) (not drawn to scale) [Diagram: Scientists: science data & questions • Miners: data mining algorithms • Plumbers: database to store data and execute queries • Tools: question & answer, visualization]

  11. Data Access Hitting a Wall • Current science practice based on data download (FTP/GREP) will not scale to the datasets of tomorrow • You can GREP 1 MB in a second • You can GREP 1 GB in a minute • You can GREP 1 TB in 2 days • You can GREP 1 PB in 3 years • You can FTP 1 MB in 1 sec • You can FTP 1 GB / min (~1$) • … 2 days and 1K$ • … 3 years and 1M$ • Oh!, and 1 PB ~5,000 disks • At some point you need indices to limit search, and parallel data search and analysis • This is where databases can help
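
The arithmetic behind these scan times can be checked with a short sketch. The 10 MB/s sequential scan rate below is an assumption chosen so the petabyte case lands near the slide's "3 years"; the slide's own figures imply somewhat different rates at each size. The 5,000-way split simply reuses the slide's disk count to show what parallel search buys.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

# Assumed sustained sequential scan rate (not a figure from the talk).
ASSUMED_SCAN_RATE = 10e6  # bytes per second

SIZES = {"1 MB": 1e6, "1 GB": 1e9, "1 TB": 1e12, "1 PB": 1e15}


def scan_time_seconds(num_bytes, bytes_per_second=ASSUMED_SCAN_RATE, workers=1):
    """Time to read every byte once, split evenly across `workers` scanners."""
    return num_bytes / (bytes_per_second * workers)


def report():
    """Return (label, serial seconds, seconds with 5,000 parallel scanners)."""
    return [
        (label, scan_time_seconds(size), scan_time_seconds(size, workers=5000))
        for label, size in SIZES.items()
    ]


if __name__ == "__main__":
    for label, serial, parallel in report():
        print(f"{label}: serial {serial:,.1f} s, 5,000-way parallel {parallel:,.3f} s")
```

Under this assumed rate a serial scan of 1 PB takes about 10^8 seconds, a little over three years, while spreading the same scan over 5,000 disks brings it down to a few hours. An index avoids most of the scan altogether, which is the slide's argument for databases.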

  12. Next-Generation Data Analysis • Looking for • Needles in haystacks – the Higgs particle • Haystacks: dark matter, dark energy, turbulence, ecosystem dynamics • Needles are easier than haystacks • Global statistics have poor scaling • Correlation functions are N², likelihood techniques N³ • As data and computers grow at Moore’s Law, we can only keep up with N log N • A way out? • Relax optimal notion (data is fuzzy, answers are approximate) • Don’t assume infinite computational resources or memory • Requires combination of statistics & computer science
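
The scaling claim can be made concrete with a small sketch. It assumes, as a simplification of the slide's "grow at Moore's Law", that dataset size and compute capacity both double in each period, and asks how wall-clock time changes for each cost model. The function names and the starting size are illustrative.

```python
import math

# Cost models named on the slide; each maps dataset size N to an operation count.
COSTS = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}


def relative_runtime(cost, n0, doublings):
    """Runtime after `doublings` periods, relative to today.

    Assumes data size and compute capacity both double every period.
    """
    n = n0 * 2 ** doublings
    speedup = 2 ** doublings
    return cost(n) / cost(n0) / speedup


if __name__ == "__main__":
    for name, cost in COSTS.items():
        print(name, relative_runtime(cost, n0=10 ** 6, doublings=10))
```

Starting from a million objects, ten doublings leave the N log N job about 1.5 times slower, while the N² job becomes roughly a thousand times slower and the N³ job roughly a million times slower. That gap is why the slide suggests relaxing optimality and accepting approximate answers.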

  13. Smart Data: Unifying DB and Analysis • There is too much data to move around • Do data manipulations at database • Build custom procedures and functions into DB • Unify data Access & Analysis • Examples • Statistical sampling and analysis • Temporal and spatial indexing • Pixel processing • Automatic parallelism • Auto (re)organize • Scalable to Petabyte datasets • “Move Mohamed to the mountain, not the mountain to Mohamed.”
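
A minimal sketch of "build custom functions into the DB", using Python's built-in `sqlite3` module as a stand-in for a science archive database. The `galaxy` table, its columns, the sample rows, and the `color_index` function are all hypothetical; the point is only that the filter runs where the data lives and a single number crosses the wire.

```python
import sqlite3


def color_index(u, g):
    """Custom analysis function to be registered inside the database."""
    return u - g


def build_demo_db():
    """Create a tiny in-memory table of made-up galaxy magnitudes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE galaxy (obj_id INTEGER PRIMARY KEY, u REAL, g REAL, r REAL)"
    )
    rows = [
        (1, 19.5, 18.2, 17.9),
        (2, 20.1, 19.8, 19.5),
        (3, 22.0, 20.5, 21.9),
        (4, 18.9, 17.6, 17.0),
    ]
    conn.executemany("INSERT INTO galaxy VALUES (?, ?, ?, ?)", rows)
    # Make the Python function callable from SQL.
    conn.create_function("color_index", 2, color_index)
    return conn


def count_red_bright(conn, min_color=1.0, max_r=21.5):
    """Count galaxies passing the cut; only the count leaves the database."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM galaxy WHERE color_index(u, g) > ? AND r < ?",
        (min_color, max_r),
    ).fetchone()
    return count
```

The alternative, fetching every row and filtering in the client, gives the same answer on four rows but is the "move the mountain" pattern the slide warns against at petabyte scale.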

  14. Experiment Budgets ¼…½ Software • Software for • Instrument scheduling • Instrument control • Data gathering • Data reduction • Database • Analysis • Visualization • Millions of lines of code • Repeated for experiment after experiment • Not much sharing or learning • Let’s work to change this • Identify generic tools • Workflow schedulers • Databases and libraries • Analysis packages • Visualizers • … • Simulation (computational science) budgets are > ½ software

  15. How to Help? • Can’t learn the discipline before you start (takes 4 years) • Can’t go native – you are a CS person, not a bio, … person • Have to learn how to communicate • Have to learn the language • Have to form a working relationship with domain expert(s) • Have to find problems that leverage your skills

  16. Working Cross-Culture: A Way to Engage With Domain Scientists • Find someone who is desperate for help • Communicate in terms of scenarios • Work on a problem that gives 100x benefit • Weeks/task vs hours/task • Solve 20% of the problem • The other 80% will take decades • Prototype • Go from working-to-working; always have • Something to show • Clear next steps • Clear goal • Avoid death-by-collaboration-meetings.

  17. Working Cross-Culture – 20 Questions: A Way to Engage With Domain Scientists • Astronomers proposed 20 questions • Typical of things they want to do • Each would require a week or more in the old way (programming in tcl / C++ / FTP) • Goal: make it easy to answer questions • This goal motivates DB and tools design

  18. The 20 Queries
     Q1: Find all galaxies without saturated pixels within 1' of a given point of ra=75.327, dec=21.023.
     Q2: Find all galaxies with blue surface brightness between 23 and 25 mag per square arcsecond, and -10<super galactic latitude (sgb)<10, and declination less than zero.
     Q3: Find all galaxies brighter than magnitude 22, where the local extinction is >0.75.
     Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity>0.5, and with the major axis of the ellipse having a declination of between 30” and 60” arc seconds.
     Q5: Find all galaxies with a de Vaucouleurs profile (r¼ falloff of intensity on disk) and the photometric colors consistent with an elliptical galaxy.
     Q6: Find galaxies that are blended with a star, output the deblended galaxy magnitudes.
     Q7: Provide a list of star-like objects that are 1% rare.
     Q8: Find all objects with unclassified spectra.
     Q9: Find quasars with a line width >2000 km/s and 2.5<redshift<2.7.
     Q10: Find galaxies with spectra that have an equivalent width in Hα >40Å (Hα is the main hydrogen spectral line).
     Q11: Find all elliptical galaxies with spectra that have an anomalous emission line.
     Q12: Create a gridded count of galaxies with u-g>1 and r<21.5 over 60<declination<70, and 200<right ascension<210, on a grid of 2’, and create a map of masks over the same grid.
     Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u-0.5g-0.2i<1.25 && r<21.75, output it in a form adequate for visualization.
     Q14: Find stars with multiple measurements that have magnitude variations >0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes.
     Q15: Provide a list of moving objects consistent with an asteroid.
     Q16: Find all objects similar to the colors of a quasar at 5.5<redshift<6.5.
     Q17: Find binary stars where at least one of them has the colors of a white dwarf.
     Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is, where the color ratios u-g, g-r, r-i are less than 0.05m.
     Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies.
     Q20: For each galaxy in the BCG data set (brightest color galaxy), in 160<right ascension<170, -25<declination<35, count the galaxies within 30" of it that have a photoz within 0.05 of that galaxy.
     Also some good queries at: http://www.sdss.jhu.edu/ScienceArchive/sxqt/sxQT/Example_Queries.html
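
To show what one of these questions asks for computationally, here is a rough sketch of the counting half of Q12 in plain Python. It is not the SkyServer implementation: the input is assumed to be a list of hypothetical `(ra, dec, u, g, r)` tuples in degrees and magnitudes, the grid is a flat 2-arcminute grid in (ra, dec) with no cos(dec) correction, and the mask map that Q12 also requests is left out.

```python
import math
from collections import Counter

CELL_DEG = 2.0 / 60.0  # 2 arcminutes expressed in degrees


def gridded_counts(galaxies, ra_range=(200.0, 210.0), dec_range=(60.0, 70.0)):
    """Count galaxies with u-g>1 and r<21.5 per grid cell inside the region.

    `galaxies` is an iterable of (ra, dec, u, g, r) tuples (assumed layout).
    Returns a Counter keyed by (column, row) cell indices.
    """
    counts = Counter()
    for ra, dec, u, g, r in galaxies:
        in_region = (
            ra_range[0] < ra < ra_range[1] and dec_range[0] < dec < dec_range[1]
        )
        if in_region and (u - g) > 1.0 and r < 21.5:
            col = math.floor((ra - ra_range[0]) / CELL_DEG)
            row = math.floor((dec - dec_range[0]) / CELL_DEG)
            counts[(col, row)] += 1
    return counts
```

In a database the same result is a single GROUP BY over computed cell indices, so only the per-cell counts leave the server, which is the design the 20 questions were meant to motivate.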

  19. http://SkyServer.sdss.org • Solves the 20 queries • Has 150 hours of online instruction • Translated to Hungarian • Professional astronomers use it as the SDSS Science Catalog Analysis Service • Clone operating in Hungary.

  20. SkyQuery (http://skyquery.net/) • Distributed query tool using a set of web services • Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England) • Has grown from 4 to 15 archives, now becoming an international standard • Allows queries like: SELECT o.objId, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5 AND AREA(181.3,-0.76,6.5) AND o.type=3 AND (o.i - t.m_j)>2

  21. SkyQuery Structure • Each SkyNode publishes • Schema Web Service • Database Web Service • Portal • Plans query (2 phase) • Integrates answers • Is itself a web service [Diagram: SkyQuery Portal and ImageCutout service, with SkyNodes 2MASS, INT, SDSS, FIRST]

  22. MyDB: eScience Workbench • Prototype of bringing analysis to the data • Everybody gets a workspace (database) • Executes analysis at the data • Store intermediate results there • Long queries run in batch • Results shared within groups • Only fetch the final results • Extremely successful – matches work patterns

  23. Summary • Computational Science • Simulation • Data Bases • Analysis (organization and mining), needed by simulations and experiments • Visualization • Each Science X • Has a comp-X branch • Is getting an X-info branch • Objectifying that science: defining terms precisely • This broadening is multi-disciplinary • Pair: good domain scientist + good computer scientist • Chemistry is important • A concrete way to approach Grid-computing.

  24. Outline • Want to build a TerraServer for Hungary? • Could be done inexpensively (if you have the data) • Microsoft would license the software to you • My view of eScience & Hungary • Hungary can’t lead in hardware • Hungary CAN lead in software • Algorithms: data mining, analysis • Tools: that implement the algorithms • Systems: learn by doing • Could start an industry, fits EU agenda.
