Point Data
Overview
• NetCDF-CF proposal
• CDM Point Feature API
• TDS
• DAPPER
• CdmRemote
CF proposal
• Encode common “discrete sampling” data collections in the NetCDF classic model / netCDF-3 format
• Proposal to CF Conventions: “Discrete Sampling Geometries”
• John Caron, Steve Hankin, Jonathan Gregory
• 25 pages, 2 years and counting
• Version 1 Real Soon Now (really)
• https://cf-pcmdi.llnl.gov/trac/
Discrete Sampling Data encoding
• Encoding variants
  • Multidimensional arrays
  • Contiguous ragged arrays
  • Indexed ragged arrays
  • Single feature in a file
• Make it easy / efficient to
  • Read a feature from a file
  • Subset the collection by space and time
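The difference between the two ragged-array variants can be sketched with plain Java arrays standing in for netCDF variables. This is only an illustration: the names `rowSize` and `parentIndex` are hypothetical and are not taken from the proposal text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RaggedArrays {
    // Contiguous ragged array: the samples of each feature are stored together,
    // and rowSize[f] (hypothetical name) says how many samples feature f owns.
    static float[] readContiguous(float[] obs, int[] rowSize, int feature) {
        int start = 0;
        for (int f = 0; f < feature; f++) {
            start += rowSize[f];              // skip the samples of earlier features
        }
        return Arrays.copyOfRange(obs, start, start + rowSize[feature]);
    }

    // Indexed ragged array: samples are stored in any order, and parentIndex[i]
    // (hypothetical name) names the feature that sample i belongs to.
    static float[] readIndexed(float[] obs, int[] parentIndex, int feature) {
        List<Float> found = new ArrayList<>();
        for (int i = 0; i < obs.length; i++) {
            if (parentIndex[i] == feature) {
                found.add(obs[i]);
            }
        }
        float[] result = new float[found.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = found.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // Three features with 2, 3, and 1 samples, stored feature by feature.
        float[] contiguousObs = {10f, 11f, 20f, 21f, 22f, 30f};
        int[] rowSize = {2, 3, 1};
        System.out.println(Arrays.toString(readContiguous(contiguousObs, rowSize, 1)));

        // The same samples stored in arrival order.
        float[] indexedObs = {10f, 20f, 11f, 30f, 21f, 22f};
        int[] parentIndex = {0, 1, 0, 2, 1, 1};
        System.out.println(Arrays.toString(readIndexed(indexedObs, parentIndex, 1)));
    }
}
```

Both calls return the three samples of feature 1. The contiguous layout reads one feature as a single slab, while the indexed layout has to scan the index but lets samples be appended as they arrive.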
Discrete Sample Feature Types
• point: a collection of data points with no connection in time and space
• timeSeries: a series of data points at the same location, with varying time
• trajectory: a series of data points along a curve in time and space
• profile: a set of data points along a vertical line
• timeSeriesProfile: a series of profiles at the same location, with varying time
• trajectoryProfile: a set of profiles which originate from points along a trajectory
CDM Implementation
• Have a working implementation of the entire proposal in CDM 4.2
• Some minor changes not yet made
• Beta: needs user testing
• Not entirely happy with the API (complexity)
• Based on “Nested Table” abstraction
• Common representations: ArrayStructure, Construct, Contiguous, LinkedList, MultidimInner, MultidimInner3D, MultidimInnerPsuedo, MultidimInnerPsuedo3D, MultidimStructure, NestedStructure, ParentId, ParentIndex, Singleton, Structure, Top
• XML configuration possible – easy to add new datasets
Table Configurer Plugins
• BUFR
• CF Conventions
• Cosmic
• GEMPAK Point
• IRIDL station (IRI/LDEO Climate Data Library)
• Jason (NASA Ocean Surface Topography Mission)
• FSL Wind Profiler
• MADIS ACARS
• MADIS surface observations
• NDBC (National Data Buoy Center)
• NCAR-RAF/nimbus
• NLDN (National Lightning Detection Network)
• Suomi-Station
• Unidata Observation Dataset Conventions
CDM architecture
[Diagram. Labels: Application; Point Feature API; Point Feature Types; Datatype Adapter; Table Configurer Plugins (BUFR, GEMPAK, CF, COSMIC, …); NetcdfDataset; CoordSystem Builder; NetcdfFile; I/O service provider (NetCDF-3, NetCDF-4, NIDS, GRIB, …)]
CDM PointFeature UML
CDM Point Feature API
    // Open the dataset as a station (time series) feature dataset
    FeatureDataset fd = FeatureDatasetFactoryManager.open(
        FeatureType.STATION, location, null, log);
    FeatureCollection fc = fd.getPointFeatureCollectionList().get(0);
    StationCollection timeSeriesCollection = (StationCollection) fc;

    // Request the subset in coordinate space: bounding box plus date range
    PointFeatureCollection points = timeSeriesCollection.flatten(
        new LatLonRect(
            new LatLonPointImpl(40.0, -105.0),
            new LatLonPointImpl(42.0, -100.0)),
        new DateRange(start, end));

    // iterate
    while (points.hasNext()) {
      ucar.nc2.ft.PointFeature pointFeature = points.next();
      Location loc = pointFeature.getLocation();
      ...
    }
Some observations
• All requests are in “coordinate space”
• Indicate what subset you want, then request “all at once”
  • Subset is virtual
  • Library can (try to) optimize the request
• “Iterators over result set” vs List or Array
  • Result set does not have to fit into memory
  • Allow streaming data (don’t wait until you have it all)
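The “iterators over result set” idea can be sketched in plain Java: the subset is described up front, and matching points are produced one at a time instead of being collected into a list. The `Obs` type and the `subset` method are hypothetical stand-ins, not CDM classes.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;

class StreamingSubset {
    // Hypothetical observation type, not a CDM class.
    static final class Obs {
        final double lat, lon, value;
        Obs(double lat, double lon, double value) {
            this.lat = lat;
            this.lon = lon;
            this.value = value;
        }
    }

    // Lazily yields only the observations inside a lat/lon box.
    // Nothing is buffered beyond the single pending match.
    static Iterator<Obs> subset(final Iterator<Obs> source,
                                final double latMin, final double latMax,
                                final double lonMin, final double lonMax) {
        return new Iterator<Obs>() {
            private Obs pending;   // next match, fetched on demand

            @Override
            public boolean hasNext() {
                while (pending == null && source.hasNext()) {
                    Obs o = source.next();
                    if (o.lat >= latMin && o.lat <= latMax
                            && o.lon >= lonMin && o.lon <= lonMax) {
                        pending = o;
                    }
                }
                return pending != null;
            }

            @Override
            public Obs next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                Obs o = pending;
                pending = null;
                return o;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<Obs> all = Arrays.asList(
                new Obs(41.0, -103.0, 12.5),
                new Obs(35.0, -90.0, 18.0),
                new Obs(40.5, -101.0, 11.0)).iterator();
        Iterator<Obs> inBox = subset(all, 40.0, 42.0, -105.0, -100.0);
        while (inBox.hasNext()) {
            System.out.println(inBox.next().value);
        }
    }
}
```

Because the caller only ever holds one observation, the source could just as well be a network stream that is still arriving.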
Index vs Coordinate Request
    float temp(station, time);
    float data = temp.read(234, 23);
    http://server/Metar.dods?temp[234][23]
vs
    http://server/ncss/Metar.nc?var=temp&time=2008-10-28T12:00:00Z&station=KMDR
    SELECT temp FROM metar WHERE metar.time = '2008-10-28T12:00:00Z' AND metar.station = 'KMDR'
    Array data = temp.read("2008-10-28T12:00:00Z", "KMDR");
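A small sketch of the contrast, assuming an in-memory table in place of a real dataset: with an index request the caller must already know the positions, while with a coordinate request the caller names values and the lookup stays behind the API. The class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class CoordinateRequest {
    final List<String> stations;   // coordinate values along the station dimension
    final List<String> times;      // coordinate values along the time dimension
    final float[][] temp;          // temp[station][time]

    CoordinateRequest(List<String> stations, List<String> times, float[][] temp) {
        this.stations = stations;
        this.times = times;
        this.temp = temp;
    }

    // Index request: the caller supplies positions in the arrays.
    float readByIndex(int stationIdx, int timeIdx) {
        return temp[stationIdx][timeIdx];
    }

    // Coordinate request: the caller supplies coordinate values,
    // and the translation to indices happens here.
    float readByCoordinate(String time, String station) {
        int s = stations.indexOf(station);
        int t = times.indexOf(time);
        if (s < 0 || t < 0) {
            throw new IllegalArgumentException(
                    "no data for station " + station + " at " + time);
        }
        return temp[s][t];
    }

    public static void main(String[] args) {
        CoordinateRequest metar = new CoordinateRequest(
                Arrays.asList("KDEN", "KMDR"),
                Arrays.asList("2008-10-28T11:00:00Z", "2008-10-28T12:00:00Z"),
                new float[][] {{5.0f, 6.5f}, {7.0f, 8.5f}});
        System.out.println(metar.readByIndex(1, 1));
        System.out.println(metar.readByCoordinate("2008-10-28T12:00:00Z", "KMDR"));
    }
}
```

Both calls return the same value, but only the second keeps working if the server reorders or extends its arrays.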
Multidim vs Ragged
    float temp(sample);
    int station_index(sample);

    for (int i=0; i<sample.len; i++) {
      if (station_index(i) == KMDR_index)
        data = read("http://server/Metar.dods?temp[i]");   // BAD
    }

    for (int i=0; i<sample.len; i++) {
      if (station_index(i) == KMDR_index)
        indexList.add(i);
    }
    http://server/Metar.dods?temp[2,3,5,78,90,123,456,789]   // BETTER
vs
    http://server/ncss/Metar.nc?var=temp&time=all&station=KMDR   // MO BETTA
Indexed access considered harmful for rolling data archives
• Cannot deal with constantly changing dataset
• This is a contract with the application
• When can you break it?
• Difficult to reconcile with HTTP/OPeNDAP as a stateless protocol
• TDS is broken (Shhhh…)
    float temp(sample=238743874);
    float time(sample=238743874);
      :units = "secs since 2008-10-28T12:00:00Z";
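The contract problem can be illustrated with a toy rolling archive, assuming a fixed capacity where the oldest sample is dropped on append. `RollingArchive` is a hypothetical class written for this sketch; it is not part of the TDS.

```java
import java.util.ArrayList;
import java.util.List;

class RollingArchive {
    private final List<Long> times = new ArrayList<>();    // seconds since some epoch
    private final List<Float> temps = new ArrayList<>();
    private final int capacity;

    RollingArchive(int capacity) {
        this.capacity = capacity;
    }

    void append(long time, float temp) {
        if (times.size() == capacity) {   // archive is full: drop the oldest sample
            times.remove(0);
            temps.remove(0);
        }
        times.add(time);
        temps.add(temp);
    }

    // Index request: the meaning of i changes whenever the archive rolls.
    float readByIndex(int i) {
        return temps.get(i);
    }

    // Coordinate request: returns the sample for that time, or null once it has aged out.
    Float readByTime(long time) {
        int i = times.indexOf(time);
        return i < 0 ? null : temps.get(i);
    }

    public static void main(String[] args) {
        RollingArchive archive = new RollingArchive(3);
        archive.append(0L, 10f);
        archive.append(60L, 11f);
        archive.append(120L, 12f);
        System.out.println(archive.readByIndex(1));   // sample taken at t=60
        archive.append(180L, 13f);                    // t=0 is dropped, everything shifts
        System.out.println(archive.readByIndex(1));   // now the sample taken at t=120
        System.out.println(archive.readByTime(60L));  // still the sample taken at t=60
    }
}
```

An index remembered by a client silently points at different data after the roll, while the time coordinate either finds the same sample or reports that it is gone.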
Where are we going?
• Indexed access ok for local, static, “small” datasets
• Need new data access paradigm for large, changing, remote dataset collections
  • Requests in coordinate space
  • Specify entire subset at once – aka “set at a time”
  • Allow parallelism, optimization
THREDDS Data Server
• Forecast Model Run Collection (2D time)
• Create a set of 1D Grid datasets
• Place in the TDS Configuration catalog:
    <featureCollection featureType="FMRC" path="fmrc/NCEP/GFS/CONUS_80km">
      <collection spec="/data/NCEP/GFS_CONUS_80km_#yyyyMMdd_HHmm#.grib1"/>
      <update startup="true" rescan="0 5 3 * * ? *" trigger="allow"/>
      <protoDataset choice="Penultimate" change="0 2 3 * * ? *"/>
      <fmrcConfig datasetTypes="Best Files Runs ConstantForecasts"/>
    </featureCollection>
TDS Point Feature Collection
• Scheduled for TDS 4.3
• Configuration:
    <featureCollection featureType="STATION" path="nws/metar/ncdecoded">
      <collection spec="/data/metar/Surface_METAR_#yyyyMMdd_HHmm#.nc$"/>
      <update startup="true" rescan="0 5 3 * * ? *" trigger="allow"/>
      <protoDataset choice="Penultimate" change="0 2 3 * * ? *"/>
    </featureCollection>
• What services to expose?
  • Not indexed data access
  • Hook into Point Feature API on client
DAPPER TimeSeries
    Dataset {
        Sequence {
            Float32 lat;
            Float32 lon;
            Float32 elev;
            Int32 _id;
            Sequence {
                Float32 visibility;
                Float32 max_wind_gust;
                Float32 dewp;
                Float64 time;
                Float32 slp;
                Float32 temp;
                Float32 wind_speed;
                Float32 max_temp;
                Float32 max_sustained_wind_speed;
                Float32 min_temp;
                Float32 precip;
            } time_series;
        } location;
        ...
    } gsod_time_series;
DAPPER Conventions
• Two-level (nested) DAP 2 Sequences
• Ragged arrays with coordinate subsetting
• Handles timeSeries and profile FeatureTypes
• Requires a unique id for each feature
• Requires lat / lon / z / time coordinates
  • Handles longitude wrapping? (yes)
• Data variables only in inner sequence, must be floats
• Handling of constraint expressions (CE) on data variables not required
  • OPeNDAP spec requires it
  • How does the client know whether a CE is allowed?
DAPPER Conventions - Analysis
• Likely easy to hook up to CDM Station/Profile Feature Collection API
• Needs to be generalized / clarified
  • to handle arbitrary datasets
  • to support the other Point Feature Types
  • Not sure who would be in charge of the standard?
• Result set has fixed layout – makes streaming hard
• Not accessible through NetCDF API – who are the clients?
• DAP 2 can't transport NetCDF-4 / CDM data model
  • Shared dimensions, groups, enums, longs, chars, etc.
  • DAP 4, where are you?
• Doesn’t have a general coordinate system mechanism
NetCDF Subset Service (4.0)
• Experiment with REST-style web service
• Allows subsetting the dataset by:
  • Lat/lon bounding box
  • Time and vertical coordinate range
  • List of variables
• Gridded data
  • Output is a NetCDF-CF file
  • Variation of WCS (simplified request protocol)
• Grid as point datasets
  • Extract vertical profile, time series from one point in model data
  • Output: NetCDF-CF, XML, CSV
• Tried to use for point datasets
  • NetCDF can't be streamed
  • Quite slow for large data collections
ncstream (4.1)
• NetCDF files (almost always) have to be written, then copied to network
  • Assumes random access, not stream
  • “Read-optimized”: data layout is known
• ncstream explores what “streaming netCDF” might look like
  • “Write-optimized”: append only
  • Efficient conversion to netCDF files on the client
  • ncstream data model == CDM data model
• Binary encoding using Google's Protobuf
  • Binary object serialization, cross-language, transport neutral, extensible
  • Very fast: some tests show >10x OPeNDAP
• Have experimental versions in CDM and TDS since 4.1
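The “write-optimized, append only” idea can be sketched with length-prefixed messages on a plain byte stream. This is not the ncstream wire format: ncstream encodes its messages with Protobuf, whereas this sketch assumes a simple 4-byte length prefix and UTF-8 text payloads purely for illustration, and all names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class AppendOnlyStream {
    // Writer side: each message is appended as [int length][payload bytes].
    // Nothing already written is ever revisited, so no random access is needed.
    static void writeMessage(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Reader side: consume messages in arrival order until the stream ends.
    static List<byte[]> readAll(DataInputStream in) throws IOException {
        List<byte[]> messages = new ArrayList<>();
        while (true) {
            int length;
            try {
                length = in.readInt();
            } catch (EOFException endOfStream) {
                return messages;          // clean end: no further length prefix
            }
            byte[] payload = new byte[length];
            in.readFully(payload);
            messages.add(payload);
        }
    }

    // Writes the given strings to an in-memory stream and reads them back.
    static List<String> roundTrip(String... texts) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            for (String text : texts) {
                writeMessage(out, text.getBytes(StandardCharsets.UTF_8));
            }
            out.flush();
            List<String> result = new ArrayList<>();
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()));
            for (byte[] payload : readAll(in)) {
                result.add(new String(payload, StandardCharsets.UTF_8));
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("header", "data chunk 1", "data chunk 2"));
    }
}
```

The reader can start processing the first message before the writer has produced the last one, which is the property a file format with a known, fixed layout does not offer.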
CdmRemote web services (4.2)
• Follow-on to the NetCDF Subset Service
• Point Feature datasets
• Use ncstream for the over-the-wire (OTW) protocol
• In CDM, TDS since version 4.2
• Need to add FeatureCollection configuration
Accessing Point Feature Collections
[Diagram. Labels: Java Client (Application, CDM Point Feature API); C Client (Application, CDM Point Feature API); cdmRemote; TDS (CDM Point Feature API, Coordinate Systems, Data Access, Data)]
Possibility: CdmRemote Server
• Lightweight server for CDM datasets
  • Zero configuration
  • Local filesystem
  • Cache expensive objects
• Java and C clients
• Allow non-Java applications access to CDM stack
  • Coordinate space queries
  • Virtual datasets
  • Feature Types
C library – enable other languages
[Diagram. Labels: Python / ?; C Client (Application, CDM Point Feature API); cdmRemote; cdmRemote Server (CDM Point Feature API, Coordinate Systems, Data Access, Data)]
Summary
• Discrete Sampling CF Conventions almost ready
• CDM Point Feature API ready for testing
• TDS Point Feature Collections almost ready
  • Using cdmRemote / ncstream
  • Needs catalog configuration mechanism
• Need new APIs: what should they be?
  • What are the clients?
• Unidata is evaluating new APIs in C using ncstream as IPC to Java services in another process
• May add Python to our list, as resources permit
• Open to other solutions, as resources permit
PS: Can you say SQL?
• Jim Gray: “Scientific Data Management in the Coming Decade”
• Michael Stonebraker: SciDB (http://scidb.org/)
  • Array-oriented data model – extends relational tables
  • Usable release in Jan 2011
  • Evaluating participation in this effort
• New Data Access APIs:
  • Requests in coordinate space
  • Specify entire subset at once – aka “set at a time”
  • Allow parallelism, optimization
THREDDS/CDM Developers Conference
• Invitational
• Show related work
• Launch Open Source project
  • Steering Committee
  • Broader than Unidata
• Fall 2011 (?)
  • FOSS4G conference in Denver (Sep 12)
  • OGC Technical Committee / Boulder (Sep 19)