
Physical Data Storage


Presentation Transcript


  1. Physical Data Storage Stephen Dawson-Haggerty

  2. [Architecture diagram] sMAP Data Sources → sMAP → storage layer (Hadoop, StreamFS, HDFS) → sMAP → Applications • Data exploration/visualization • Control loops • Demand response • Analytics • Mobile feedback • Fault detection

  3. Time-Series Databases • Expected workload • Related work • Server architecture • API • Performance • Future directions

  4. Write Workload • sMAP sources: HTTP/REST protocol for exposing physical information • Data trickles in as it's generated • Typical data rates: 1 reading per 1-60 s • Bulk imports: existing databases, migrations [diagram: Dent circuit meter → sMAP]
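  As an illustration of how a reading trickles in over HTTP/REST, the sketch below POSTs one timestamped value. The URL path and JSON field names are placeholders chosen for the example, not the exact sMAP message schema.

    # Illustrative only: endpoint and JSON layout are placeholders,
    # not the actual sMAP schema.
    import json
    import time
    import urllib.request

    reading = {
        "uuid": "example-stream-0001",          # hypothetical stream identifier
        "Readings": [[int(time.time() * 1000),  # timestamp in milliseconds
                      228.4]],                  # measured value
    }

    req = urllib.request.Request(
        "http://gateway.example.org/data/meter0",   # placeholder collection URL
        data=json.dumps(reading).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # A real source would batch and retry; the point is that writes arrive
    # as a steady trickle of small records, roughly one per stream every 1-60 s.
    with urllib.request.urlopen(req) as resp:
        resp.read()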

  5. Read Workload • Plotting engine • MATLAB & Python adaptors for analysis • Mobile apps • Batch analysis • Dominated by range queries • Latency is important for interactive data exploration

  6. [Server architecture diagram] readingdb • RPC: query, aggregate, resample, streaming pipeline, insert • Time-series Interface: bucketing, compression, storage mapper • Key-Value Store: page cache, lock manager, storage alloc. • SQL: MySQL

  7. Time series interface • All data is part of a stream, identified only by streamid • A stream is a series of tuples: • (timestamp, sequence, value, min, max)
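  A minimal sketch of the stream model on this slide, using the tuple layout given above; the container names are illustrative, not the server's internal types.

    from collections import namedtuple

    # One record in a stream, with the fields from the slide.
    Reading = namedtuple("Reading", ["timestamp", "sequence", "value", "min", "max"])

    # Streams are identified only by an integer streamid; a dict of lists
    # stands in for the real store.
    streams = {}

    def insert(streamid, reading):
        """Append a Reading to the stream identified by streamid."""
        streams.setdefault(streamid, []).append(reading)

    insert(42, Reading(timestamp=1300000000, sequence=0,
                       value=230.1, min=229.8, max=230.4))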

  8. Storage Manager: BDB • Berkeley Database: embedded key-value store • Store binary blobs using B+ trees • Very mature: around since 1992, supports transactions, free-threading, replication • We use version 4
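  A minimal sketch of the same idea from Python via the bsddb3 binding (the server itself uses the C library directly); the file name and key layout are assumptions for illustration. Big-endian packed keys make byte order match numeric order, so B+ tree range scans come back in (streamid, timestamp) order.

    import struct
    from bsddb3 import db   # Python binding to Berkeley DB

    KEY = struct.Struct(">II")   # (streamid, bucket timestamp), big-endian

    store = db.DB()
    store.set_get_returns_none(2)   # cursor methods return None at end of data
    store.open("readings.bdb", None, db.DB_BTREE, db.DB_CREATE)

    def put_bucket(streamid, bucket_ts, blob):
        """Store one bucket of readings as an opaque binary value."""
        store.put(KEY.pack(streamid, bucket_ts), blob)

    def scan(streamid, start_ts, end_ts):
        """Range query: walks adjacent B+ tree pages, mostly sequential I/O."""
        cur = store.cursor()
        rec = cur.set_range(KEY.pack(streamid, start_ts))   # first key >= start
        while rec:
            sid, ts = KEY.unpack(rec[0])
            if sid != streamid or ts > end_ts:
                break
            yield ts, rec[1]
            rec = cur.next()
        cur.close()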

  9. RPC Evolution • First: shared memory • Low latency • Move to threaded TCP • Google protocol buffers • zig-zag integer representation, multiple language bindings • Extensible for multiple versions
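  The zig-zag representation mentioned above interleaves negative and positive values so that small magnitudes stay small after varint encoding. A sketch of the mapping itself (not the generated protocol buffer code):

    def zigzag_encode(n):
        """Map signed -> unsigned: 0,-1,1,-2,2,... -> 0,1,2,3,4,...
        Small magnitudes (common in delta-encoded data) become small
        unsigned values, which varint-encode into few bytes."""
        return (n << 1) ^ (n >> 63)      # 64-bit signed input assumed

    def zigzag_decode(z):
        """Inverse mapping: unsigned -> signed."""
        return (z >> 1) ^ -(z & 1)

    assert zigzag_encode(0) == 0
    assert zigzag_encode(-1) == 1
    assert zigzag_encode(1) == 2
    assert all(zigzag_decode(zigzag_encode(n)) == n for n in range(-1000, 1000))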

  10. On-Disk Format • All data stores perform poorly with one key per reading: index size is high and unnecessary • Solution: bucket readings under a (streamid, timestamp) key • Excellent locality of reference with B+ tree indexes • Data sorted by streamid and timestamp • Range queries translate into mostly large sequential I/Os [diagram: (streamid, timestamp) → bucket]
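  A sketch of the bucketing idea, assuming a fixed one-hour bucket width (readingdb's actual bucket sizing may differ):

    import struct

    BUCKET_SECONDS = 3600          # assumed bucket width for illustration
    KEY = struct.Struct(">II")     # (streamid, bucket start), big-endian so
                                   # byte order == numeric order in the B+ tree

    def bucket_key(streamid, timestamp):
        """Many readings share one key: every reading in the same hour of
        the same stream lands in the same bucket."""
        return KEY.pack(streamid, timestamp - timestamp % BUCKET_SECONDS)

    def bucket_keys_for_range(streamid, start, end):
        """A range query touches a short, contiguous run of bucket keys,
        which is why it turns into mostly sequential I/O."""
        first = start - start % BUCKET_SECONDS
        return [KEY.pack(streamid, t)
                for t in range(first, end + 1, BUCKET_SECONDS)]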

  11. On-Disk Format • Represent in memory with a materialized structure: 32 B/rec • Inefficient on disk: lots of repeated data, missing fields • Solution: compression • First: delta-encode each bucket in a protocol buffer • Second: Huffman tree or run-length encoding (zlib) • Combined compression 2x better than gzip or either stage alone • 1M rec/second compress/decompress on modest hardware [diagram: bdb page → compress]
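  A rough sketch of the two-stage scheme, with plain structs standing in for the protocol-buffer step; the (timestamp, value) record layout is an assumption for illustration.

    import struct
    import zlib

    def compress_bucket(readings):
        """readings: list of (timestamp, value) pairs, sorted by timestamp.
        Stage 1: delta-encode timestamps so regular spacing becomes tiny
                 integers (the server does this inside a protocol buffer).
        Stage 2: zlib squeezes the remaining redundancy."""
        out = bytearray()
        prev_ts = 0
        for ts, value in readings:
            out += struct.pack(">Id", ts - prev_ts, value)
            prev_ts = ts
        return zlib.compress(bytes(out))

    def decompress_bucket(blob):
        raw = zlib.decompress(blob)
        rec = struct.Struct(">Id")
        readings, ts = [], 0
        for off in range(0, len(raw), rec.size):
            dt, value = rec.unpack_from(raw, off)
            ts += dt
            readings.append((ts, value))
        return readings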

  12. Other Services: Storage Mapping • What is in the database? • Compute a set of tuples (start, end, n) • The desired interpretation is "the data source was alive" • Different data sources have different ways of maintaining this information and maintaining confidence • Sometimes you have to infer it from the data • Sometimes data sources give you liveness/presence guarantees: "I haven't heard from you in an hour, but I'm still alive!" [diagram: dead or alive?]
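  One way to infer the (start, end, n) tuples from the data itself, assuming a gap threshold beyond which the source is considered dead (the threshold here is an arbitrary choice for the sketch):

    MAX_GAP = 3600   # assumed: silence longer than this means "dead"

    def storage_map(timestamps, max_gap=MAX_GAP):
        """Given a sorted list of reading timestamps, return (start, end, n)
        tuples for intervals during which the source appeared alive."""
        intervals = []
        if not timestamps:
            return intervals
        start = prev = timestamps[0]
        n = 1
        for ts in timestamps[1:]:
            if ts - prev > max_gap:          # gap too long: close the interval
                intervals.append((start, prev, n))
                start, n = ts, 0
            prev = ts
            n += 1
        intervals.append((start, prev, n))
        return intervals

    # Two readings, a long silence, then three more:
    print(storage_map([0, 60, 10000, 10060, 10120]))
    # -> [(0, 60, 2), (10000, 10120, 3)]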

  13. readingdb6 • Up since December, supporting Cory Hall, SDH Hall, and most other LoCal deployments • Behind www.openbms.org • > 2 billion points in 10k streams • 12 GB on disk ~= 5 B/rec including index • So... we fit in memory! • Import at around 300k points/sec • We maxed out the NIC

  14. Low Latency RPC

  15. Compression ratios

  16. Write load • Importing old data: 150k points/sec • Continuous write load: 300-500 pts/sec

  17. Future thoughts • A component of a cloud storage stack for physical data • Hadoop adaptor: improve MapReduce performance over the HBase solution • The data is small: 2 billion points in 12 GB • We can go a long time without distributing this very much • Distribution is probably necessary for reasons other than performance

  18. The end
