Designing for 20TB Disk Drives And “enterprise storage”

Designing for 20TB Disk Drives and “Enterprise Storage”. Jim Gray, Microsoft Research. Disk evolution: capacity grows 100x in 10 years (1 TB 3.5” drive in 2005, 20 TB? in 2012?!); system on a chip; high-speed SAN.

Presentation Transcript


  1. Designing for 20TB Disk Drives and “Enterprise Storage” • Jim Gray, Microsoft Research

  2. Disk Evolution • Capacity: 100x in 10 years; 1 TB 3.5” drive in 2005, 20 TB? in 2012?! • System on a chip • High-speed SAN • Disk replacing tape • Disk is a supercomputer! • [Figure: scale of prefixes Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]

  3. Disks are becoming computers • Smart drives • Camera with micro-drive • Replay / Tivo / Ultimate TV • Phone with micro-drive • MP3 players • Tablet • Xbox • Many more… • [Diagram: Applications (Web, DBMS, Files) on an OS on a Disk Ctlr + 1 GHz CPU + 1 GB RAM; Comm: Infiniband, Ethernet, radio…]

  4. Intermediate Step: Shared Logic • Brick with 8-12 disk drives • 200 mips/arm (or more) • 2 x Gbps Ethernet • General-purpose OS • 10k$/TB to 100k$/TB • Shared: sheet metal, power, support/config, security, network ports • These bricks could run applications (e.g. SQL or Mail or ...) • Examples: Snap ~1TB 12x80GB NAS; NetApp ~.5TB 8x70GB NAS; Maxstor ~2TB 12x160GB NAS; IBM TotalStorage ~360GB 10x36GB NAS

  5. Hardware • Homogeneous machines lead to quick response through reallocation • HP desktop machines, 320MB RAM, 3U high, 4 100GB IDE drives • $4k/TB (street), 2.5 processors/TB, 1GB RAM/TB • 3 weeks from ordering to operational • Slide courtesy of Brewster Kahle, @ Archive.org

  6. Disk as Tape • Tape is unreliable, specialized, slow, low density, not improving fast, and expensive • Using removable hard drives to replace tape’s function has been successful • When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used. • Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good. Slide courtesy of Brewster Kahle, @ Archive.org

  7. Disk As Tape: What format? • Today I send NTFS/SQL disks. • But that is not a good format for Linux. • Solution: Ship NFS/CIFS/ODBC servers (not disks) • Plug “disk” into LAN. • DHCP then file or DB server via standard interface. • Web Service in long term

  8. State is Expensive • Stateless clones are easy to manage • App servers are middle tier • Cost goes to zero with Moore’s law. • One admin per 1,000 clones. • Good story about scaleout. • Stateful servers are expensive to manage • 1TB to 100TB per admin • Storage cost is going to zero (2k$ to 200k$). • Cost of storage is management cost

  9. Databases (== SQL) • VLDB survey (Winter Corp). • 10 TB to 100TB DBs. • Size doubling yearly • Riding disk Moore’s law • 10,000 disks at 18GB is 100TB cooked. • Mostly DSS and data warehouses. • Some media managers

  10. Interesting facts • No DBMSs beyond 100TB. • Most bytes are in files. • The web is file centric. • eMail is file centric. • Science (and batch) is file centric. • But… • SQL performance is better than CIFS/NFS. • CISC vs RISC

  11. BaBar: the biggest DB • 500 TB • Uses Objectivity™ • SLAC events • Linux cluster scans DB looking for patterns

  12. 300 TB (cooked): Hotmail / Yahoo • Clone front ends ~10,000 @ hotmail. • Application servers • ~100 @ hotmail • Get mail box • Get/put mail • Disk bound • ~30,000 disks • ~20 admins

  13. AOL (msn) (1PB?) • 10 B transactions per day (msn: 10% of that) • Huge storage • Huge traffic • Lots of eye candy • DB used for security/accounting. • Guess: AOL is a petabyte • (40M x 10MB = 400 x 10^12 bytes)

  14. Google: 1.5 PB as of last spring • 8,000 no-name PCs • Each 1/3U, 2 x 80 GB disk, 2 cpu, 256MB RAM • 1.4 PB online. • 2 TB RAM online • 8 TeraOps • Slice-price is 1K$ so 8M$. • 15 admins (!) (== 1/100TB).

  15. Astronomy • I’ve been trying to apply DB to astronomy • Today they are at 10TB per data set • Heading for Petabytes • Using Objectivity • Trying SQL (talk to me offline)

  16. Scale Out: Buy Computing by the Slice • 709,202 tpmC! == 1 billion transactions/day • Slice: 8 cpu, 8GB, 100 disks (=1.8TB) • 20 ktpmC per slice, ~300k$/slice • Clients and 4 DTC nodes not shown
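
The "1 billion transactions/day" on this slide is just the tpmC figure multiplied out. A minimal sketch of that arithmetic, assuming the benchmark rate were sustained for a full 24 hours (the function names are illustrative, not from the talk):

```python
MINUTES_PER_DAY = 24 * 60


def transactions_per_day(tpm: float) -> float:
    """Convert a transactions-per-minute rate into transactions per day."""
    return tpm * MINUTES_PER_DAY


def implied_slices(total_tpmc: float, tpmc_per_slice: float) -> float:
    """Number of slices implied by a total rate and a per-slice rate."""
    return total_tpmc / tpmc_per_slice


if __name__ == "__main__":
    # Figures taken from the slide: 709,202 tpmC total, 20 ktpmC per slice.
    print(f"{transactions_per_day(709_202):,.0f} transactions/day")
    print(f"about {implied_slices(709_202, 20_000):.0f} slices")
```

The slice count is only what the two slide figures imply when divided; the slide itself does not state it.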

  17. ScaleUp: A Very Big System! • UNISYS Windows 2000 Data Center Limited Edition • 32 cpus • 32 GB of RAM • 1,061 disks (15.5 TB) • 24 fiber channel • Will be helped by 64-bit addressing

  18. Hardware • 8 Compaq DL360 “Photon” Web Servers • 4 Compaq ProLiant 8500 Db Servers • One SQL database per rack (SQL\Inst1, SQL\Inst2, SQL\Inst3, Spare) • Each rack contains 4.5 TB • 261 total drives / 13.7 TB total • Fiber SAN switches • Meta data stored on 101 GB “Fast, Small Disks” (18 x 18.2 GB) • Imagery data stored on 4 339 GB “Slow, Big Disks” (15 x 73.8 GB) • To add 90 72.8 GB disks in Feb 2001 to create 18 TB SAN • [Diagram: SAN layout with storage units labeled 2200 and E through S]

  19. Amdahl’s Balance Laws • Parallelism law: if a computation has a serial part S and a parallel component P, then the maximum speedup is (S+P)/S. • Balanced system law: a system needs a bit of IO per second per instruction per second: about 8 MIPS per MBps. • Memory law: α = 1: the MB/MIPS ratio (called alpha, α) in a balanced system is 1. • IO law: programs do one IO per 50,000 instructions.
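
The four laws on this slide can be written as one-line formulas. A minimal sketch, using only the constants quoted on the slide (8 MIPS per MBps, alpha = 1, 50,000 instructions per IO); the function names and the 1,000-mips example machine are illustrative assumptions:

```python
def max_speedup(serial: float, parallel: float) -> float:
    """Parallelism law: the maximum speedup is (S + P) / S."""
    if serial <= 0:
        raise ValueError("serial part must be positive")
    return (serial + parallel) / serial


def balanced_io_mbps(mips: float, mips_per_mbps: float = 8.0) -> float:
    """Balanced system law: about 8 MIPS per MBps of IO."""
    return mips / mips_per_mbps


def balanced_memory_mb(mips: float, alpha: float = 1.0) -> float:
    """Memory law: alpha = MB/MIPS is 1 in a balanced system."""
    return alpha * mips


def ios_per_second(mips: float, instructions_per_io: float = 50_000) -> float:
    """IO law: one IO per 50,000 instructions."""
    return mips * 1e6 / instructions_per_io


if __name__ == "__main__":
    mips = 1_000  # hypothetical 1,000-mips processor
    print(max_speedup(serial=1, parallel=9))
    print(balanced_io_mbps(mips), "MBps of IO")
    print(balanced_memory_mb(mips), "MB of RAM")
    print(ios_per_second(mips), "IO/s")
```

Under these laws a hypothetical 1,000-mips processor would want roughly 125 MBps of IO, 1 GB of RAM, and 20,000 IO/s to stay balanced.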

  20. Amdahl’s Laws Valid 35 Years Later? • Parallelism law is algebra: so SURE! • Balanced system laws? • Look at tpc results (tpcC, tpcH) at http://www.tpc.org/ • Some imagination needed: • What’s an instruction (CPI varies from 1-3)? • RISC, CISC, VLIW, … clocks per instruction,… • What’s an I/O?

  21. TPC systems • Normalize for CPI (clocks per instruction) • TPC-C has about 7 ins/byte of IO • TPC-H has 3 ins/byte of IO • TPC-H needs ½ as many disks, sequential vs random • Both use 9GB 10 krpm disks (need arms, not bytes)

      | System             | MHz/cpu | CPI | mips | KB/IO | IO/s/disk | Disks | Disks/cpu | MB/s/cpu | Ins/IO byte |
      |--------------------|---------|-----|------|-------|-----------|-------|-----------|----------|-------------|
      | Amdahl             | 1       | 1   | 1    | 6     |           |       |           |          | 8           |
      | TPC-C = random     | 550     | 2.1 | 262  | 8     | 100       | 397   | 50        | 40       | 7           |
      | TPC-H = sequential | 550     | 1.2 | 458  | 64    | 100       | 176   | 22        | 141      | 3           |
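
The derived figures on this slide (mips, MB/s/cpu, instructions per IO byte) can be recomputed from the measured ones. A minimal sketch, assuming mips = MHz / CPI, MB/s/cpu = KB/IO x IO/s/disk x disks/cpu, and 1 MB = 1,000 KB; the function name is illustrative:

```python
def derived_columns(mhz, cpi, kb_per_io, ios_per_disk, disks_per_cpu):
    """Return (mips, MB/s per cpu, instructions per IO byte)."""
    mips = mhz / cpi
    mb_per_s = kb_per_io * ios_per_disk * disks_per_cpu / 1000
    ins_per_byte = mips / mb_per_s
    return mips, mb_per_s, ins_per_byte


if __name__ == "__main__":
    # Input values are the TPC-C and TPC-H figures quoted on the slide.
    for name, args in [
        ("TPC-C", (550, 2.1, 8, 100, 50)),
        ("TPC-H", (550, 1.2, 64, 100, 22)),
    ]:
        mips, mbps, ins = derived_columns(*args)
        print(f"{name}: {mips:.0f} mips, {mbps:.0f} MB/s/cpu, {ins:.1f} ins/byte")
```

Rounded, these reproduce the slide's 262 mips / 40 MB/s / 7 ins/byte for TPC-C and 458 mips / 141 MB/s / 3 ins/byte for TPC-H.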

  22. TPC systems: What’s alpha (=MB/MIPS)? Hard to say: • Intel 32 bit addressing (= 4GB limit). Known CPI. • IBM, HP, Sun have 64 GB limit. Unknown CPI. • Look at both, guess CPI for IBM, HP, Sun • Alpha is between 1 and 6

  23. Performance (on current SDSS data) • Run times on a 15k$ COMPAQ server (2 cpu, 1 GB, 8 disk) • Some take 10 minutes • Some take 1 minute • Median ~22 sec. • GHz processors are fast! • (10 mips/IO, 200 ins/byte) • 2.5 m rec/s/cpu • ~1,000 IO/cpu sec • ~64 MB IO/cpu sec

  24. How much storage do we need? • Soon everything can be recorded and indexed • Most bytes will never be seen by humans. • Data summarization, trend detection, anomaly detection are key technologies • See Mike Lesk: How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html • See Lyman & Varian: How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/ • [Figure: scale from Kilo to Yotta (Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta) with markers A Book, A Photo, Movie, All LoC books (words), All Books MultiMedia, Everything! Recorded; small prefixes: 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli]

  25. Standard Storage Metrics • Capacity: • RAM: MB and $/MB: today at 512MB and 200$/GB • Disk: GB and $/GB: today at 80GB and 70k$/TB • Tape: TB and $/TB: today at 40GB and 10k$/TB (nearline) • Access time (latency): • RAM: 100 ns • Disk: 15 ms • Tape: 30 second pick, 30 second position • Transfer rate: • RAM: 1-10 GB/s • Disk: 10-50 MB/s (arrays can go to 10GB/s) • Tape: 5-15 MB/s (arrays can go to 1GB/s)

  26. New Storage Metrics: Kaps, Maps, SCAN • Kaps: How many kilobyte objects served per second • The file server, transaction processing metric • This is the OLD metric. • Maps: How many megabyte objects served per sec • The Multi-Media metric • SCAN: How long to scan all the data • the data mining and utility metric • And • Kaps/$, Maps/$, TBscan/$

  27. More Kaps and Kaps/$ but… • Disk accesses got much less expensive: better disks, cheaper disks! • But: disk arms are expensive, the scarce resource • 1 hour scan vs 5 minutes in 1990 • [Figure: 100 GB disk, 30 MB/s]

  28. Data on Disk Can Move to RAM in 10 years • [Figure labels: 100:1, 10 years]

  29. The “Absurd” 10x (=4 year) Disk • 2.5 hr scan time (poor sequential access) • 1 aps / 5 GB (VERY cold data) • It’s a tape! • [Figure: 1 TB disk, 100 MB/s, 200 Kaps]
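
The scan times quoted for the 100 GB / 30 MB/s disk and the "absurd" 1 TB / 100 MB/s disk are capacity divided by transfer rate, and the "1 aps / 5 GB" figure is capacity divided by accesses per second. A minimal sketch, assuming decimal units (1 GB = 1,000 MB) and illustrative function names:

```python
def scan_hours(capacity_gb: float, transfer_mb_per_s: float) -> float:
    """Time to read the whole disk sequentially, in hours."""
    seconds = capacity_gb * 1000 / transfer_mb_per_s
    return seconds / 3600


def gb_per_access_per_second(capacity_gb: float, kaps: float) -> float:
    """Gigabytes of stored data behind each access per second."""
    return capacity_gb / kaps


if __name__ == "__main__":
    print(f"100 GB at 30 MB/s: {scan_hours(100, 30):.2f} h")
    print(f"1 TB at 100 MB/s: {scan_hours(1000, 100):.2f} h")
    print(f"1 TB at 200 Kaps: {gb_per_access_per_second(1000, 200):.0f} GB per aps")
```

With these assumptions the 100 GB disk scans in a little under an hour and the 1 TB disk in about 2.8 hours, in the same range as the slide's rounded 1 hour and 2.5 hr.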

  30. It’s Hard to Archive a Petabyte: It takes a LONG time to restore it. • At 1GBps it takes 12 days! • Store it in two (or more) places online (on disk?): a geo-plex • Scrub it continuously (look for errors) • On failure, • use other copy until failure repaired, • refresh lost copy from safe copy. • Can organize the two copies differently (e.g.: one by time, one by space)
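
The "12 days" figure is the size of a petabyte divided by the restore bandwidth. A minimal sketch, assuming decimal units (1 PB = 10^15 bytes, 1 GBps = 10^9 bytes/s) and an illustrative function name:

```python
SECONDS_PER_DAY = 24 * 3600


def restore_days(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Days needed to move size_bytes at a sustained rate."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_DAY


if __name__ == "__main__":
    petabyte = 1e15
    gigabyte_per_s = 1e9
    print(f"{restore_days(petabyte, gigabyte_per_s):.1f} days")
```

This comes to about 11.6 days of uninterrupted transfer, which the slide rounds to 12.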

  31. Auto Manage Storage • 1980 rule of thumb: • A DataAdmin per 10GB, SysAdmin per mips • 2000 rule of thumb • A DataAdmin per 5TB • SysAdmin per 100 clones (varies with app). • Problem: • 5TB is 50k$ today, 5k$ in a few years. • Admin cost >> storage cost !!!! • Challenge: • Automate ALL storage admin tasks

  32. How to cool disk data: • Cache data in main memory • See 5 minute rule later in presentation • Fewer-larger transfers • Larger pages (512-> 8KB -> 256KB) • Sequential rather than random access • Random 8KB IO is 1.5 MBps • Sequential IO is 30 MBps (20:1 ratio is growing) • Raid1 (mirroring) rather than Raid5 (parity).

  33. Data delivery costs 1$/GB today • Rent for “big” customers: 300$/megabit per second per month • Improved 3x in last 6 years (!). • That translates to 1$/GB at each end. • You can mail a 160 GB disk for 20$. • That’s 16x cheaper • If overnight it’s 4 MBps. 3x160 GB ~ ½ TB
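
The comparison on this slide can be checked with a few lines of arithmetic. A minimal sketch, assuming a 30-day month, decimal units, and that "at each end" means the sender and the receiver each pay about 1$/GB; the function names are illustrative:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600


def gb_per_month(megabits_per_s: float) -> float:
    """Gigabytes moved in a month over a fully used link."""
    bytes_per_s = megabits_per_s * 1e6 / 8
    return bytes_per_s * SECONDS_PER_MONTH / 1e9


def network_cost_per_gb(rent_per_mbps_month: float) -> float:
    """Dollars per GB at one end of the link."""
    return rent_per_mbps_month / gb_per_month(1)


def mail_cost_per_gb(postage: float, disk_gb: float) -> float:
    """Dollars per GB when mailing a disk."""
    return postage / disk_gb


if __name__ == "__main__":
    one_end = network_cost_per_gb(300)
    mail = mail_cost_per_gb(20, 160)
    print(f"network: {one_end:.2f} $/GB at each end")
    print(f"mail: {mail:.3f} $/GB")
    print(f"ratio with 1 $/GB at both ends: {2 * 1.0 / mail:.0f}x")
```

At 300$ per megabit per second per month, one end comes to a little over 0.9$/GB, which the slide rounds to 1$/GB; two ends at 1$/GB against 0.125$/GB for the mailed disk gives the 16x.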
