Backup & Recovery of Physics Databases

Jacek Wojcieszuk, IT-DM Technical Meeting, December 16th, 2008


Presentation Transcript


  1. Backup & Recovery of Physics Databases
     Jacek Wojcieszuk, IT-DM Technical Meeting, December 16th, 2008

  2. Outline
     • Why back up?
     • Current implementation – Maximum Availability Architecture
     • Main concerns
     • Possible improvements

  3. Why back up?
     • Backup is one of the main techniques for data protection
     • A properly planned and implemented backup and recovery strategy is critical for business continuity
     • Ideally, backups should allow recovery from any kind of failure without data loss

  4. Types of failures
     • Oracle instance failure
         • Usually due to an Oracle process failure
     • Media failure
         • Disk failure, controller failure, etc.
     • Physical data corruption
     • Human error
         • In most cases accidentally deleted/updated data
         • Caused by a database user or a DBA
     • Disaster
         • Fire, flood, earthquake, plane crash, overvoltage, etc.

  5. Available tools
     • Oracle offers many tools that help to back up data and address failures:
         • Recovery Manager (RMAN)
         • Data Guard
         • Export/Import
         • Data Pump
         • Streams
     • Oracle supports using OS and hardware features for taking backups:
         • snapshots
         • cp command
     • None of the tools mentioned can, on its own, protect the data from all types of failures

  6. Oracle Maximum Availability Architecture (MAA)
     • Oracle's best practices blueprint
     • Goal: to achieve the optimal high availability architecture at the lowest cost and complexity
     • Helps to minimize the impact of different types of unplanned and planned downtime
     • Is based on Oracle products/features such as:
         • RAC
         • ASM
         • RMAN
         • Flashback
         • Data Guard
     • http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm

  7. Data protection at CERN – RAC + ASM + on-disk copies
     [Diagram labels: Clients, WAN/Intranet, RAC database with ASM, RMAN, SAN, TSM]

  8. Data Guard Physical Standby Database
     [Diagram labels: WAN/Intranet, Data changes, Primary RAC database with ASM, Physical Standby RAC database with ASM, RMAN]

  9. Backup implementation - summary
     • Tape-based incremental backup strategy
         • Bi-weekly full backups
         • Daily incremental backups
         • Hourly archivelog backups
     • Disk-based incrementally updated image copies
         • The copy lags 2-3 days behind the database to facilitate handling of human errors
     • Oracle Data Guard physical standby databases
         • Configured for a 1-day lag
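
The purpose of the two time-lagged copies listed above can be illustrated with a small Python sketch. It assumes, for illustration only, that each copy applies changes strictly in order and exactly at its configured lag (1 day for the standby, and the lower bound of 2 days for the image copy), so a copy is still free of a human error as long as less time than its lag has passed since the error. The names `LAGGED_COPIES` and `copies_unaffected` are hypothetical and not part of any Oracle tool.

```python
from datetime import timedelta
from typing import List

# Lags taken from the summary above; the image copy uses the lower bound
# of the stated 2-3 day range (an assumption made for this sketch).
LAGGED_COPIES = {
    "Data Guard physical standby": timedelta(days=1),
    "on-disk image copy": timedelta(days=2),
}


def copies_unaffected(time_since_error: timedelta) -> List[str]:
    """Return the lagged copies that have not yet applied the erroneous change.

    A copy lagging by `lag` has only applied changes older than `lag`,
    so it still predates the error while time_since_error < lag.
    """
    return [
        name
        for name, lag in LAGGED_COPIES.items()
        if time_since_error < lag
    ]


if __name__ == "__main__":
    for hours in (6, 30, 72):
        print(hours, "h after the error:", copies_unaffected(timedelta(hours=hours)))
```

Under these assumptions, an error noticed within a day can be handled from either copy, one noticed within two days only from the image copy, and anything older falls back to the tape backups.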

  10. Failure handling

  11. Main concerns
     • Data volume
         • Very quick, linear growth of databases (especially during the LHC run)
         • Some databases may reach 10 TB as early as next year
     • Database availability
         • On-line databases are highly critical for data taking
             • A few hours of downtime can already lead to data loss
         • Off-line databases are critical for data distribution and analysis

  12. ... More concerns
     • As data volume increases:
         • The probability of physical data corruption increases
         • The frequency of human errors increases
     • The traditional RMAN and tape-based approach doesn’t fit well into this picture:
         • It leads to backup and recovery times proportional to, or at least dependent on, data volume
         • Currently at CERN the speed of backup/recovery to/from tapes is limited by the speed of 1 Gb Ethernet
     • The standby database is located in the same building as the primary

  13. Possible improvements
     • LAN-free backups to tapes
     • Disk pool instead of tapes
     • Declaring data read-only
     • Archiving old data to limit database growth

  14. LAN-free tape backups
     • Traditionally at CERN, tape backups are sent over a general-purpose network to a media management server:
         • This limits backup/recovery speed to ~80 MB/s
         • Backup/restore of a 10 TB database takes almost 40 hours!
     • At the same time, tape drives can archive data at a speed of 160 MB/s (compressed)
     [Diagram labels: TSM Server, Metadata, Database, RMAN backups, 1Gb, Tape drives]
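
The arithmetic behind these figures can be checked with a short Python sketch. It only computes raw transfer time from the two rates quoted on the slide (~80 MB/s over the network, 160 MB/s at the tape drive); the function name `transfer_hours` and the choice between decimal and binary units are assumptions made for illustration, and real backups add overheads this does not model.

```python
def transfer_hours(size_tb: float, rate_mb_per_s: float, binary_units: bool = False) -> float:
    """Hours needed to move size_tb terabytes at rate_mb_per_s megabytes per second.

    With binary_units=True one TB is counted as 1024 * 1024 MB,
    otherwise as 1,000,000 MB.
    """
    mb_per_tb = 1024 * 1024 if binary_units else 1_000_000
    seconds = size_tb * mb_per_tb / rate_mb_per_s
    return seconds / 3600


# Rates quoted on the slide: network-limited vs. tape-drive speed.
lan_hours = transfer_hours(10, 80)
drive_hours = transfer_hours(10, 160)

if __name__ == "__main__":
    print(f"10 TB at  80 MB/s: {lan_hours:.1f} h")
    print(f"10 TB at 160 MB/s: {drive_hours:.1f} h")
```

Pure transfer time at 80 MB/s comes to roughly 35 to 36 hours depending on the unit convention, which is in line with the slide's "almost 40 hours" if some overhead is allowed for; at the drive's rated speed the same volume would take about half as long.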

  15. LAN-free tape backups (2)
     • Tivoli Storage Manager supports so-called LAN-free backup
     • When using a LAN-free configuration:
         • Backup data flows to tape drives directly over the SAN
         • The Media Management Server is used only to register backups
     • Very good performance observed during tests (see next slide)
     [Diagram labels: TSM Server, Metadata, Database, RMAN backups, 1Gb, Tape drives]

  16. LAN-free tape backups - tests
     • 1 TB test database with contents very similar to one of the production DBs
         • ~5% of empty blocks
     • Different TSM configurations:
         • TCP and Shared Memory mode
     • Backups taken using 1 or 2 streams
     • Restore tests done using 1 stream only
     • Performance of the test with 2 streams was affected by Oracle software issues (followed up with Oracle Support)

  17. Using a disk pool instead of tapes
     • Tape infrastructure is expensive and difficult to maintain:
         • costly hardware and software
         • noticeable maintenance effort
         • tape media is quite unreliable and needs to be validated
     • At the same time, disk space is getting cheaper and cheaper:
         • 1.5 TB SATA disks already available
     • A pool of disks can easily be configured as the destination for RMAN backups:
         • Can simplify backup infrastructure
         • Can improve backup performance
         • Can increase backup reliability

  18. Using a disk pool instead of tapes (2)
     • Several configurations possible:
         • NFS-mounted disks
         • SAN-attached storage with a file system (tested)
         • SAN-attached storage with ASM (tested)
     [Diagram labels: RMAN backups, Storage Area Network, Database, Remote storage]

  19. Using a disk pool instead of tapes - tests
     • Tests performed as part of the backup & recovery implementation for the LHCb on-line database
     • 1x16-bay disk array used as backup storage
         • 2x8-disk RAID 5 devices
     • Backup storage configured either as an ASM diskgroup or an ext3 file system
     • Tests performed with 4 streams
     • Tests need to be repeated in the testbed used for testing LAN-free backups to tapes

  20. Declaring data read-only
     • Tablespaces containing static data can be declared read-only
     • Read-only tablespaces do not need to be backed up as often as read-write ones
     • This may significantly reduce the amount of resources needed for backups
     • For restores, one can consider restoring read-only data after the read-write part of the database has been restored
     • We are improving our backup scripts to handle read-only data properly
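
The idea of backing up read-only tablespaces less often can be sketched in Python as follows. This is not the backup script mentioned on the slide: the `Tablespace` record, the `needs_backup` function and both intervals (1 day for read-write, 90 days for read-only) are hypothetical values chosen only to show the selection logic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Tablespace:
    name: str
    read_only: bool
    last_backup: Optional[datetime]  # None if never backed up


def needs_backup(
    ts: Tablespace,
    now: datetime,
    rw_interval: timedelta = timedelta(days=1),   # assumed interval for read-write data
    ro_interval: timedelta = timedelta(days=90),  # assumed interval for read-only data
) -> bool:
    """Decide whether a tablespace is due for backup.

    Read-only tablespaces hold static data, so they use the longer interval.
    A tablespace that has never been backed up is always due.
    """
    if ts.last_backup is None:
        return True
    interval = ro_interval if ts.read_only else rw_interval
    return now - ts.last_backup >= interval


if __name__ == "__main__":
    now = datetime(2008, 12, 16)
    week_ago = now - timedelta(days=7)
    for ts in (
        Tablespace("DATA_2008", read_only=False, last_backup=week_ago),
        Tablespace("DATA_2006", read_only=True, last_backup=week_ago),
    ):
        print(ts.name, "due for backup:", needs_backup(ts, now))
```

With these assumed intervals, a week-old backup is stale for a read-write tablespace but still current for a read-only one, which is where the resource saving comes from.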

  21. Archiving legacy data
     • Data collected by some big applications is accessed for a very limited period of time only
         • Later on it is not needed but has to be kept ‘just in case’
     • Keeping such data on-line may badly affect database performance and time-to-recover
         • To avoid that, legacy data should be archived
     • Archiving Oracle data is not an easy task and cannot be done transparently by DBAs
         • Proper application design is vital
             • Splitting data using Oracle partitioning or schemas
             • Ensuring self-containment of data from different periods

  22. Data Archiving – possible implementation
     • Several implementations are possible – the simplest involves creating an archive DB

  23. Conclusions
     • Production databases are growing very large
         • Recovery time in case of failure becomes critical
     • Certain types of failures require a database restore from a tape backup
         • Time-to-recover is proportional to the database size
         • Hours or days in the case of big databases
     • LAN-free backups to tapes can significantly shorten backup & recovery time and lead to better resource utilization
     • Replacing tapes with a disk pool can also significantly decrease backup and recovery time
     • Declaring data read-only and archiving can further help to keep the restore time reasonable

  24. Acknowledgements
     • Many thanks to Dawid, Luca and Lukasz, who helped with the tests and Oracle tuning

  25. Q&A
     Thank you
