
Data Deduplication in Tivoli Storage Manager 6.1

Data Deduplication in Tivoli Storage Manager 6.1. Dave Canan Advanced Technical Support Support Technical Exchange July 2009 ddcanan@us.ibm.com. ATS Team. Dave Canan ddcanan@us.ibm.com Dave Daun djdaun@us.ibm.com Tom Hepner hep@us.ibm.com Randy Larson larsonr@us.ibm.com.


Presentation Transcript


  1. Data Deduplication in Tivoli Storage Manager 6.1 Dave Canan Advanced Technical Support Support Technical Exchange July 2009 ddcanan@us.ibm.com TSM Lunch and Learn Data Deduplication

  2. ATS Team
     • Dave Canan: ddcanan@us.ibm.com
     • Dave Daun: djdaun@us.ibm.com
     • Tom Hepner: hep@us.ibm.com
     • Randy Larson: larsonr@us.ibm.com

  3. Previous TSM V6 Sessions
     • Lunch-and-Learn: Internal IBM Sessions (http://w3-03.ibm.com/support/techdocs/atsmastr.nsf/Web/Techdocs)
         • Preparing for next release of TSM (January 22, 2009)
         • Technical Overview of TSM V6 (February 19, 2009)
         • TSM V6 Database Upgrade and Backup/Restore (April 2, 2009 Part 1)
         • TSM V6 Database Upgrade and Backup/Restore (April 9, 2009 Part 2)
     • Support Technical Exchange: Customer Sessions (http://www-01.ibm.com/software/sysmgmt/products/support/supp_tech_exch.html)
         • TSM V6 Announcement Technical Overview (February 26, 2009)
         • Preparing for TSM V6 (March 19, 2009)
         • TSM V6 Database Upgrade and Backup/Restore (April 28, 2009 Part 1)
         • TSM V6 Database Upgrade and Backup/Restore (May 1, 2009 Part 2)

  4. Agenda
     • Deduplication concepts
     • Data deduplication in TSM 6.1
     • Planning for data deduplication in TSM V6
     • New/changed commands and options for data deduplication in TSM V6

  5. Deduplication Concepts

  6. Deduplication Concept
     [Figure: data at source locations (backup client machines: data source 1, data source 2, data source 3) and the target data store (backup server), with subfiles A through L marked as unique subfiles or duplicate subfiles.]

  7. How Deduplication Works
     [Figure: data stores holding chunks A, B and C at each stage of the process.]
     1. Data chunks are evaluated to determine a unique signature for each
     2. Signature values are compared to identify all duplicates
     3. Duplicate data chunks are replaced with pointers to a single stored chunk, saving storage space
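The three steps above can be sketched in a few lines of Python. This is only an illustration of the general technique, not TSM's implementation: the choice of SHA-256 as the signature, the in-memory dictionary used as the data store, and the function names `deduplicate` and `reassemble` are all assumptions made for the example.

```python
import hashlib


def deduplicate(chunks):
    """Store each unique chunk once and return (store, pointers).

    Assumption: SHA-256 stands in for the "unique signature"; the
    presentation does not say which signature function is used.
    """
    store = {}      # signature -> the single stored copy of the chunk
    pointers = []   # one signature per input chunk, in original order
    for chunk in chunks:
        signature = hashlib.sha256(chunk).hexdigest()  # step 1: signature
        store.setdefault(signature, chunk)             # step 2: detect duplicate
        pointers.append(signature)                     # step 3: keep a pointer
    return store, pointers


def reassemble(store, pointers):
    """Rebuild the original data by following the pointers."""
    return b"".join(store[signature] for signature in pointers)


chunks = [b"A", b"B", b"C", b"A", b"B", b"A"]
store, pointers = deduplicate(chunks)
print(len(chunks), "chunks received,", len(store), "chunks stored")
```

Six chunks arrive but only three are kept; the list of pointers is what allows the original data to be rebuilt on restore.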

  8. Data Deduplication Value Proposition
     Potential advantages
     • Reduced storage capacity required for a given amount of data
     • Ability to store significantly more data on a given amount of disk
     • Restore from disk rather than tape may improve ability to meet recovery time objective (RTO)
     • Network bandwidth savings (some implementations)
     • Lower storage-management and energy costs resulting from reduced storage requirements
     Potential tradeoffs/limitations
     • Significant CPU and I/O resources required for deduplication processing
     • Deduplication might not be compatible with encryption
     • Increased sensitivity to media failure because many files could be affected by loss of a common chunk
     • Deduplication may not be suitable for data on tape because increased fragmentation of data could greatly increase access time

  9. Where Deduplication is Performed
     Note: Source-side and target-side deduplication are not mutually exclusive

  10. When Deduplication is Performed
      Note: In-band and out-of-band deduplication are not mutually exclusive

  11. Deduplication Ratios
      • Used to indicate the reduction achieved by deduplication
          • If deduplication reduces 500 TB of data to 100 TB, the ratio is 5:1
      • Ratios reflect design tradeoffs involving performance and reduction
      • Actual ratios will be highly dependent on other variables
          • Data from each source: redundancy, change rate, retention
          • Number of data sources and redundancy of data among those sources
          • Backup methodology: incremental forever, full+incremental, full+differential
          • Whether data encryption occurs prior to deduplication
      • In addition to the above variables, some vendors include data reduction achieved by incremental backup and conventional compression
      • Deduplication vendors claim ratios in the range 2:1 to 500:1
      • Meaningful comparison of ratios is extremely problematic. Beware of hype!
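The 5:1 example can be checked with a short calculation. The sketch below is illustrative only; the function names `dedup_ratio` and `space_saved_percent` are made up for this example. It also shows the same result expressed as a percentage of space saved, which is often easier to compare than a ratio.

```python
def dedup_ratio(before_tb, after_tb):
    """Return the N in an N:1 deduplication ratio."""
    return before_tb / after_tb


def space_saved_percent(before_tb, after_tb):
    """Return the space saved as a percentage of the original data."""
    return 100.0 * (before_tb - after_tb) / before_tb


# The slide's example: 500 TB of data reduced to 100 TB.
ratio = dedup_ratio(500, 100)
saved = space_saved_percent(500, 100)
print(f"{ratio:.0f}:1 ratio, {saved:.0f}% of space saved")
```

Note how quickly the percentage flattens out: a 5:1 ratio already saves 80% of the space, so very large claimed ratios add comparatively little additional saving.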

  12. Deduplication and Encryption
      Data encryption prior to deduplication processing can subvert data reduction.
      [Figure: data source 1 (no encryption), data source 2 (encryption key 1) and data source 3 (encryption key 2) each hold "Important text". After encryption the copies reach data deduplication and the data store as "Important text", "txpt tnatroemI" and "te tarpIxtntom".]
      1. Three data sources have the same text file
      2. After encryption, text files do not match
      3. Deduplication processing does not detect redundancy
      4. Text files are stored without data reduction
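A small experiment makes the same point. The sketch below is a toy: `toy_encrypt` is a hypothetical repeating-key XOR used purely for illustration (it is not real encryption and not what TSM clients use), and SHA-256 is assumed as the signature function.

```python
import hashlib


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Repeating-key XOR. Illustrative only; NOT a secure cipher."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def signature(data: bytes) -> str:
    """Assumed signature function for the example."""
    return hashlib.sha256(data).hexdigest()


text = b"Important text"

# Three sources send the same file without encryption: one signature.
plain_signatures = {signature(text) for _ in range(3)}

# Source 1 sends plain text; sources 2 and 3 encrypt with different keys.
received = [text, toy_encrypt(text, b"key-1"), toy_encrypt(text, b"key-2")]
mixed_signatures = {signature(r) for r in received}

print(len(plain_signatures), "unique signature without encryption")
print(len(mixed_signatures), "unique signatures with per-source encryption")
```

With per-source keys, every copy produces a different signature, so the deduplication step has nothing to match and all three copies are stored.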

  13. Data Deduplication in TSM V6

  14. Data Reduction with TSM Today
      [Figure labels: Client, Server, Storage Hierarchy]
      • Client compression
          • Files compressed by client before transmission
          • Conserves network bandwidth and server storage
      • Device compression
          • Compression performed by storage hardware
          • Conserves server storage
      • Incremental forever
          • After initial backup, file is not backed up again unless it changes
          • Conserves network bandwidth and server storage
      • Subfile backup
          • Only changed portions of files are transmitted
          • Conserves network bandwidth and server storage
      • Appliance deduplication
          • Deduplication performed by storage appliance (VTL or NAS)
          • Conserves server storage

  15. Native Data Deduplication in TSM
      • TSM’s incremental forever methodology greatly reduces data redundancy compared to traditional methodologies based on periodic full backups
      • Consequently, there is less potential for data reduction via deduplication in TSM than in other backup products
      • Nevertheless, deduplication is an important function for TSM because it allows more data objects to be stored on a given amount of disk for fast access
      • Native deduplication is a key product enhancement in TSM

  16. TSM 6.1 Deduplication Overview
      [Figure labels: Client 1, Client 2, Client 3, Deduplication, Server, TSM Database. Files A, B and C have common data.]
      • Deduplicated disk storage pool stores unique chunks to reduce disk utilization
      • Tape pool stores A, B, and C individually to avoid performance degradation during restore
      • Allows more objects to be stored on disk for fast access

  17. Deduplication Example
      1. Client1 backs up files A, B, C and D. Files A and C have different names, but the same data.
      2. Client2 backs up files E, F and G. File E has data in common with files B and G.
      3. Server process “chunks” the data and identifies duplicate chunks C1, E2 and G1.
      4. Reclamation processing recovers space occupied by duplicate chunks.
      [Figure: contents of server volumes Vol1, Vol2 and Vol3 after each step, with chunk labels A0, A1, B0, B1, B2, C0, C1, D0, D1, E1, E2, E3, F1 and G1.]

  18. Effects when moving or copying data

  19. Comparison of TSM Data Reduction Methods

  20. Planning for Data Deduplication in TSM V6

  21. Data Deduplication in Tivoli Storage Manager 6.1

  22. Considerations for Use of TSM Deduplication
      • Consider deduplication if
          • Data recovery would improve by storing more data objects on a limited amount of disk
          • Data will remain on disk for an extended period of time
          • There is much redundancy in data stored by TSM (e.g., for common operating-system or project files)
          • TSM server CPU and disk I/O resources are available for intensive processing to identify duplicate chunks
      • Deduplication might not be suitable for
          • Mission-critical data, whose recovery could be delayed by accessing chunks that are not stored contiguously
          • TSM servers that do not have sufficient resources
          • Data that will soon be migrated to tape

  23. Planning for TSM Deduplication
      • How do I want to control data deduplication processes?
          • You can have them running all the time and have them process data as transactions commit. This could be more CPU intensive.
          • You can run them after backups have completed and then cancel them after the identification of duplicate data has finished. Consider whether you have the extra processing time in the day to run this as a separate step. (See the discussion on controlling the identify duplicates process manually for suggestions.)
      • Do you want to set up new storage pools or use existing ones?
          • You may want only certain nodes to perform data deduplication. These nodes could be assigned to new policy domains and management classes that point to new storage pools.

  24. Planning for TSM Deduplication
      • How can I estimate my space savings with data deduplication in my environment? (2 possible techniques)
          • The best way to test this is with a test system that you can delete when done.
          • Another way: back up your data from a primary storage pool to a temporary copy storage pool that has data deduplication enabled, and use it to estimate the space savings. The downside of this technique is that it will increase the DB size.
              1. Create a copy stgpool using a devclass type of FILE.
              2. Do a BACKUP STGPOOL from the primary stgpool to the copy stgpool.
              3. Run IDENTIFY DUPLICATES against the volumes in the copy storage pool.
              4. When the IDENTIFY DUPLICATES process goes to idle state, set the reclamation threshold for the storage pool to 1%.
              5. After reclamation finishes, issue the q stgpool command against the copy storage pool to check the amount of space that was saved.
              6. If the results are satisfactory, update the primary stgpool to specify that deduplication is to be used. (Or, if it is of type DISK, move the data to a new stgpool that is defined with devclass FILE.)

  25. Planning for TSM Deduplication
      • How much additional log space is required for data deduplication?
          • There are many factors in calculating this. See the slide topic “How can I estimate my space savings with data deduplication in my environment?”. That test, in addition to looking at the number of logs archived during the identify process, could be used to estimate the log consumption during the data deduplication identification process.
      • Is data deduplication suitable for all types of disk subsystems?
          • Restores from a deduplicated storage pool are random in nature. For disk subsystems that are slower and do not have adequate cache (for example SATA), restores of deduplicated data will be slower.

  26. Planning for TSM Deduplication
      • What is the possible impact on restore performance if I implement TSM data deduplication?
          • Smaller files (less than 100K) that have been deduplicated will restore more slowly than files that are not deduplicated. Having more sessions doing the restore will improve restore performance.
      • How many Identify processes should I run on my TSM server?
          • The Identify process is both CPU and I/O intensive. You should run no more than “N” identify processes for an N-way CPU. Each identify process can use an entire CPU, so if you need CPU for other processes, use fewer.

  27. Planning for TSM Deduplication
      • What is the impact on database size when implementing data deduplication?
          • The average “chunk size” is 256K, and for each chunk, approximately 500 bytes of metadata are added to the TSM DB. Also, files smaller than 2K are not eligible for data deduplication.
      • How do I change my daily housekeeping schedules for data deduplication? Where does the Identify process fit in the daily cycle?
          • If you are going to run the identify process as part of daily housekeeping, run it after BACKUP STGPOOL has completed but before reclamation for devclass FILE storage pools has run. You need to know when the Identify Duplicates process has gone to IDLE state, because these processes are different from other TSM processes. Sample select to use for automation: select count(*) from processes where status like '%State: idle%' and process='Identify Process'
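The database-size figures on this slide (an average chunk of 256K and roughly 500 bytes of metadata per chunk) lend themselves to a rough estimate. The sketch below is a back-of-the-envelope calculation, not an IBM sizing formula. It assumes 256K means 256 KiB, that every chunk is exactly the average size, and it ignores the 2K eligibility limit; the function name and the 1 TiB pool size are hypothetical.

```python
AVG_CHUNK_BYTES = 256 * 1024      # "256K" from the slide, assumed to be KiB
METADATA_BYTES_PER_CHUNK = 500    # approximate figure from the slide


def estimated_db_growth_bytes(pool_bytes):
    """Rough estimate of TSM DB metadata added for a deduplicated pool.

    Assumes every chunk is exactly the average size, which real data
    will not satisfy; treat the result as an order-of-magnitude guide.
    """
    chunk_count = pool_bytes // AVG_CHUNK_BYTES
    return chunk_count * METADATA_BYTES_PER_CHUNK


one_tib = 2 ** 40  # hypothetical amount of data in the storage pool
growth = estimated_db_growth_bytes(one_tib)
print(f"about {growth / 2**30:.2f} GiB of metadata per TiB of pool data")
```

Under these assumptions, each TiB of data in a deduplicated pool adds a little under 2 GiB to the database.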

  28. Planning for TSM Deduplication
      • How should I start with data deduplication?
      • If you are going to use data deduplication on an existing storage pool, consider the following approach:
          • Initially set identify processes to 0.
          • On a daily basis, run identify duplicates for some duration until all volumes in the storage pool have been processed.
          • Then update the storage pool identifyprocess parameter to an appropriate value.
      • If you are going to use data deduplication in a new storage pool, consider the following approach:
          • Initially set identify processes to 1 or 2.
          • On a daily basis, do a move nodedata for a few nodes into the new storage pool, and then point them to a new policy domain. Identify will process that set of nodes’ data.
          • When all nodes have been moved, delete the old storage pool and change copygroups to reflect the new storage hierarchy.
          • Then update the storage pool identifyprocess parameter to an appropriate value.

  29. Expected Deduplication Behavior
      • Deployment of new clients or API applications is not required (but use of the TSM 6.1 client or higher may improve the deduplication ratio). The TSM 6.1 client separates out the metadata, which will slightly improve data deduplication for the client’s data.
      • Legacy data stored in or moved to enabled FILE storage pools can be deduplicated
      • Data migrated or copied to tape will be reduplicated to avoid excessive mounting and positioning during subsequent access
      • Ability to control number, duration and scheduling of CPU-intensive background processes for identification of duplicate data
      • Reporting of space savings in deduplicated storage pools
      • Deduplication processing will skip client-encrypted objects, but should work with storage-device encryption
      • Native TSM implementation, with no dependency on specific hardware

  30. New/Changed commands and options for Data Deduplication in TSM V6

  31. New Externals
      • DEFINE / UPDATE STGPOOL stgpoolname DEDUPlicate=No | Yes IDENTIFYPRocess=nn (default number of background processes)
      • The number of identify processes specified in the storage pool definition run indefinitely, or until you issue the IDENTIFY DUPLICATES command, update the storage pool definition again, or cancel the process.
      • The Identify process is different from other server processes. When other server processes finish a task, they end. When duplicate identification processes finish, they quiesce and go into an idle state.

  32. New Externals
      • IDentify DUPlicates stgpoolname DUration=mm (minutes to run process) NUMPRocess=nn (override stgpool setting)
      • When you set a lower number of identify processes than are currently running, TSM will finish processing the current file before the identify duplicates process completes.
      • When the duration expires and the identify process completes, the number of identify processes that are run reverts to the number in the storage pool definition.
      • The result of the command differs depending on whether the duration parameter is specified. See the following slides for reference. These examples assume you have a storage pool initially defined with identify processes set to a value of 3.

  33. Controlling Duplicate Identify Processes Manually
      If you had a critical restore executing, and you wanted to temporarily cut back on the identify processes for 2 hours, you could issue: Identify duplicates stgpool-name numpr=2 dur=120

  34. Controlling Duplicate Identify Processes Manually
      If you had only 1 hour left before reclamation starts, and you were not done with identify duplicates processing, you could issue, for example: Identify duplicates stgpool-name numpr=4 dur=60

  35. Controlling Duplicate Identify Processes Manually
      A suggestion for running identify after backups, assuming an 8-hour backup window: issue the following command at the beginning of backups: identify duplicates stgpool-name numpr=0 duration=480

  36. Controlling Duplicate Identify Processes Manually

  37. New Server Option
      • DEDUPREQUIRESBACKUP YES | NO
      • Indicates whether a volume in a PRIMARY DEDUP storage pool can be reclaimed before it is backed up to a non-deduplicated copy pool. Copying the data to an active data pool does not meet the backup requirement.
      • If YES, a volume in a dedup primary pool cannot be reclaimed until it has been backed up via BACKUP STGPOOL
          • YES is the default.
      • If NO, the reclamation criteria remain unchanged (same as previous reclamation criteria).
          • Setting this value to NO could cause unrecoverable data loss in the highly unlikely event that an object generates a false-positive match on another extent.
      • This option can be changed dynamically with the SETOPT command.

  38. New Server Option
      • NUMOPENVOLSALLOWED
      • Normally, TSM uses only 1 read volume (mount point) at a time.
      • To improve TSM performance for dedup objects, up to 10 volumes are now kept mountable for read access.
          • Reduces the amount of mount activity for chunks
          • Only used for dedup pools; does not apply to normal FILE pools.
      • Controlled via an option in dsmserv.opt or via the SETOPT command.
      • Each session or server process can have as many open FILE volumes as specified for this option.
      • Watch for mount point exceeded messages in the actlog. The MOUNTLIMIT value in the devclass for storage pools using data dedup should be set to at LEAST this value.

  39. Changed Commands/Output
      • Delete volume, delete filespace
          • The count of objects deleted is the number of “chunks” deleted, not the number of actual files. This may not match the number of objects shown by a ‘q occ’ command for a filespace with deduplicated data.
      • Q opt
          • Shows whether a file to be deduplicated first requires a backup of the file. (DedupRequiresBackup, default Yes)

  40. Changed Commands/Output
      • Q occ
          • Physical space occupied is not shown for storage pools that are deduplicated.
          • Shows the “Reporting occupancy” in the logical occupancy field. Reporting occupancy represents how much space this data would occupy if it weren’t in a data dedup pool.
          • Shows the actual number of files in the filespace.
          • If you turn off deduplication for a stgpool, no value is displayed in the “physical occupancy” field until there are no more deduplicated files in the storage pool.

  41. Changed Commands/Output
      • Select from occupancy
          • Physical space occupied has a value of 0.0 for storage pools that are deduplicated.
          • Lists fields “Logical_MB” and “Reporting_MB”. Reporting_MB represents how much space this data would occupy if it weren’t in a data dedup pool. Logical_MB is how much space is actually being used.

  42. Changed Commands
      • Q pr
          • Identify duplicate data processes are different from other TSM server processes. After they finish processing all available files, they go into idle state.

  43. Changed Commands/Output
      • Q stgpool f=d
          • Contains a new field, “Duplicate data not stored.” This represents the amount of data eliminated in the storage pool via data dedup. The % shown is the amount saved divided by the amount that would otherwise be in the pool.
          • Example: Let’s say that we are currently storing 20GB in a storage pool, and we didn’t store 10GB because of data deduplication. If we didn’t have data deduplication enabled, we would have needed 30GB of space. So, our % is 10/30 or 33%.
          • % Utilization reflects physical occupancy, which means it won't change until you reclaim the volume
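The percentage in the example above is the amount not stored divided by the total that would have been needed without deduplication. A minimal sketch of that arithmetic follows; the function name is made up for illustration and is not part of TSM.

```python
def duplicate_data_not_stored_percent(stored_gb, not_stored_gb):
    """Percent saved = amount not stored / (amount stored + amount not stored)."""
    would_have_needed_gb = stored_gb + not_stored_gb
    return 100.0 * not_stored_gb / would_have_needed_gb


# The slide's example: 20GB stored, 10GB not stored thanks to deduplication.
percent = duplicate_data_not_stored_percent(20, 10)
print(f"{percent:.0f}% duplicate data not stored")
```

The denominator is the 30GB that would have been needed, not the 20GB currently stored, which is why the result is 33% rather than 50%.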

  44. Changed Commands/Output
      • Q stgpool f=d

  45. New Command
      • Show deduppending stgpool-name
          • This command shows the amount of data that has been identified as duplicate data, but has not yet been processed by reclamation to delete it.

  46. Changed Behavior of Restores
      • NQR (no-query restore) of deduplicated data
          • A long delay might be seen before files are transferred to the client. This is because all files that are to be processed for restore are determined before the restore starts.

  47. Questions?