Centera Integration Training: Centera API Best Practices. Corporate Systems Engineering, July 2007. Agenda: Refresher on Centera API; Performance Optimizations; Metadata; Storage Strategies & ClipID Formats; SDK Options (Buffers / Failover); Managing Retention; Wrap Up and Questions.
EMC Centera – API Best Practices Centera Integration Training Centera API Best Practices Corporate Systems Engineering July 2007
Agenda
• Refresher on Centera API
• Performance Optimizations
• Metadata
• Storage Strategies & ClipID Formats
• SDK Options (Buffers / Failover)
• Managing Retention
• Wrap Up and Questions

API Background
• Understand the main Centera API concepts
  • Pools, Clips, Streams, Tags, Blobs, Query
• Appreciate the following points:
  • Open, read, write, query and delete transactions contact the Centera, while clip create, get/set attribute and close are local SDK operations
  • One transaction uses one socket, which is one thread of access
  • C-Clips only have relationships with blobs, never with other C-Clips
  • Every write operation carries a high transactional overhead, which affects the writing of small objects
• Understanding these points makes integration easier, more efficient and more effective!

Deploy In A Multi-Tier Architecture
[Diagram: clients running the application; application server stack (Application, API Library, OS); Centera running CentraStar Software]
• CentraStar Software resides on the Centera.
• API Library resides on the application server.
• Client application on PC connects to application server.
• Server application makes calls to Centera-supplied API.
• API interacts with Centera over TCP/IP using HPP protocol.

Transaction Example (Write)
[Diagram: Application Server (Application, API Library, Operating System) connected to Centera]
• The client sends a write (import) request to the server
• The server sends an acknowledgement to the client
• The client begins streaming its data to the server
• After the client has sent its last packet of data, it begins waiting for an ACK from the server
• The server distributes the data to the Storage Nodes, and updates its indices
• The server sends an acknowledgement to the client and the write operation returns

  Client                     Server
  Request  ------>
           <------  ACK
  Data     ------>
  Data     ------>
  Data     ------>
  Data     ------>
  (client waiting for ACK)
           <------  ACK

Performance Optimizations
• Multithreading
• Small Object Management
  • Embedded Blobs
  • Containerization
  • Hybrid Containerization
• Huge Object Management
  • Blob Slicing
• Pool Management

Optimizing Write Performance
[Figure: single-stream transaction timelines for a large object (~5 MB) and a small object (~50 KB), each made up of setup, data transfer and commit]
• With small objects, most of the total transaction time is spent in setup and commit, and hardly any in data transfer!

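The point of this slide can be put into numbers with a rough model. The sketch below assumes a fixed setup-plus-commit cost per transaction and a constant transfer rate; both figures are invented for illustration and are not measurements from a Centera.

```python
def overhead_fraction(size_bytes, fixed_overhead_s, bandwidth_bytes_per_s):
    """Fraction of one write transaction spent in setup and commit."""
    transfer_s = size_bytes / bandwidth_bytes_per_s
    return fixed_overhead_s / (fixed_overhead_s + transfer_s)

# Hypothetical figures, chosen only to show the shape of the effect.
FIXED_OVERHEAD_S = 0.05           # setup + commit per transaction
BANDWIDTH = 20 * 1024 * 1024      # 20 MB/s sustained transfer

for label, size in (("50 KB", 50 * 1024), ("5 MB", 5 * 1024 * 1024)):
    share = overhead_fraction(size, FIXED_OVERHEAD_S, BANDWIDTH)
    print(f"{label}: {share:.0%} of the transaction is setup and commit")
```

Under these assumptions the small object spends almost all of its time in fixed overhead, which is why the following slides focus on overlapping transactions and reducing the number of objects written.
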
Multithreaded Writes
• Multithreading provides overlapped I/O: more data is transferred in a given period of time.
[Figure: overlapping write timelines for Thread 1 to Thread 8]

Multithreading
• Advantages
  • Better utilization of available bandwidth
  • Overlapped I/O yields better throughput
  • Takes advantage of multiple access nodes
  • Shared PoolConnection for improved load balancing
• Disadvantages
  • More coding required
  • Multithreaded coding/debugging is generally trickier than single-threaded programming
  • Thread packages differ from platform to platform
  • Scales to a point, then rolls off

Multithreading
• Make the number of threads configurable individually for Read and Write
  • A good combined number is 20 threads per access node
  • This needs to be configured at install time
• For large numbers of threads, increase the value of FP_OPTION_MAXCONNECTIONS (default is 100)
• No application exists in a vacuum
  • Be conscious of workload imposed by other applications

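A minimal sketch of a configurable writer pool follows. It does not call the Centera SDK: `simulated_write` is a hypothetical stand-in whose sleep models per-transaction latency, and `WRITE_THREADS` stands for a value that would be read from install-time configuration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def simulated_write(payload: bytes) -> int:
    """Stand-in for one write transaction; the sleep models setup and commit latency."""
    time.sleep(0.01)
    return len(payload)

def write_all(payloads, write_threads):
    """Write every payload using a pool of `write_threads` worker threads."""
    with ThreadPoolExecutor(max_workers=write_threads) as pool:
        return list(pool.map(simulated_write, payloads))

WRITE_THREADS = 8   # hypothetical; would come from install-time configuration
payloads = [b"x" * 1024 for _ in range(40)]

start = time.perf_counter()
sizes = write_all(payloads, WRITE_THREADS)
elapsed = time.perf_counter() - start
print(f"wrote {len(sizes)} objects in {elapsed:.2f}s with {WRITE_THREADS} threads")
```

Keeping the thread count a parameter rather than a constant makes it possible to tune reads and writes separately and to back off when other applications share the cluster.
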
Managing Small Content - Object Count Limitations
• CentraStar 3.1 / Gen 4 has a limit of 50 million objects per node
• This count includes all types of Centera objects
  • CDF
  • Blobs
  • Mirror copies
  • Parity fragments
  • Reflections
• CDFs should be designed to fully utilise capacity (bytes) before these object count limits are encountered.
• Embedded Blobs cut object usage by at least 50%

Managing Small Content - Embedded Blobs
[Diagram: a CDF and a separate Blob]
• Whenever a write or read is done, two objects are transferred
• Writes
  • the Blob is written and added to the Tag in the CDF being constructed; when fully constructed, the CDF is written.
• Reads
  • the CDF is read, the application navigates to the Tag containing the content and the Blob is retrieved

Managing Small Content - Embedded Blobs
[Diagram: a CDF with the Blob embedded inside it]
• With embedded blobs, there is no separate blob, so all data is transferred as a single object when the CDF is read or written
• The SDK transparently stores the Blob as an Attribute on the Tag inside the CDF
  • Base64 encoded to adhere to the XML character set
• Developer does not write any “special” code other than enabling the feature
• I/O operations are reduced by at least half
  • only the CDF is read/written
  • proportionately greater savings if multiple blobs are stored in the CDF
• Can only be used for relatively small objects (< 100KB)

Managing Small Content - Embedded Blobs
• Embedded Blobs are easily enabled within an application
• Globally via an FPPool option
  • FPPool_SetGlobalOption(FP_OPTION_EMBEDDED_DATA_THRESHOLD, 100*1024)
  • Threshold (100KB in the example above, max is 100KB) is then used to determine how the data is stored
• Explicitly on the FPTag_BlobWrite call
  • FPTag_BlobWrite(theTag, theStream, FP_OPTION_LINK_DATA)
  • FPTag_BlobWrite(theTag, theStream, FP_OPTION_EMBED_DATA)
  • Overrides any Global setting that is in force

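The precedence described above (per-call option first, global threshold second) can be modelled as a small decision function. This is a sketch of the rule as stated on the slide, not SDK code: the function name, the string values, the treatment of a zero threshold as "disabled" and the strict less-than comparison are all assumptions.

```python
MAX_EMBED_BYTES = 100 * 1024   # maximum threshold stated on the slide

def storage_mode(blob_size, global_threshold=0, explicit=None):
    """Return "embed" or "link" for a blob of blob_size bytes.

    explicit models the per-call option ("embed", "link" or None);
    global_threshold models FP_OPTION_EMBEDDED_DATA_THRESHOLD, with 0
    assumed to mean that embedding is disabled.
    """
    if explicit is not None:
        return explicit                       # per-call option overrides the global setting
    threshold = min(global_threshold, MAX_EMBED_BYTES)
    # Assumption: blobs strictly smaller than the threshold are embedded.
    return "embed" if blob_size < threshold else "link"

print(storage_mode(50 * 1024, global_threshold=100 * 1024))
print(storage_mode(50 * 1024, global_threshold=100 * 1024, explicit="link"))
```

The model does not cover what the SDK does when embedding is requested explicitly for a blob above the maximum; consult the API guide for that case.
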
Managing Small Content - Embedded Blobs
• Advantages
  • Can dramatically decrease object count usage if multiple blobs are stored embedded in each CDF
  • Reduces I/Os (blob does not need to be read separately)
  • Easy to code
• Disadvantages
  • Single instance storage is lost
  • XML-compatible data encoding (Base64) increases storage requirement by 33%
  • Read performance can be impacted
    • All blob content is retrieved when opening the clip
    • The larger CDF takes longer to parse
  • Standard guidelines should be followed, i.e. CDF size < 10MB

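Both sides of this trade-off are easy to check with a few lines. The sketch below uses Python's standard `base64` module to measure the encoding growth and a hypothetical helper, `logical_objects`, to count CDFs plus separately stored blobs; mirror copies and parity fragments are deliberately left out of the count.

```python
import base64

def encoded_size(raw_bytes):
    """Bytes needed to hold raw_bytes of content after Base64 encoding."""
    return len(base64.b64encode(bytes(raw_bytes)))

def logical_objects(clips, blobs_per_clip, embedded):
    """CDFs plus separately stored blobs, before mirror or parity copies."""
    return clips if embedded else clips * (1 + blobs_per_clip)

raw = 60 * 1024
print(encoded_size(raw) / raw)                  # Base64 growth factor, about 1.33
print(logical_objects(1_000_000, 1, False))     # linked: one CDF and one blob per clip
print(logical_objects(1_000_000, 1, True))      # embedded: one CDF per clip
```

With one blob per clip the object count halves; with several blobs per clip the saving grows, at the cost of roughly a third more bytes stored.
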
Managing Small Content - Containerization
[Diagram: a Content Descriptor File holding an index into a Container Blob]

  Image name   Byte offset   Length
  1023.jpg     21078         2497

• Small objects are collected and inserted into a larger container object
• Each individual object's byte offset and length are stored in the metadata
• When an object is retrieved, a byte offset read is done through the API and only the small object is returned

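The indexing scheme can be sketched in memory as follows. The function names and sample data are hypothetical; in a real integration the container would be written as a single blob, the index would live in the clip metadata, and the final slice would be a partial blob read through the API.

```python
def build_container(objects):
    """Pack named byte strings into one container and index them.

    Returns (container_bytes, index) where index maps name -> (offset, length).
    """
    index = {}
    parts = []
    offset = 0
    for name, data in objects.items():
        index[name] = (offset, len(data))
        parts.append(data)
        offset += len(data)
    return b"".join(parts), index

def read_object(container, index, name):
    """Byte-offset read: return only the bytes of one small object."""
    offset, length = index[name]
    return container[offset:offset + length]

images = {"1021.jpg": b"A" * 300, "1022.jpg": b"B" * 120, "1023.jpg": b"C" * 75}
container, index = build_container(images)
print(index["1023.jpg"])   # prints (420, 75)
```

Because the index only records offsets into an immutable container, removing one object means rebuilding both the container and the index, which is the deletion cost listed among the disadvantages.
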
Managing Small Content - Containerization
• Advantages
  • Better utilization of available bandwidth
  • Much faster ingest of a large number of small pieces of content
  • Reduces the object count
• Disadvantages
  • More coding involved
  • Deleting an individual object requires re-writing and re-indexing the entire container
  • No Single Instance Storage
• Limited “use case”
  • Only applicable where huge numbers of small objects need to be stored in the same C-Clip
  • Size of CDF would become unmanageable / non-performant (100MB limit)
• Embedded Blobs strategy is preferable in most situations

Managing Small Content - Hybrid Containerization
[Diagram: a CDF containing embedded blobs Check1003, Check1004, Check1005 and Check1006]
• Combines aspects of Embedded Blobs and Containers
• “Containers” are constructed using multiple embedded blobs
• CDF effectively becomes the container
• Each blob still represents a single application-level object

Managing Small Content - Hybrid vs. Classic Containers
• Individual object indexing not required
• Local storage managed by SDK rather than application
• Application does not need to build a local container
• The CDF becomes the container
• Simplified deletion of individual objects from container
• Automatic Single Instance Storage for objects larger than the embedded blob threshold
• No code changes required

Managing Huge Content - Blob Slicing
• Write blob data to Centera using multiple threads provided by the application
• Enables increased performance at time of write
• No increase in performance for blobs < 5MB in size
• Segments are exported to Centera as if they are different blobs
• Segments are referenced by a single tag
• The same method as the internal 100MB blob segmentation feature

Managing Huge Content - Blob Slicing
• FPTag_BlobWritePartial(Tag, Stream, options, sequenceID)
• Sequence ID determines the sequence of data written by multiple threads for one tag
  • sets the order in which data is to be read back by the SDK
  • must be greater than 0
• Duplicate IDs on a tag are not allowed and will return an error
• Read-back is performed in ascending Sequence ID order
• Transparently supports FPTag_BlobRead() and FPTag_BlobReadPartial().

Managing Huge Content - Blob Slicing
• Does not operate with any embedded options
  • Linked data only
  • FP_OPTION_EMBED_DATA causes an error
  • FP_OPTION_EMBEDDED_DATA_THRESHOLD setting is ignored

Managing Huge Content - Blob Slicing
• FPStream_CreatePartialFileForInput(FilePath, Perm, BuffSize, Offset, Length)
• Similar to FPStream_CreateFileForInput
• Allows for a section of the input file to be written
• Two additional parameters
  • Offset where reading should begin within the file
  • Length of the segment to be read
• Transparently supports FPTag_BlobWrite() and FPTag_BlobWritePartial().

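The bookkeeping behind blob slicing (an offset and length per slice, plus a sequence ID that fixes the read-back order) can be sketched without the SDK. `plan_slices` and `reassemble` are hypothetical helpers; in a real application each planned slice would be handed to its own thread, which opens a partial input stream and writes it with its sequence ID.

```python
def plan_slices(file_size, slice_size):
    """Split a file into (sequence_id, offset, length) tuples.

    Sequence IDs start at 1 because they must be greater than 0.
    """
    if file_size <= 0 or slice_size <= 0:
        raise ValueError("file_size and slice_size must be positive")
    slices = []
    offset = 0
    sequence_id = 1
    while offset < file_size:
        length = min(slice_size, file_size - offset)
        slices.append((sequence_id, offset, length))
        offset += length
        sequence_id += 1
    return slices

def reassemble(written):
    """Model read-back: ascending sequence ID, regardless of write order."""
    ids = [seq for seq, _ in written]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate sequence ID on one tag")
    return b"".join(data for _, data in sorted(written))

data = bytes(range(256)) * 100            # 25,600 bytes of sample content
plan = plan_slices(len(data), 10_000)
# Simulate threads finishing out of order by reversing the plan.
written = [(seq, data[off:off + n]) for seq, off, n in reversed(plan)]
print(reassemble(written) == data)        # prints True
```

Deriving the sequence ID from the slice position, rather than from the order in which threads happen to finish, is what keeps the read-back order deterministic.
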
Pool Management – Creation process
FPPool_Open("10.0.0.1,10.0.0.2")
[Diagram: the first address is unreachable (X); Primary cluster followed by Replica1 and Replica2]
• Pool of sockets (and associated data structures) is allocated
• Probe packet is sent to the first address provided
  • if a response is not received before timeout (configurable, default is 2 minutes), subsequent addresses will be tried in sequence
• Response is received from another access node (AN), which contains a list of all known ANs ordered by load
• The replica for the cluster is probed (with the same default timeout of 2 minutes)
• The entire replica chain is walked
• Allowing for timeouts, this could be a lengthy process

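To see how quickly the timeouts add up, the sequential probing can be modelled in a few lines. This is a simplified model, not the SDK's algorithm: `probe_in_sequence` is hypothetical, it counts only the time lost to probes that never answer, and it ignores the replica chain.

```python
PROBE_TIMEOUT_S = 120   # default timeout quoted above (2 minutes)

def probe_in_sequence(addresses, reachable, timeout_s=PROBE_TIMEOUT_S):
    """Return (first responding address or None, seconds lost to timeouts)."""
    waited = 0
    for address in addresses:
        if address in reachable:
            return address, waited
        waited += timeout_s          # this probe timed out; try the next one
    return None, waited

print(probe_in_sequence(["10.0.0.1", "10.0.0.2"], reachable={"10.0.0.2"}))
# prints ('10.0.0.2', 120)
```

A single dead address at the front of the list already costs two minutes under the default timeout, before any replica is probed, which motivates opening the pool once and reusing it.
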
Pool Management – Recommended Strategy
[Diagram: one FPPool_Open, followed by many Read and Write operations, followed by one FPPool_Close]
• FPPool_Open should be called once when the application is started
• Subsequent Centera I/O should be done using this single Pool Reference
• When the application shuts down, call FPPool_Close

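One way to enforce the open-once rule in a multithreaded application is to hide the pool behind a lock-protected accessor. The sketch below uses a hypothetical `PoolHandle` class in place of the SDK's pool reference, purely to show the pattern.

```python
import threading

class PoolHandle:
    """Hypothetical stand-in for an open pool reference; not the SDK type."""
    opens = 0

    def __init__(self, addresses):
        PoolHandle.opens += 1          # count how often the costly open runs
        self.addresses = addresses
        self.closed = False

    def close(self):
        self.closed = True

_pool = None
_pool_lock = threading.Lock()

def get_pool(addresses="10.0.0.1,10.0.0.2"):
    """Open the pool on first use and hand the same reference to every caller."""
    global _pool
    with _pool_lock:
        if _pool is None:
            _pool = PoolHandle(addresses)
        return _pool

def shutdown():
    """Close the shared reference once, when the application exits."""
    global _pool
    with _pool_lock:
        if _pool is not None:
            _pool.close()
            _pool = None

threads = [threading.Thread(target=get_pool) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(PoolHandle.opens)   # prints 1
shutdown()
```

Every worker thread receives the same reference, so the expensive probing happens once per application run rather than once per transaction.
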
Centera in a Multi-Application Repository Environment
[Diagram: Application #1 (ECM), Application #2 (Email Archiving) and Application #3 (PC Backup) all storing to a single Centera]

Virtual Pools
• How can we partition data access?
[Diagram: a Cluster Pool divided into App Pool 1, App Pool 2, App Pool 3 and the Default Pool, each holding its own Blobs and CDFs]

Virtual Pools
• Virtual Pools implement a form of Data Partitioning
• Think of Virtual Pools as Virtual Centeras within a physical Centera cluster
• Applications should connect to Virtual Pools through Access Profiles and Capabilities

Metadata
• Metadata is the key to disaster recovery
• Rich metadata helps to identify individual pieces of content given local repository failure
• Clip level metadata can be retrieved as part of a Query Result set when performing Disaster Recovery
  • FPQueryExpression_SelectField(myQueryExp, "aClipAttribute");
• CenteraSeek relies on metadata
  • capability to “Google” the Centera
  • this is not what Query is intended for!
  • chargeback reporting
  • uses metadata only, not the document content itself
  • all types of Metadata can be queried
    • Standard SDK metadata e.g. creation.date
    • Clip level metadata e.g. ApplicationName (attribute added by application)
    • Tag level metadata e.g. InvoiceNumber (attribute added by application)

Metadata - Sample CDF
<?xml version='1.0' encoding='UTF-8' standalone='no'?>
<ecml version="3.0">
  <eclipdescription>
    <meta name="type" value="Standard"/>
    <meta name="name" value="TrainingClip"/>
    <meta name="creation.date" value="2004.08.05 09:31:19 GMT"/>
    <meta name="modification.date" value="2004.08.05 09:35:51 GMT"/>
    <meta name="numfiles" value="1"/>
    <meta name="totalsize" value="2082"/>
    <meta name="refid" value="5DD0B54HG7OCG3UTUGV1FP004Q"/>
    <meta name="prev.clip" value=""/>
    <meta name="clip.naming.scheme" value="MD5"/>
    <meta name="numtags" value="1"/>
    <meta name="sdk.version" value="3.0.377"/>
    <custom-meta name="ApplicationName" value="MetadataExample"/>
    <custom-meta name="ApplicationVendor" value="EMC_Engineering"/>
    <custom-meta name="ApplicationVersion" value="3.4"/>
  </eclipdescription>
  <eclipcontents>
    <myMainTag someMeaningfulAttribute="aValue" anotherAttribute="andAnotherValue">
      <eclipblob md5="DM6JEBLJFH9I" size="694327" offset="0"/>
    </myMainTag>
  </eclipcontents>
</ecml>

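The three levels of metadata in a CDF (standard, clip level and tag level) can be picked apart with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a trimmed copy of a CDF; it is an illustration of the document structure only, since an application would normally read these values through the SDK's attribute calls rather than by parsing XML itself.

```python
import xml.etree.ElementTree as ET

# Trimmed CDF used as sample input for this sketch.
CDF = """<ecml version="3.0">
  <eclipdescription>
    <meta name="name" value="TrainingClip"/>
    <meta name="creation.date" value="2004.08.05 09:31:19 GMT"/>
    <custom-meta name="ApplicationName" value="MetadataExample"/>
    <custom-meta name="ApplicationVersion" value="3.4"/>
  </eclipdescription>
  <eclipcontents>
    <myMainTag someMeaningfulAttribute="aValue">
      <eclipblob md5="DM6JEBLJFH9I" size="694327" offset="0"/>
    </myMainTag>
  </eclipcontents>
</ecml>"""

def clip_metadata(cdf_xml):
    """Return (standard, custom, tags) metadata dictionaries from a CDF string."""
    root = ET.fromstring(cdf_xml)
    desc = root.find("eclipdescription")
    standard = {m.get("name"): m.get("value") for m in desc.findall("meta")}
    custom = {m.get("name"): m.get("value") for m in desc.findall("custom-meta")}
    tags = {t.tag: dict(t.attrib) for t in root.find("eclipcontents")}
    return standard, custom, tags

standard, custom, tags = clip_metadata(CDF)
print(standard["creation.date"], custom["ApplicationName"], tags["myMainTag"])
```

Attributes the application adds at clip level appear as `custom-meta` elements, while tag-level attributes sit on the application's own tag, so both survive alongside the content they describe.
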
Disaster Recovery
[Diagram: Application Server, DB and Centera; a C-Clip ID (EFG242769LH32e57R23E2IBC4FE) is emailed to the DBA]
• What happens if local Content Address repository is lost?
• Protect the relationship between the archive and the local store
  • Metadata assists in local store reconstruction using Query
• Store the Transaction Logs or Incremental database backups on Centera
  • Email resulting C-Clip IDs to DBA or store on Profile Clip
  • Use a separate Virtual Pool exclusively for these backups
  • Logs / Backups can be easily retrieved in DR scenario

Storage Strategy Capacity (SSC)
• Allows for Single Instance Storage
  • prevents identical content being stored numerous times
• Uses M or M++ naming schemes
  • M++ performs additional SHA-256 calculation (performance overhead)
  • Reduces collision potential
• Content Address is 27 (M) or 53 (M++) bytes long
  • M++ CA incorporates part of the SHA-256 calculated ID

Storage Strategy Performance (SSPP / SSPF)
• Blobs below the threshold (default 250 KB) are written using a 53 byte Content Address incorporating a Time element
• Two types are available
  • Partial (SSPP)
    • C-Clips retain the standard 27 byte Content Address
    • Should only be used by applications which cannot handle a 53 byte CA
  • Full (SSPF)
    • C-Clips and Blobs both use the 53 byte CA
• Always use FP_OPTION_CLIENT_CALCID_STREAMING as content cannot pre-exist due to the change in the Content Address
• When using Generic Streams of unknown size, set FP_OPTION_PREFETCH_SIZE >= SSP threshold to ensure the correct strategy is used
• Use the SDK defined constant for Content Address Length to ensure compatibility with future naming schemes

SDK Options
• The SDK allows for configuration of many options to control different aspects of PoolConnection behaviour
  • Buffer sizes
  • Failover strategies
  • Storage strategies
  • Timeouts / Retry limits
• Two main types of options
  • Global options which apply to all PoolReferences created by the application
  • Local options which affect individual PoolReference instances

Buffer Sizes
• FP_OPTION_BUFFERSIZE
  • Size of the CDF buffer in bytes.
  • Will swap out to disc if CDF is larger than the buffer
  • Default 16KB / Min 1KB / Max 10MB
  • Set this value to exceed the total CDF size (XML + Blob data) to avoid swapping to disc
  • e.g. (150 * 1024) for a C-Clip with a single embedded blob
• FP_OPTION_PREFETCH_SIZE
  • Size of the temporary buffer used by the SDK to assist with content length based storage decisions
    • Storage Strategy Performance or Capacity
    • Parity or Mirroring
  • Default 32 KB / Maximum 1 MB.
  • If using Generic Streams of unknown length, set this value to 1MB

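A rough way to arrive at a figure such as 150 * 1024 is to add the Base64-encoded size of the embedded blobs to an allowance for the clip's own XML, then clamp to the documented limits. The helper below is a hypothetical sizing aid, and the 4 KB XML allowance is a guess that should be replaced by a measured value.

```python
import math

MIN_BUFFER = 1 * 1024             # limits quoted on the slide
MAX_BUFFER = 10 * 1024 * 1024

def suggested_buffer_size(blob_sizes, xml_overhead=4 * 1024):
    """Buffer large enough for the CDF XML plus Base64-encoded embedded blobs.

    xml_overhead is an assumed size for the clip's own XML; measure it
    for a real application.
    """
    encoded = sum(4 * math.ceil(size / 3) for size in blob_sizes)
    wanted = encoded + xml_overhead
    return max(MIN_BUFFER, min(wanted, MAX_BUFFER))

size = suggested_buffer_size([100 * 1024])
print(size, size <= 150 * 1024)
```

For a single 100 KB embedded blob the estimate lands below 150 * 1024, which is consistent with the example value given for FP_OPTION_BUFFERSIZE.
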
Failover Strategies
• Can be enabled or disabled for all operations
  • FP_OPTION_ENABLE_MULTICLUSTER_FAILOVER
  • Network failover occurs when connectivity to the Primary cluster is lost
  • Content failover occurs when an operation (read, write, delete, query or exists) relating to a particular C-Clip does not return success
• Can be configured separately for each operation
  • FP_OPTION_MULTICLUSTER_XXXX_STRATEGY
  • Three possible strategies
    • FP_NO_STRATEGY / FP_FAILOVER_STRATEGY / FP_REPLICATION_STRATEGY
• NB: Not all strategies are supported for all operations and default behavior differs.

Storage Strategies
• FP_OPTION_DEFAULT_COLLISION_AVOIDANCE
  • Enable (by supplying value FP_TRUE) if
    • Single Instance Storage is unlikely to be of value
    • Even a remote chance of a collision is unacceptable
  • can also be used as an option on FPTag_BlobWrite to control behaviour at the Object (rather than Pool) level
• FP_OPTION_EMBEDDED_DATA_THRESHOLD
  • Already covered in depth

Setting Options as Environment Variables
• Any pool option may be set as an environment variable
  • When set in this way, all options behave as if they are GlobalOptions
  • e.g. set FP_OPTION_OPENSTRATEGY="FP_LAZY_OPEN"
• Options set within the application code take precedence over those set in the environment
• Benefits:
  • Allows customers to adopt options that are not used by the developer/ISV
  • Increases options for troubleshooting applications in the field
  • Can sometimes enable immediate bug fixes in the field as the developer/ISV works on a patch release

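The precedence rule (application code over environment) amounts to a simple merge. The sketch below models it with dictionaries; `effective_options` is a hypothetical helper, and the SDK performs this resolution internally rather than exposing such a function.

```python
import os

def effective_options(code_options, environ=None):
    """Merge option sources; values set in code override the environment."""
    environ = os.environ if environ is None else environ
    merged = {k: v for k, v in environ.items() if k.startswith("FP_OPTION_")}
    merged.update(code_options)          # code wins on any conflict
    return merged

env = {
    "FP_OPTION_OPENSTRATEGY": "FP_LAZY_OPEN",   # only set in the environment
    "FP_OPTION_BUFFERSIZE": "16384",            # also set in code below
    "PATH": "/usr/bin",                         # unrelated variable, ignored
}
print(effective_options({"FP_OPTION_BUFFERSIZE": "153600"}, env))
```

An option the application never sets can still be introduced in the field through the environment, while anything the application sets explicitly stays under the developer's control.
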
Authorized Capabilities
• Identifies the Authorized operations that can be performed by an Application connecting to a Pool / Profile combination
  • Read, Write, Delete, Exists, Query, Privileged Delete, Monitor
  • Purge capability is deprecated and should not be used
• Applications should proactively determine which capabilities are available (see FPPool_GetCapability() in API Guide)
  • Configure the application user interface / options accordingly

Managing Retention
• C-Clips can have an associated Retention Period or Retention Class
• Mandatory setting of retention can be enforced at the pool level.
• If the application does not require Retention, the Retention Period should be explicitly set to zero.
• Do not rely on the default Retention Period of the cluster
  • Default on CE+ is infinity!
• Retention is applied to the whole CDF rather than individual Tags.
  • A C-Clip should contain objects with the same retention requirements
• Use Retention Classes for well defined “common” retention Periods (typically set by a Regulatory Body)
  • Allows “painless” update should the period change
• CentraStar 3.1 introduced Advanced Retention features
  • Separate license
  • Event Based Retention / Litigation Hold / Retention Governors

Managing Retention
• Retention is calculated from the creation.date timestamp
  • set when FPClip_Create is called
• Retention can be changed only by creating a new clip
  • existing C-Clip is opened
  • updates are made e.g. retention period / class changed
  • FPClip_Write is called and a new Content Address is returned
  • old Content Address is saved in prev.clip
  • new creation timestamp is set in modification.date
• Important: a new clip is created, so you must either
  • Delete the old one immediately using DELETE_PRIVILEGED
    • Not available on CE+ editions
  • Or “Walk” the clip chain and delete when *all* the retentions have expired

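Walking the clip chain can be sketched with a small in-memory model. The dictionary layout, the content addresses and the rule that each clip's retention runs from its own `created` timestamp are assumptions made for illustration; a real application would open each C-Clip through the SDK and read its retention and previous-clip values.

```python
from datetime import datetime, timedelta, timezone

def walk_chain(clips, address):
    """Yield clips from the newest back to the original via their prev links."""
    while address:
        clip = clips[address]
        yield clip
        address = clip["prev"]

def chain_deletable(clips, newest, now):
    """True only when retention has expired on every clip in the chain."""
    return all(
        clip["created"] + timedelta(days=clip["retention_days"]) <= now
        for clip in walk_chain(clips, newest)
    )

UTC = timezone.utc
# Hypothetical chain: the newer clip shortened retention from 7 years to 1 year.
clips = {
    "CA-OLD": {"created": datetime(2004, 8, 5, tzinfo=UTC), "retention_days": 7 * 365, "prev": ""},
    "CA-NEW": {"created": datetime(2005, 1, 10, tzinfo=UTC), "retention_days": 365, "prev": "CA-OLD"},
}
print(chain_deletable(clips, "CA-NEW", datetime(2006, 6, 1, tzinfo=UTC)))   # prints False
print(chain_deletable(clips, "CA-NEW", datetime(2012, 1, 1, tzinfo=UTC)))   # prints True
```

In the first check the new clip has expired but the original has not, so the chain must be kept; this is the situation that arises whenever the old clip was not removed with a privileged delete.
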
Application Registration
• Applications should register via FPPool_RegisterApplication() prior to calling FPPool_Open()
• For each pool connection established, the Centera records:
  • Application name / version (as specified in the API call)
  • Hostname
  • Hardware platform
  • Operating system
  • Profile used to connect
  • SDK version
• The Application Name supplied to the API call should be constant
  • do not use argv[0]!
• Information is retained for all application instances which have authenticated for a period equal to the 'audit log retention period'.

Questions?
• Please visit the SDK Forums on the Centera Developer’s Portal
  • http://lighthouse.emc.com
• Corporate Systems Engineering team provides:
  • Info, Training, Whitepapers
  • Design Advice / Design Reviews
  • Debugging Support
  • Code Reviews
• Email Support Available from:
  • CenteraCSE@emc.com