500 likes | 729 Views
Building the Data Warehouse: Transforming Data. Objectives. After completing this lesson, you should be able to do the following: Define transformation Identify possible staging models Identify data anomalies and eliminate them Explain the importance of quality data
E N D
Objectives • After completing this lesson, you should be able to do the following: • Define transformation • Identify possible staging models • Identify data anomalies and eliminate them • Explain the importance of quality data • Describe techniques for transforming data • Design transformation process • List Oracle’s enhanced features and tools that can be used to transform data
Transformation • Transformation eliminates anomalies from operational data: • Cleans and standardizes • Presents subject-oriented data Transform: Clean up Consolidate Restructure Extract Warehouse Operationalsystems Load Data Staging Area
Possible Staging Models • Remote staging model • Onsite staging model
Operationalsystem Operationalsystem Warehouse Warehouse Remote Staging Model Data staging area within the warehouse environment Transform Extract Load Staging area Data staging area in its own environment Transform Extract Load Staging area
On-site Staging Model • Data staging area within the operational environment,possibly affecting the operational system Transform Extract Load Operational system Staging area Warehouse
Data Anomalies • No unique key • Data naming and coding anomalies • Data meaning anomalies between groups • Spelling and text inconsistencies
Transformation Routines • Cleaning data • Eliminating inconsistencies • Adding elements • Merging data • Integrating data • Transforming data before load
Transforming Data: Problems and Solutions • Multipart keys • Multiple local standards • Multiple files • Missing values • Duplicate values • Element names • Element meanings • Input formats • Referential Integrity constraints • Name and address
Product code = 12M654313 45 Salesperson code Country code Sales territory Productnumber Multipart Keys Problem • Multipart keys
Multiple Local Standards Problem • Multiple local standards • Tools or filters to preprocess cm DD/MM/YY 1,000 GBP inches MM/DD/YY FF 9,990 cm DD-Mon-YY USD 600
Multiple Files Problem • Added complexity of multiple source files • Start simple Multiple source files Logic to detectcorrect source Transformed data
Missing Values Problem • Solution: • Ignore • Wait • Mark rows • Extract when time-stamped If NULL thenfield = ‘A’
Duplicate Values Problem • Solution: • SQL self-join techniques • RDMBS constraint utilities ACME Inc ACME Inc ACME Inc SQL> SELECT ... 2 FROM table_a, table_b 3 WHERE table_a.key (+)= table_b.key 4 UNION 5 SELECT ... 6 FROM table_a, table_b 7 WHERE table_a.key = table_b.key (+);
Element Names Problem • Solution: • Common naming conventions Customer Client Customer Contact Name
Element Meaning Problem • Avoid misinterpretation • Complex solution • Document meaning in metadata Customer’s name All customer details All details except name Customer_detail
Input Format Problem EBCDIC ASCII “123-73” 12373 ACME Co. áøåëéí äáàéí Beer (Pack of 8)
Referential Integrity Problem • Solution: • SQL anti-join • Server constraints • Dedicated tools
Name and Address Problem • Single-field format • Multiple-field format Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565
Name and Address Processing in Oracle9i Warehouse Builder • Name and address mapping operator supports: • Parsing • Standardization • Postal matching and geocoding
Quality Data: Importance and Benefits • Quality data: • Key to a successful warehouse implementation • Quality data helps you in: • Targeting right customers • Determining buying patterns • Identifying householders: private and commercial • Matching customers • Identify historical data
Quality: Standards and Improvements • Setting standards: • Define a quality strategy • Decide on optimal data-quality level • Improving operational data quality: • Consider modifying rules for operational data • Document the sources • Create a data stewardship program • Design the cleanup process carefully • Initial cleanup and refresh routines may differ
Data Quality Guidelines • Operational data: • Should not be used directly in the warehouse • Must be cleaned for each increment • Is not simply fixed by modifying applications
Data Quality: Solutions and Management • Solutions: • COBOL, Java, 4GL • Specialized tools • Customized data conversion process • Investigation • Conditioning and Standardization • Integration • Management: • Take responsibility • Resolve problems • Data quality manager
Transformation Techniques • Merging data • Adding a Date Stamp • Adding Keys to Data
Merging Data • Operational transactions do not usually map one-to-one with warehouse data. • Data for the warehouse is merged to provide information for analysis.
Adding a Date Stamp • Time element can be represented as a: • Single point in time • Time span • Add time element to: • Fact tables • Dimension data
Product Table Product_id Time_key Product_desc Store Table Store_id District_id Time_key Sales Fact Table Item_id Store_id Time_key Sales_dollars Sales_units Time Table Week_id Period_id Year_id Time_key Item Table Item_id Dept_id Time_key Adding a Date Stamp:Fact Tables and Dimensions
#1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 #3 Sale 1/2/98 12:00:02 Anchovy Pizza $12.00 #4 Return 1/2/98 12:00:03 Anchovy Pizza - $12.00 #5 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 Data values or artificial keys #dw1 Sale 1/2/98 12:00:01 Ham Pizza $10.00 #dw2 Sale 1/2/98 12:00:02 Cheese Pizza $15.00 #dw3 Sale 1/2/98 12:00:04 Sausage Pizza $11.00 Adding Keys to Data
Summarizing Data 1. During extraction on staging area 2. After loading to the warehouse server Operational databases Staging area Warehouse database
Sources Stage Rules Publish Extract Transform Load Query Maintaining Transformation Metadata • Transformation metadata contains: • Transformation rules • Algorithms and routines
Maintaining Transformation Metadata • Restructure keys • Identify and resolve coding differences • Validate data from multiple sources • Handle exception rules • Identify and resolve format differences • Fix referential integrity inconsistencies • Identify summary data
Data Ownership and Responsibilities • Data ownership and responsibilities should be shared by the: • Operational team • Data warehouse team • Business benefit gained with “work together” approach
Transformation Timing and Location • Transformation is performed: • Before load • In parallel • Can be initiated at different points: • On the operational platform • In a separate staging area
Choosing a Transformation Point • Workload • Impact on environment • CPU usage • Disk space • Network bandwidth • Parallel execution • Load window time • User information needs
Monitoring and Tracking • Transformations should: • Be self-documenting • Provide summary statistics • Handle process exceptions
Designing Transformation Processes • Analysis: • Sources and target mappings, business rules • Key users, metadata, grain • Design options: • Third-party tools • Custom 3GL programs • 4GLs like SQL or PL/SQL • Replication • Design issues: • Performance • Size of the staging area • Exception handling, integrity maintenance
Transformation Tools • Third-party tools • SQL*Loader • In-house developed programs
Datawarehouse Oracle’s Enhanced Featuresfor Transformation • Transformation methods Loading intostaging tables Stagingtable 1 Flat Files Transformdata Validatedata Stagingtable 2 Merge intowarehouse tables Stagingtable 2 Multi stage Transformation
External tables Externaltable Flat Files Merge intowarehouse tables Warehousetables Table functions Validatedata Transformdata Oracle’s Enhanced Featuresfor Transformation • Transformation methods Pipelined Transformation
Oracle’s Enhanced Featuresfor Transformation • Transformation mechanisms • Using SQL: • CREATE TABLES AS SELECT (CTAS) • UPDATE • MERGE • Multitable INSERT Cust Customer Existingrowupdated Merge New rowinserted
Condition Oracle’s Enhanced Featuresfor Transformation • Transformation mechanisms • Multitable INSERT Sourcetable Targettable 2 Targettable 3 Targettable 1
Oracle’s Enhanced Featuresfor Transformation • Transformation mechanisms (continued) • Using PL/SQL: • Used for complex transformations • Using Table Functions: Table Functions can: • Return multiple rows from a function • Accept results of multiple row SQL subqueries as input • Take cursors as input • Be parallelized • Support incremental pipelining
Summary • In this lesson, you should have learned how to: • Define transformation • Identify possible staging models • Identify data anomalies and eliminate them • Explain the importance of quality data • Describe techniques for transforming data • Design transformation process • List Oracle’s enhanced features and tools that can be used to transform data
Practice 6-1 Overview • This practice covers the following topics: • Answering a series of questions based on the business scenario for Frontier Airways • Answering a series of short questions