400 likes | 1.13k Views
Data Warehousing: Architecture and Data Flows. Mariela Caballero Topic paper #32 CS534. Overview. Data Warehouse Architecture Major Components Data Warehouse Data Flows Its five main data flows Conclusion Quiz. Typical Components of a Data Warehouse Architecture.
E N D
Data Warehousing:Architecture and Data Flows Mariela Caballero Topic paper #32 CS534
Overview Data Warehouse Architecture Major Components Data Warehouse Data Flows Its five main data flows Conclusion Quiz
Typical Components of a Data Warehouse Architecture Operational data sources, Operational datastore(ODS), Load Manager, Warehouse Manager, Query Manager, Detailed Data, Lightly & Highly Summarized Data, Archive & Back up Data, Meta Data, & End-User Access Tools
Operational data • Without source system, there would be no data • The data sources for the data warehouse are supplied as follow: • Operational data held in network databases • Departmental data held in file systems • Private data held on workstaions and private serves and external systems such as the Internet, commercially available DB, or DB assoicated with and organization’s suppliers or customers
Operational datastore(ODS) • Is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse • ODS objectives: to integrate information from day-to-day systems and allow operational lookup to relieve day-to-day systems of reporting and current-data analysis demands • ODS can be helpful step towards building a data warehouse because ODS can supply data that has been already extracted from the source systems and cleaned
Load Manager • Called the frontend component • Performs all the operations associated with the extraction and loading of data into the warehouse • These operations include simple transformations of the data to prepare the data for entry into the warehouse • The data is extracted from the operational systems directly or from the operational datastore (more common) and then to the data warehouse • Size and complexity will vary between data warehouses and may be constructed using a combination of vendor data loading tools and custom-built programs.
Warehouse Manager • Performs all the operations associated with the management of the data in the warehouse as follows: • Analysis of data to ensure consistency • Transformation and merging of source data from temporary storage into the data warehouse tables • Creation of indexes and views • Backing-up and archiving data • Constructed using vendor data management tools and custom-built programs. • Generates query profiles to determine which indexes and aggregations are appropriate
Query Manager • Called backend component • Performs all the operations associated with the management of user queries • Directing queries to the appropriate tables and scheduling the execution of queries • Constructed using vendor end-user access tools, data warehousing monitoring tools, database facilities and custom built programs • Query manager complexity depends on the end-user access tools and database
Detailed Data • Stores all the detailed data in the database schema • On a regular basis, detailed data is added to the warehouse to supplement the aggregated data
Lightly and Hightly Summarized Data • Stores all the pre-defined lightly and highly aggregated data generated by the warehouse manager • Transient as it will be subject to change on an on-going basis in order to respond to changing query profiles • The purpose of summary information is to speed up the performance of queries - costly • On the other hand, it removes the requirement to continually perform summary operations (such as sort or group by) in answering user queries • The summarized data is updated continuously as new data is loaded into the warehouse
Archive/Backup Data • Stores detailed and summarized data for the purposes of archiving and backup • May be necessary to backup online summary data if this data is kept beyond the retention period for detailed data • The data is transferred to storage archives such as magnetic tape or optical disk
Meta Data • This area of the warehouse stores all the metadata definitions used by all the processes in the warehouse • Meta-Data is used for a variety of purposes: • Extraction and loading processes • Metadata is used to map data sources to a common view of information within the warehouse • Warehouse management process • Used to automate the production of summary tables • Query management process • Used to direct a query to the most appropriate data source • End-user access tools use metadata to understand how to build a query
End-user Access Tools • Users interact with the warehouse using end-user access tools • Can be categorized into five main groups • Data reporting and query tools – (Query by Example –MS Access DBMS) • Application development tools (application used to access major DBS –Oracle, sybase..) • Executive information system (EIS) tools (For sales, marketing and finance) • Online analytical processing (OLAP) tools (Allow users to analyze the data using complex and multidimentional views-from multiple databases) • Data mining tools (allow the discovery of new patterns and trend by mining a large amount of data using statistical, mathematical tools)
Data Warehousing: Data flows Inflow, Upflow, Downflow, Outflow and Metaflow
Inflow • The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse • Cleaning include removing inconsistencies, adding missing fields, and cross-checking for data integrity • Transformation include adding date/time stamp fields, summarizing detailed data, deriving new fields to store calculated data • Extract the relevant data from multiple, heterogeneous, and external sources (commercial tools are used) • Then mapped and loaded into the warehouse
Upflow • The process associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data • Summarizing the data works by choosing, projecting, joining, and grouping relational data into views that are more convenient and useful to the end users. Summarizing data goes beyond simple relational operations to involves sophistacated statistical analysis including identifying trends, clustering, and sampling the data • Packeging the data involves converting the detailed or summarized information into more useful formats, such as spreadsheets, test documents, charts, other graphical presentations, private databases, and animation. • Distribute the data in appropiate groups to increase its availability and accessibility
Downflow • The processes associated with archiving and backing-up of data in the warehouse • Archiving the effectiveness and performace maintanance is achieved by transferring the older data of limited value to storage archivers such as magnetic tapes, optical disk or digital storage devices • If the databases in a warehouse are very big, partitioning is a useful design option which enables the fragmentation of a table storing enournous number of records into smaller tables. Thus, preserving data warehouse performance • The downflow of data includes the processes to ensure that the current state of the data warehouse can be rebuilt following data loss, or software/hardware failures. Archived data should be stored in a way that allows the re-establishement of the data in the warehouse when required
Outflow • Involves the process associated with making the data availabe to the end-users • This involves two activities such as data accessing and delivering • Data accessing is concerned with satisfying the end users’s requests for the data they need. The main problem here is the creation of an environment so that the users can effectively use the query tools to access the most appropiate data source. • Delivering activity makes possible the information delivery to the user’s systems/workstations. This activity is referred to as a type of ’’publish-and-subscribe” process. Data warehouse publishes several ’business objects’ that are revised periodically by monitoring usage patterns. Users subcriber to the set of business objects that best meets their needs.
Metaflow • Meta-flow is a description of the data contents of the data warehouse, what is in it, where it came from originally, and what has been done to it by way of cleansing, integrating, and summarizing • Managing the metadata (data about the data)
Conclusion • The major components of a data warehouse are: Operational data sources, Operational datastore(ODS), Load Manager, Warehouse Manager, Query Manager, Detailed Data, Lightly & Highly Summarized Data, Data Archive & Back up, Meta Data and End Users tools • The Data Flows of Data Warehouse: Inflow, Upflow, Downflow, Outflow and Metaflow
References 1. Database Systems - A Practical Approach to Design, implementation and Management by Harry Singh 2. Interactive Data Warehousing, by Harry Singh
Quiz Time!! • What architecture component is responsable for all the operations associated with the extraction and loading of data into the warehouse? • Load Manager • Query Manager • End-user Access Tool • Meta Data • None of these • What are the main Data flows in a data warehouse? • Onflow, Upflow, Offlow, Outflow and Meta-flow • Inflow, Backflow, Frontflow, Outflow and Meta-flow • Inflow, Upflow, Downflow, Outflow and Meta-flow • Toflow, Fromflow , Upflow, Downflow • None of these • What is the Data flow associated with making the data availabe to the end-users? • Outflow • Inflow • Downflow • Toflow • None of these
Quiz Time!! What architecture component is responsable for all the operations associated with the extraction and loading of data into the warehouse? Load Manager Query Manager End-user Access Tool Meta Data None of these What are the main Data flows in a data warehouse? Onflow, Upflow, Offlow, Outflow and Meta-flow Inflow, Backflow, Frontflow, Outflow and Meta-flow Inflow, Upflow, Downflow, Outflow and Meta-flow Toflow, Fromflow , Upflow, Downflow None of these What is the Data flow associated with making the data availabe to the end-users? Outflow Inflow Downflow Toflow None of these