Introduction to Data Warehousing Randy Grenier Rev. 8/12/2013
Contents • History • OLTP vs. OLAP • Paradigm Shift • Architecture • Emerging Technologies • Questions
History: Hollerith Cards Once upon a time… • Reporting was done with data stored on Hollerith cards. • A card contained one record. • Maximum length of the record was 80 characters. • A data file consisted of a stack of cards. • Data was “loaded” each time a report was run. • Data had no “modeling” per se. There were just sets of records. • Programs in languages such as COBOL, RPG, or BASIC would: • Read a stack of data cards into memory • Loop through records and perform a series of steps on each (e.g. increment a counter or add an amount to a total) • Send a formatted report to a printer • It was difficult to report from multiple record types. • Changes to data were implemented by simply adding, removing or replacing cards.
History: Hollerith Cards FACTOIDS • Card type: IBM 80-column punched card • A/K/A: “Punched Card”, “IBM Card” • Size: 7 3⁄8 by 3 1⁄4 inches • Thickness: .007 inches (143 cards per inch) • Capacity: 80 columns with 12 punch locations each
History: Magnetic Tape Cards were eventually replaced by magnetic tape. Tapes made data easier to load and storage more efficient. Records were stored sequentially, so individual records could not be accessed quickly. Data processing was still very similar to that of cards: load a file into computer memory and loop through the records to process. Updating data files was still difficult.
History: Disk Storage The arrival of disk storage revolutionized data storage and access. It was now possible to have a home base for data: a database. Data was always available: online. Direct access replaced sequential access, so data could be accessed more quickly. Data stored on disk required new methods for adding, deleting, or updating records. This type of processing became known as Online Transaction Processing (OLTP). Reporting from a database became known as Online Analytical Processing (OLAP).
History: Disk Storage FACTOIDS • Storage Device: IBM 350 disk storage unit • Released: 1956 • Capacity: 5 million 6-bit characters (3.75 megabytes). • Disk spin speed: 1200 RPM. • Data transfer rate: 8,800 characters per second.
History: Relational Model In the late 1960s, E.F. Codd developed the relational model. Relational modeling was based on a branch of mathematics called set theory and added rigor to the organization and management of data. Relational modeling also improved OLTP (inserting, updating, and deleting data) by making these processes more efficient and reducing data anomalies. The relational model also introduced primary keys, foreign keys, referential integrity, relational algebra, and a number of other concepts used in modern database systems. It soon became apparent that OLAP and OLTP were best served by different data models. Relational data was often denormalized (redundancy deliberately reintroduced) to support OLAP. Structured Query Language (SQL) became the standard language for relational database operations, supporting both OLAP and OLTP.
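A minimal sketch of the primary key, foreign key, and referential integrity concepts mentioned above, in standard SQL; the department and employee tables and their columns are hypothetical examples, not from the original slides.

-- Parent table with a primary key.
CREATE TABLE department (
    dept_id    INTEGER     NOT NULL PRIMARY KEY,
    dept_name  VARCHAR(50) NOT NULL
);

-- Child table: the foreign key enforces referential integrity, so an
-- employee row cannot reference a department that does not exist.
CREATE TABLE employee (
    emp_id     INTEGER      NOT NULL PRIMARY KEY,
    emp_name   VARCHAR(100) NOT NULL,
    dept_id    INTEGER      NOT NULL REFERENCES department (dept_id)
);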
History: Relational Model FACTOIDS Although E.F. Codd was employed by IBM when he created the relational model, and IBM originated the SQL language (then called SEQUEL), IBM was not the first vendor to produce a relational database or to use SQL. The first commercial implementation of a relational database and SQL came from Relational Software, Inc., which is now Oracle Corporation. SQL has been standardized by the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO).
History: Extracts In early relational databases, data was extracted from OLTP systems into denormalized extracts for reporting. [Diagram: OLTP source → OLAP extract → report]
History: Extracts (continued) And more extracts... and more extracts... [Diagram: multiple OLTP sources, each feeding its own OLAP extracts and reports]
History: Naturally Evolving Systems 1 Naturally evolving systems began to emerge.
History: Naturally Evolving Systems 2 • Naturally evolving systems resulted in • Poor organization of data • Extremely complicated processing requirements • Inconsistencies in extract refresh status • Inconsistent report results. • This created a need for architected systems for analysis and reporting. • Instead of multiple extract files, a single source of truth was needed for each data source.
History: Architected Systems Developers began to design architected systems for OLAP data. In the 1980s and 1990s, organizations began to integrate data from multiple sources such as accounts receivable, accounts payable, HR, inventory, and so on. These integrated OLAP databases became known as Enterprise Data Warehouses (EDWs). Over time, methods and techniques for extracting and integrating data into architected systems began to evolve and standardize. The term data warehousing is now used to refer to the commonly used architectures, methods, and techniques for transforming and integrating data to be used for analysis.
History: An Architected Data Warehouse Example of an architected data warehouse. [Diagram: OLTP sources → Staging → History and ODS → data marts (DM) and data sets → reports]
History: Compare Naturally Evolving System Compare: naturally evolving system. [Diagram: the naturally evolving system from the earlier slide, with many redundant extracts and reports]
History: Compare Architected Data Warehouse Compare: architected data warehouse. [Diagram: OLTP sources → Staging → History and ODS → data marts (DM) and data sets → reports]
History: Inmon In the early 1990s, W.H. Inmon published Building the Data Warehouse (ISBN-10: 0471141615). Inmon pulled together the quickly accumulating knowledge of data warehousing and popularized most of the terminology we use today. Data in a data warehouse is extracted from another data source, transformed to make it suitable for analysis, and loaded into the data warehouse. This process is often referred to as Extract, Transform and Load (ETL). Since most data warehouses integrate data from multiple sources, Inmon also described the process as Transformation and Integration (T&I). Data in a data warehouse is stored in history tables, which are modeled for fast query performance. The history tables are the source of truth. Data from history is usually extracted into data marts, which are used for analysis and reporting. Separate data marts are created for each application, and there is often redundant data across data marts.
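A minimal sketch of the transform-and-load step described above: rows landed in a staging table are standardized and appended to a history table. The stg_claims and hist_claims tables and their columns are hypothetical.

-- Hypothetical staging-to-history ETL step.
INSERT INTO hist_claims (claim_id, patient_id, claim_date, claim_amount, source_system, load_date)
SELECT
    s.claim_id,
    UPPER(TRIM(s.patient_id)),      -- transform: normalize the identifier
    CAST(s.claim_date AS DATE),     -- transform: enforce a common date type
    s.claim_amount,
    'SYSTEM_A',                     -- integrate: tag the originating source system
    CURRENT_DATE                    -- audit column recording when the row was loaded
FROM stg_claims s;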
History: Inmon FACTOIDS • W.H. Inmon coined the term data warehouse. • W.H. Inmon is recognized by many as the father of data warehousing. • W.H. Inmon created the first and most commonly accepted definition of a data warehouse: a subject-oriented, nonvolatile, integrated, time-variant collection of data in support of management's decisions. • Other firsts of W.H. Inmon • Wrote the first book on data warehousing • Wrote the first magazine column on data warehousing • Taught the first classes on data warehousing
History: Kimball 1 Also in the 1990s, Ralph Kimball published The Data Warehouse Toolkit (ISBN-10: 0471153370), which popularized dimensional modeling. Dimensional modeling is based on the cube concept, a multi-dimensional view of data. The cube metaphor can only illustrate three dimensions; a dimensional model can have any number of dimensions. [Image: a cube used to represent multi-dimensional data]
History: Kimball 2 Kimball implemented cubes as star schemas, which support querying data across multiple dimensions.
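A minimal star schema sketch using hypothetical names (fact_sales plus date, store, and product dimensions): the fact table holds measures and foreign keys to the dimensions, and a query slices the cube by joining and grouping on dimension attributes.

-- Hypothetical dimension tables.
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_date DATE, month_name VARCHAR(10), year_num INTEGER);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, store_name VARCHAR(50), region VARCHAR(30));
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name VARCHAR(50), category VARCHAR(30));

-- Hypothetical fact table: one row per sale, keyed to each dimension.
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date (date_key),
    store_key    INTEGER REFERENCES dim_store (store_key),
    product_key  INTEGER REFERENCES dim_product (product_key),
    sales_amount NUMERIC(12,2),
    units_sold   INTEGER
);

-- Slice and dice: total sales by region and product category for one year.
SELECT st.region, p.category, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d    ON d.date_key    = f.date_key
JOIN dim_store st  ON st.store_key  = f.store_key
JOIN dim_product p ON p.product_key = f.product_key
WHERE d.year_num = 2012
GROUP BY st.region, p.category;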
History: Kimball 3 Kimball's books do not discuss the relational model in depth, but his dimensional model can be explained in relational terms. A star schema is a useful way to store data for quickly slicing and dicing it across multiple dimensions. Dimensional modeling and star schemas are frequently misunderstood and improperly implemented; queries against incorrectly designed tables in a star schema can skew report results. In many marketing materials the term OLAP has come to refer specifically to dimensional modeling. Star schemas are implemented as data marts so that they can be queried by users and applications. However, data marts are not necessarily star schemas.
History: Kimball 4 FACTOIDS • Ralph Kimball earned a Ph.D. in electrical engineering from Stanford University. • Kimball worked at the Xerox Palo Alto Research Center (PARC). PARC is where laser printing, Ethernet, object-oriented programming, and graphical user interfaces were invented. • Kimball was a principal designer of the Xerox Star Workstation, the first personal computer to use windows, icons, and a mouse.
Paradigm Shift: For Management For Management • The traditional development life cycle doesn't work well when building a data warehouse. There is a discovery process, so agile development works better. • OLTP data was designed for a given purpose, but OLAP is created from data that was designed for some other purpose, not reporting. It is important to evaluate data content before designing applications. • OLAP data may not be as complete or precise as the application requires. • Data integrated from different sources may be inconsistent: • Different code values • Different columns • Different meanings of column names • OLAP data tends to be much larger, requiring more resources. • Storage, storage, storage…
Paradigm Shift: For DBAs For DBAs • Different system configurations (in Oracle, different initialization parameters) • Transaction logging may not be used, and methods for recovery from failure are different. • Different tuning requirements: • Selects are high cardinality (large percentage of rows) • Massive sorting, grouping and aggregation • DML operations can involve thousands or millions of records. • Need much more temporary space for caching aggregations, sorts and temporary tables. • Need different backup strategies. Backup frequency is based on ETL scheduling instead of transaction volume. • May be required to add new partitions and archive old partitions in history tables. • Storage, storage, storage…
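A sketch of the partition maintenance mentioned above, in Oracle syntax; the hist_claims table and partition names are hypothetical, and the exact clauses depend on how the table was partitioned.

-- Hypothetical rolling-window maintenance on a range-partitioned history table:
-- add a partition for the newest month, then drop (or archive) the oldest one.
ALTER TABLE hist_claims ADD PARTITION p_2013_09 VALUES LESS THAN (DATE '2013-10-01');
ALTER TABLE hist_claims DROP PARTITION p_2010_09;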
Paradigm Shift: For Architects & Developers For Architects and Developers • Different logical modeling and schema design • Use indexes differently (e.g. bitmap rather than B-tree) • Extensive use of partitioning for history and other large tables • Different tuning requirements: • Selects are high cardinality (large percentage of rows) • Lots of sorting, grouping and aggregation • DML operations can involve thousands or millions of records. • ETL processes are different from typical DML processes: • Use different coding techniques • Use packages, functions, and stored procedures but rarely use triggers or constraints • Many steps to a process • Integrate data from multiple sources • Iterative and incremental development process (agile development)
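A sketch of the indexing difference noted above, in Oracle syntax: bitmap indexes on low-cardinality columns of a large history table; the table and column names are hypothetical.

-- Bitmap indexes suit columns with few distinct values that are heavily
-- filtered and combined in warehouse queries; in an OLTP system the same
-- columns would more likely get B-tree indexes (or none), because bitmap
-- indexes lock too broadly for high-volume single-row updates.
CREATE BITMAP INDEX bx_claims_gender ON hist_claims (gender);
CREATE BITMAP INDEX bx_claims_state  ON hist_claims (state_code);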
Paradigm Shift: For Analysts and Data Users For Analysts and Data Users: All Good News A custom schema (data mart) can be created for each application per the user requirements. Data marts can be permanent, temporary, generalized, or project-specific. New data marts can be created quickly, typically in days instead of weeks or months. Data marts can easily be refreshed when new data is added to the data warehouse. Data mart refreshes can be scheduled or on demand. In addition to parameterized queries and SQL, there may be additional query tools and dashboards (e.g. Business Intelligence, Self-Service BI, data visualization, etc.). Several years of history can be maintained in a data warehouse, providing bigger samples. There is a consistent single source of truth for any given data set.
Architecture: Main Components Components of a data warehouse. [Diagram: operational data (OLTP sources) → ETL → Staging → History, reference data (REF), and ODS → ETL → data marts (DM) and data sets → reports]
Architecture: Staging and ODS • New data is initially loaded into staging so that it can be processed into the data warehouse. • Many options are available for getting operational data from internal or external sources into the staging area: • SQL*Loader • imp/exp/impdp/expdp (Oracle export/import utilities) • Change Data Capture (CDC) • Replication via materialized views • Third-party ETL tools • Staging contains a snapshot in time of operational data. • An Operational Data Store (ODS) is an optional component that is used for near real-time reporting. • Transformation and integration of data in an ODS is limited. • Less history (a shorter time span) is kept in an ODS. [Diagram: staging and ODS highlighted within the data warehouse components]
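One of the load options listed above, replication via materialized views, sketched in Oracle syntax; the orders source table and the oltp_source database link are hypothetical.

-- Hypothetical: snapshot an operational table into staging over a database link.
CREATE MATERIALIZED VIEW stg_orders
    REFRESH COMPLETE ON DEMAND
    AS SELECT * FROM orders@oltp_source;

-- Refresh the snapshot at the start of an ETL cycle ('C' = complete refresh).
-- (EXEC is SQL*Plus shorthand for an anonymous PL/SQL block.)
EXEC DBMS_MVIEW.REFRESH('STG_ORDERS', 'C');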
Architecture: History and Reference Data • History includes all source data: no exclusions or integrity constraints. • Partitioning is used to: • manage extremely large tables • improve performance of queries • facilitate a "rolling window" of history • Denormalization can be used to reduce the number of joins when selecting data from history. • No surrogate keys: maintain all original code values in history. • Reference data should also have history (e.g. ICD-9 codes changing over time). [Diagram: history and reference data highlighted within the data warehouse components]
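A sketch of a range-partitioned history table supporting the rolling window described above, in Oracle syntax with hypothetical names.

-- Hypothetical monthly range partitioning on the claim date.
-- New months are added and old months dropped or archived over time.
CREATE TABLE hist_claims (
    claim_id      NUMBER        NOT NULL,
    patient_id    VARCHAR2(20),
    claim_date    DATE          NOT NULL,
    claim_amount  NUMBER(12,2),
    source_system VARCHAR2(20)
)
PARTITION BY RANGE (claim_date) (
    PARTITION p_2013_07 VALUES LESS THAN (DATE '2013-08-01'),
    PARTITION p_2013_08 VALUES LESS THAN (DATE '2013-09-01'),
    PARTITION p_2013_09 VALUES LESS THAN (DATE '2013-10-01')
);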
Architecture: Data Marts • Data marts are built to the requirements of users and applications. • Selection criteria (conditions in the WHERE clause) are applied when creating data marts. • Logical data modeling is applied at the data mart level (e.g. denormalized tables, star schemas, analytic data sets, etc.). • Integrity constraints can be applied at the data mart level. • Any surrogate keys can be applied at the data mart level (e.g. patient IDs). • Data marts can be Oracle, SQL Server, text files, SAS data sets, etc. • Data marts can be permanent or temporary, for ongoing or one-time applications. • Data mart refreshes can be scheduled or on demand. [Diagram: data marts highlighted within the data warehouse components]
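A sketch of building a data mart from history with selection criteria in the WHERE clause; the table names, columns, and filters are hypothetical.

-- Hypothetical project-specific data mart: one diagnosis group, recent history only.
CREATE TABLE dm_diabetes_claims AS
SELECT h.claim_id,
       h.patient_id,
       h.claim_date,
       h.claim_amount,
       r.diagnosis_desc
FROM   hist_claims h
JOIN   ref_diagnosis r ON r.diagnosis_code = h.diagnosis_code
WHERE  r.diagnosis_group = 'DIABETES'
AND    h.claim_date >= DATE '2010-01-01';

-- A scheduled or on-demand refresh can truncate the mart and re-run the same SELECT.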
Emerging Technologies Emerging technologies that are having an impact on data warehousing: • Massively Parallel Processing (MPP) • In-Memory Databases (IMDB) • Unstructured Databases • Column-Oriented Databases • Database Appliances • Data Access Tools • Cloud Database Services
Emerging Technologies: MPP Massively Parallel Processing (MPP) Data is partitioned over hundreds or even thousands of server nodes. A controller node manages query execution. A query is passed to all nodes simultaneously. Data is retrieved from all nodes and assembled to produce query results. MPP systems will automatically partition and distribute data using their own algorithms. Developers and architects need only be concerned with conventional data modeling and DML operations. MPP systems make sense for OLAP and data warehousing where queries are on very large numbers of records.
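Some MPP products let the architect declare how rows are spread across nodes when a table is created; this sketch resembles the DISTRIBUTED BY clause used by Greenplum and similar systems, with hypothetical table and column names.

-- Hypothetical: rows are hashed on claim_id and spread across all segment nodes,
-- so scans and aggregations run on every node in parallel.
CREATE TABLE fact_claims (
    claim_id     BIGINT,
    patient_id   VARCHAR(20),
    claim_date   DATE,
    claim_amount NUMERIC(12,2)
)
DISTRIBUTED BY (claim_id);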
Emerging Technologies: IMDB In-Memory Databases Data is stored in random access memory (RAM) rather than on disk or SSD. Memory access is much faster than disk access and eliminates seek time. Traditional RDBMS software often uses a memory cache when processing data, but it is optimized for a limited cache with most data stored on disk; IMDB software uses algorithms optimized to read data from memory. Database replication with failover is typically required because of the volatility of computer memory. The cost of RAM has dropped considerably in recent years, making IMDB systems more feasible. Microsoft SQL Server has an In-Memory option; tables must be defined as memory optimized to use this feature. Oracle has recently announced the upcoming availability of its In-Memory option.
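A sketch of the SQL Server memory-optimized table option mentioned above; it assumes the database already has a MEMORY_OPTIMIZED_DATA filegroup, and the table name and columns are hypothetical.

-- Hypothetical memory-optimized table (SQL Server In-Memory OLTP).
-- DURABILITY = SCHEMA_AND_DATA keeps the data recoverable after a restart.
CREATE TABLE dbo.dm_recent_claims
(
    claim_id      INT           NOT NULL PRIMARY KEY NONCLUSTERED,
    patient_id    NVARCHAR(20)  NOT NULL,
    claim_date    DATE          NOT NULL,
    claim_amount  DECIMAL(12,2) NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);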
Emerging Technologies: Unstructured Databases Unstructured databases, sometimes referred to as NoSQL databases, support vast amounts of text data and extremely fast text searches. Unstructured databases utilize massively parallel processing (MPP) and extensive text indexing. Unstructured databases do not fully support relational features such as complex data modeling, join operations, and referential integrity. However, these databases are evolving to incorporate additional relational capabilities. Oracle, Microsoft, and other RDBMS vendors are creating hybrid database systems that incorporate unstructured data with relational database systems. Unstructured databases are useful for very fast text searches on very large amounts of data; they are generally not useful for complex transaction processing, analyses, and informatics.
Emerging Technologies: Big Data FACTOIDS Big data became an issue as early as 1880 with the U.S. Census, which took several years to tabulate with then-existing methods. The term information explosion was first used in the Lawton Constitution, a small-town Oklahoma newspaper, in 1941. The term big data was used for the first time in an article by NASA researchers Michael Cox and David Ellsworth. The article discussed the inability of then-current computer systems to handle the increasing amounts of data. Google was a pioneer in creating modern hardware and software solutions for big data. Parkinson's Law of Data: "Data expands to fill the space available." 1 exabyte = 1000⁶ bytes = 10¹⁸ bytes = 1,000 petabytes = 1 billion gigabytes.
Emerging Technologies: Column-Oriented Column-Oriented Databases Data in a typical relational database is organized by row; the row paradigm is used for physical storage as well as the logical organization of data. Column-oriented databases physically organize data by column while still presenting data as rows. Data is stored on disk in blocks: while a row-oriented database stores the contents of a row in a block, a column-oriented database stores the contents of a column in a block. Each column has row and table identifiers so that columns can be combined to produce rows of data in a table. Since most queries select a subset of columns (rather than entire rows), column-oriented databases tend to perform much better for analytical processing (e.g. querying a data mart). Microsoft SQL Server and Oracle Exadata have support for column-based data storage.
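A sketch of column-oriented storage in SQL Server, the columnstore index mentioned above; the table and column names are hypothetical.

-- Hypothetical: store the listed fact-table columns column-wise so analytic
-- queries that touch only a few columns read far fewer blocks.
-- (In early SQL Server releases the base table becomes read-only while a
-- nonclustered columnstore index exists.)
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_fact_sales
    ON dbo.fact_sales (date_key, store_key, product_key, sales_amount);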
Emerging Technologies: Appliances Database Appliances A database appliance is an integrated, preconfigured package of RDBMS software and hardware. The most common type of database appliance is a data warehouse appliance. Most major database vendors, including Microsoft and Oracle, and their hardware partners package and sell database appliances for data warehousing. Data warehouse appliances utilize massively parallel processing (MPP). Database appliances generally do not scale well outside of the purchased configuration; for example, you generally don't add storage to a database appliance. A database appliance removes the burden of performance tuning, but database administrators have less flexibility. A database appliance can be a cost-effective solution for data warehousing in many situations.
Emerging Technologies: Data Access Tools Data Access Tools Business Intelligence (BI) tools allow users to view and access data, create aggregations and summaries, create reports, and view dashboards with current data. BI tools typically sit on top of data marts created by the architects and developers. Data marts that support BI are typically star schema. Newer Self-Service BI tools add additional capabilities such as allowing users to integrate multiple data sources and do further analysis on result data sets from previous analyses. Data visualization tools allow users to view data in various graphs. Newer tools allow users to access and analyze data from multiple form factors including smart phones and tablets. Data access, BI and data visualization tools do not always provide the capability to perform complex analyses or fulfill specific requirements of complex reports (e.g. complex statistical analyses or studies submitted to journals). Programming skills are frequently still required.
Emerging Technologies: Cloud Databases Cloud Database Services • Oracle, Microsoft, and other database vendors offer cloud database services. • The cloud service performs all database administrative tasks: • Replicate data on multiple servers • Make backups • Scale growing databases • Performance tuning • Cloud services can be useful for prototyping and exploratory development; a large commitment to hardware purchases and administrative staff can be postponed for later assessment. • Cloud services could result in considerable cost savings for some organizations. • A cloud hybrid database has database components both in the cloud and on local servers. • Cloud services may limit administrative options and flexibility compared with having your own DBAs. • Cloud services may not meet regulatory requirements for security and storage in some industries (e.g. medical data).
Questions? [Diagram: data warehouse components: operational data → ETL → Staging → History, REF, and ODS → ETL → data marts and data sets → reports]