1 / 30

A Taxonomy of ETL Activities

A Taxonomy of ETL Activities. Panos Vassiliadis 1 , Alkis Simitsis 2 , Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi . gr (2) HP Labs alkis@hp.com. Outline. Motivation Normal Form Taxonomy Design Patterns Conclusions. Outline. Motivation Normal Form

holland
Download Presentation

A Taxonomy of ETL Activities

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Taxonomy of ETL Activities Panos Vassiliadis1, Alkis Simitsis2, Eftychia Baikousi1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi.gr (2) HP Labs alkis@hp.com

  2. Outline • Motivation • Normal Form • Taxonomy • Design Patterns • Conclusions DOLAP'09, Hong Kong, Nov. 2009

  3. Outline • Motivation • Normal Form • Taxonomy • Design Patterns • Conclusions DOLAP'09, Hong Kong, Nov. 2009

  4. A developer’s problem Ulysses is a developer in the IT department of his company & he has a problem: he was assigned the task to populate a data mart of the organization with source data => He needs to construct an ETL workflow DOLAP'09, Hong Kong, Nov. 2009

  5. A developer’s problem • Ulysses has a variety of solutions: • Off-the-self ETL tool • PL/SQL scripts • Embedded SQL in Java, C/C++, … • Python, Perl, … scripts • … DOLAP'09, Hong Kong, Nov. 2009

  6. Ulysses also knows that… • Relational DBMSs perform query optimization using • a simple model of data (the relational one) • a (relational) algebra both at the logical and physical levels • Software engineers construct their (O-O) code based on • best practices, or patterns, that allow the efficient construction, maintenance, testing, and execution of code • quality measures that allow the assessment of the quality of the constructed software DOLAP'09, Hong Kong, Nov. 2009

  7. Ulysses also knows that… • Relational DBMSs perform query optimization using • a simple model of data (the relational one) • a (relational) algebra both at the logical and physical levels • Software engineers construct their (O-O) code based on • best practices, or patterns, that allow the efficient construction, maintenance, testing, and execution of code • quality measures that allow the assessment of the quality of the constructed software Wouldn’t it be nice if we had similar tools for ETL flows? DOLAP'09, Hong Kong, Nov. 2009

  8. Can we help Ulysses? Yes, we can! • We can have • a uniform model • to cover a large variety of ETL activities with complicated semantics and provide the framework for the optimization of ETL flows • a taxonomy of ETL activities • based on their internals and the elementary components that characterize them with respect to their properties • a set of archetype patterns • that combine ETL activities in a practical manner that can be used by an ETL engine for optimization, tuning, parallelization, … DOLAP'09, Hong Kong, Nov. 2009

  9. Outline • Motivation • Normal Form • Taxonomy • Design Patterns • Conclusions DOLAP'09, Hong Kong, Nov. 2009

  10. Rationale & Summary of Contribution Is there a uniform, reference way to describe ETL activities? Can we classify both simple activities and complicated activities in an meaningful manner? Is it possible to come up with a simple “algebra” of operations for ETL activities and workflows? DOLAP'09, Hong Kong, Nov. 2009

  11. Rationale & Summary of Contribution We introduce a uniform modeling notation it works for all kinds of ETL activities and ETL flows both, experience and practice verifies that it is based on a mapping to structure of matter Particle ↔ Simple operation Atom ↔ Simple activity Molecule ↔ Complex activity Compound ↔ ETL flow typically, atoms and molecules can be represented as scripts a particle is a certain function call inside the script ETL tools provide atoms and not molecules DOLAP'09, Hong Kong, Nov. 2009

  12. ETL Atoms • ETL atom • an ETL activity that performs exactly one job • it involves exactly one ETL particle • it can be unary or n-ary • it may contain multiple output schemata • it may project out input attributes and generate new ones at the output DOLAP'09, Hong Kong, Nov. 2009

  13. ETL Molecules • ETL molecule • it transfers data from input to output schemata • but there is a linear workflow of particles in between these two groups of schemata. DOLAP'09, Hong Kong, Nov. 2009

  14. Benefits • A normal form is a single, reference way to theoretically discuss properties of ETL activities • Several results can be expressed under this formalism • logical optimization is a prominent case • easy to show that results around swapping of activities (ICDE’05, TKDE’05) fit nicely in this paper • e.g., when can we change the particle order within the same molecule? • what kind of (inter-/intra-) molecule operations can we do? • compose molecules (to be able to form flows) • split or merge molecules (i.e., SW modules) • exchange particles between molecules or swap molecules (for optimization reasons) DOLAP'09, Hong Kong, Nov. 2009

  15. Benefits • Example inter-/intra-molecule operations • Coupling • Activity swapping DOLAP'09, Hong Kong, Nov. 2009

  16. Outline • Motivation • Normal Form • Taxonomy • Design Patterns • Conclusions DOLAP'09, Hong Kong, Nov. 2009

  17. Taxonomy of ETL activities • The taxonomy is based on interrelationship of input-output w.r.t. • schemata • processing of incoming tuples • Exploit taxonomical characteristics for • logical & physical optimization of a workflow • parallelization of activities • any other tuning for improving efficiency, resilience to failures DOLAP'09, Hong Kong, Nov. 2009

  18. Taxonomy of ETL activities Physical-level characteristics blocking semi-blocking non-blocking Final classification unary N-ary # inputs Other DOLAP'09, Hong Kong, Nov. 2009

  19. Applicability DOLAP'09, Hong Kong, Nov. 2009

  20. Benefits • Example uses of taxonomy as heuristics for optimization • 1:1 • locally processed • easily exchangeable with one another (under conditions) • easily parallelizable • N:1 • aggregators are blocking / semi-blocking at best • order or hash sensitive • hard to parallelize • 1:N • tuple producers • order generators • Binary, primary flows • semi-blocking • can be treated as local if right input can be hashed in main memory • Routers • strong possibility to parallelize and enhance pipelining • Rest • mostly blocking, very hard to include in pipelining DOLAP'09, Hong Kong, Nov. 2009

  21. Outline • Motivation • Normal Form • Taxonomy • Design Patterns • Conclusions DOLAP'09, Hong Kong, Nov. 2009

  22. Design Patterns • So far, we have given Ulysses the database related foundations • a notation / model to express ETL molecules (activities) in a uniform way • a taxonomy for the particles (elementary operations) that compose molecules • a set of cases where we can show that optimizations can be decided on the grounds of the above • Now we discuss design related foundations DOLAP'09, Hong Kong, Nov. 2009

  23. Design patterns for ETL • Apart from a normal form for ETL activities, we strive to establish the benefits of having normal forms for combinations of activities • We have introduced butterflies and a set of ETL design patterns (TPC-TC’09) having ‘appropriate’ combinations of activities based on their taxonomical characteristics • These patterns have been proved to be beneficiary for many purposes (design, efficiency, benchmarking) DOLAP'09, Hong Kong, Nov. 2009

  24. Butterflies for ETL DOLAP'09, Hong Kong, Nov. 2009

  25. Butterfly classes DOLAP'09, Hong Kong, Nov. 2009

  26. Benefits • Design Guidelines • allows designers to communicate designs (document, maintain, …) better • Performance • the synthesis of ETL scenarios should be performed in a manner that allows the underlying engine to exploit them. • Beneficiary for the purpose of benchmarking ETL as well DOLAP'09, Hong Kong, Nov. 2009

  27. Benefits in efficiency In V. Tziovara’s MSc @ UoI (also DOLAP’07) it is shown that we can introduce sorters intentionally to a workflow to exploit common sorting of data for sequences of order-dependent operators Different butterflies can benefit from different policies DOLAP'09, Hong Kong, Nov. 2009

  28. Benefits in scheduling Different scheduling policies can be used for different butterflies (A. Karagiannis MSc, UoI) DOLAP'09, Hong Kong, Nov. 2009

  29. Outline • Motivation • Normal Form • Taxonomy • Design Patterns • Conclusions DOLAP'09, Hong Kong, Nov. 2009

  30. Conclusions • We have presented and discussed • a uniform model to cover a large variety of ETL activities with complicated semantics and provide the framework for the optimization of ETL flows • a taxonomy for the internals of ETL activities that allows us to treat them as black-boxes in optimization, scheduling, parallelization, … • a set of design patterns for combinations of ETL activities in archetype patterns • Future work on optimization • optimization techniques • alternative physical implementations for the same logical operations • resource allocation and scheduling policies • possibilities for parallelization DOLAP'09, Hong Kong, Nov. 2009

More Related