Κεφάλαιο 6

Κεφάλαιο 6 ΣΧΕΔΙΑΣΜΟΣ ΒΑΣΕΩΝ ΔΕΔΟΜΕΝΩΝ ΚΑΙ ΚΑΝΟΝΙΚΟΠΟΙΗΣΗ (Normalization)

Outline • Relational Database Design and Normalization • Relational Database Design - Objectives • Criteria / Guidelines for a good design • Functional Dependencies - The Normal Forms • Decompositions • Dependency Preservation and Lossless-joins • More Dependencies (Multivalued, Join)

Relational Database Design • RELATIONAL DESIGN: Grouping the attributes to form “good” relation schemes (for base relations) • INFORMAL CRITERIA: • Aim for Semantic Clarity • Avoid Redundancy for space efficiency • Avoid Update Anomalies (integrity) • Avoid the need for NULL values in tuples • Aim for Linguistic Efficiency • Aim for Performance

Design Criteria (1) • Aim for Semantic Clarity • Informally (and ideally), each tuple should represent exactly one entity or relationship instance • In other words, relation tuple should not be overloaded with semantic information (e.g., represent more than one facts) • Different entities should not be mixed -- only way for inter-reference between relations should be the “foreign key” • Avoid Redundancy for space efficiency • Redundant storage implies wasting storage • Redundancy may cause inconsistencies (also called, update anomalies)

Design Criteria (2) • Avoid Update Anomalies • The integrity of the database is damaged when insertion, deletion and modification anomalies occur. • An update in one place may cause an unpredictable number of updates in other places. • Avoid Null Values in Tuples • Null values imply “lack of knowledge” or “inapplicability” and frequently cause processing problems (correct execution of queries requires the use of a different logic (higher level) than the two-valued logic we use in traditional systems)

Design Criteria (3) • Aim for Linguistic Efficiency • The simpler the queries in the applications the better for the application programmer and (usually) for the query optimizer • Queries are often simpler when they are executed on relations that have a lot of information (e.g., no need to do many joins for getting the result) • Aim for Performance • As in the previous case, when the relations have very few attributes (e.g., they are binary), a large number of joins is necessary for processing the queries. • MANY OF THE CRITERIA ARE CONTRADICTORY

Example of a Bad Design • Assume a database with one relation scheme: ED(SSN, EName, Salary, DNumber, DName, Location, MgrSSN) Q1: Find all employees that make more than their manager select e.EName from ED e, ED m where e.MgrSSN = m.SSN and e.Salary > m.Salary Q2: For each department, find the maximum salary select DName, max(Salary) from ED group by DNumber

Example of a Bad Design (2) • COMMENTS • The queries are not very complicated (linguistic efficiency) BUT, we get UPDATE ANOMALIES: (a) Redundancy: For every employee, the department information is repeated (b) Modification Anomalies:For a simple change, like changing the manager of a Department, an unpredictable number of tuples must be changed (c) InsertionAnomalies: A new department cannot be entered in the database, unless we have an employee working in this new department (d) DeletionAnomalies: When the last employee in a department is deleted, we lose all information for the department (also called:no independent existence)

Example of a Better Design • We can choose the equivalent design with two relation schemes: EMPLOYEE(SSN, EName, Salary, DNumber) DEPARTMENT(DNumber, DName, Location, MgrSSN) Q1: Find all employees that make more than their manager select e.EName from EMPLOYEE e, EMPLOYEE m, DEPARTMENT d where e.DNumber = d.DNumber and e.MgrSSN = m.SSN and e.Salary > m.Salary

Example of a Better Design (2) Q2: For each department, find the maximum salary select d.DName, max(e.Salary) from EMPLOYEE e, DEPARTMENT d where d.DNumber = e.DNumber group by d.DNumber • The queries are more complex (linguisticinefficiency) and require more joins (performance), BUT AVOID ALL update anomalies • OUR OBJECTIVE IN THE SEQUEL IS TO EXAMINE HOW THIS SCHEMA CAN BE CHOSENFORMALLYAND SYSTEMATICALLY

Functional Dependencies • Functional Dependencies are the most commonformal measure of “goodness” for relational database designs • They are used to define the normal forms for relations • A Functional Dependency specifiesa constraint on all relation instances r(R), but is a property of the attributes in the schema R • DEFINITION: Let X, Y be set of attributes of relation scheme R. We say that the functional dependency (FD): X ® Y holdsif the X-value uniquely determines the Y-value.

FD Definition - continued • ALTERNATIVE DEFINITION: Let X, Y be set of attributes of relation scheme R. We say that the (FD): X ® Y holdsif whenever two tuples in an instance r(R) have the same value for X, then theymust have the same value for Y This is written as: for any two tuples t1 , t2 in any relation instance r(R): If t1 [X] = t2 [X] then t1 [Y] = t2 [Y] • If K is a key in R, then K functionally determines ALL attributes in R (since we can never have two distinct tuples t1, t2 with t1 [K] = t2 [K] )

Properties of FDs • TRIVIAL DEPENDENCY: WheneverY Í X, then X ® Y Example: SSN, Salary ® Salary • FULL FUNCTIONAL DEPENDENCY: We say that a set of attributes Y isfully functionally dependent on a set of attributes X, if it is FD on X and not FD on any subset of X,i.e., Y isfully functionally dependent on X, if X ® Y andthere exists no W such that W Ì X and W ® Y Example: SSN, PNumber ® HrsPW (HrsPW is fully functionally dependent on both SSN and PNumber)

Properties of FDs - continued • PARTIAL DEPENDENCY: As a consequence of the above definition, we say that Y is partially dependent on X, if X ® Y and there exists a W such that W Ì X and W ® Y Example: SSN,Salary ® Address (but also, SSN ® Address) • TRANSITIVE DEPENDENCY: A functional dependency X ® Z istransitive if it can be derived from two other FDs through transitivity (X ® Y and Y ® Z ) Example: SSN ® DNumber and DNumber ® MgrSSN, imply: SSN ® MgrSSN • PRIME ATTRIBUTE: One that is not in a candidate key

NORMALIZATION • Normalization is concerned with incorporating the semantic notions of FDs into therelation schemes themselves. • There are severalNormal Forms, where the following has been proven: All Relations 1NF 2NF 3NF BCNF 4NF 5NF

Normal Forms (1) • FIRST NORMAL FORM (1NF): R is in 1NFif every attribute has an atomic value. We assume that ALL the relations we work with are at least in 1NF. Example: R(ENumber, ChildrenNames) is not in 1NF • SECOND NORMAL FORM (2NF): R is in 2NFif it is in 1NF and no non-prime attribute is partially dependent on a candidate key. Example: SUPPLIER(SNumber, SName, ItemNumber, Price) is not in 2NF, since the combination SNumber, ItemNumberis a candidate key, but also SNumber ® SName holds

Normal Forms (2) • THIRD NORMAL FORM (3NF): R is in 3NF if it is in 2NF and no non-prime attribute is transitively dependent on a candidate key Example: The relation: ED(SSN, EName, Salary, DNumber, DName, Location, MgrSSN) is in 2NF BUT NOT in 3NF, since SSN is the (only) candidate key and we have: SSN ® Dnumber and DNumber ® MgrSSN (i.e., MgrSSN is transitively dependent on SSN)

Normal Forms (3) • Equivalent definition of THIRD NORMAL FORM (3NF): R is in 3NFif and only if for every FD X ® A, where X is a set of attributes in R and A is a single attribute, at least one of the following is true: 1.- A Í X (the FD is trivial) 2.- K Í X (with K being a candidate key of R) 3.- A Í K (with K being a candidate key of R)

Normal Forms (4) • BOYCE-CODD NORMAL FORM (BCNF): R is in BCNFif and only if when an FD X ® Y holdsthen X is a candidate key (where X, Y are sets of attributes in R) • Equivalent definition of BOYCE-CODD NORMAL FORM (BCNF): R is in BCNFif and only if for every FD X ® A, where X is a set of attributes in R and A is a single attribute, at least one of the following is true: 1.- A Í X (the FD is trivial) 2.- K Í X (with K being a candidate key of R) Note: this is exactly like 3NF without option no.3, which shows directly thatBCNF implies 3NF

Normal Forms (5) Example: The relation: RESTAURANT(Client, Food, ReceiptNumber) is in 3NF BUT NOT in BCNF, since we have: Client, Food ® ReceiptNumber, and ReceiptNumber ® Food (this is not a key dependency) • Any relationcanbe transformed toequivalentrelations in 3NF (with the use of well-known algorithms) The process is calleddecomposition or 3NF normalization. • Unfortunately, there are some 3NF relations that cannot be transformed to BCNF (as the characteristic example above)

Normalization Theory • Given a set of FDs F we can determine additional FDs that hold whenever the FDs in F hold - to determine such FDs we needinference rules. A1. if Y Í X then X ® Y (Reflexivity) A2. if X ® Y then XZ ® YZ (Augmentation) A3. if X ® Y and Y ® Z then X ® Z (Transitivity) Armstrong proved that: {A1, A2, A3) is a sound and complete set of inference rules i.e., the rules generate only valid FDs and all FDs that can be inferred will be generated by the above rules

Inference Rules • Other inference rules also hold (but can deduced from Armstrong’s rules {A1, A2, A3) above) A4. if X ® YZ then X ® Y and X ® Z (Decomposition) A5. if X ® Y and X ® Z then X ® YZ (Union) A6. if X ® Y and WY ® Z then WX ® Z (Pseudotransitivity) • Theclosure F+ of a set of FDs F is the set of all FDs that can be inferred from F (by applying the Armstrong rules) • Theclosure X+ of a set of attributes X with respect to a set FDs F is the set of all attributes that are functionally determined by X (by applying the Armstrong rules in F)

Equivalence of FD Sets • Two sets of FDs F and G areequivalent if every FD in F can be inferred from G and every FD in G can be inferred from F (i.e., F and G are equivalent if F+ = G+ ) • F covers G if every FD in G can be inferred from F (i.e., G+ Í F+ ) • A set of FDs is minimal if it satisfies the following: (1) Every dependency in F is of the form: X ® A, where A is a single attribute (2) We cannot remove any dependency from F and still have a set of dependencies which is equivalent to F (3) We cannot replace any dependency X ® A in F with a dependency Y ® A, where Y Ì X and still have a set of dependencies which is equivalent to F

Results from FD Theory • There is a simple algorithm for checking equivalence between two sets of FDs • Every set of FDs has an equivalent minimal set • There is nosimple (efficient) algorithm for computing a minimal set of FDs that is equivalent to a set F of FDs • Having a minimal set is important for several relational design algorithms • To ensure a good relational design we need to establish additional criteria (lossless join property, dependency preserving property) and algorithms that preserve them.

Relational Decomposition • STARING POINT of all algorithms is auniversal relation scheme R containingallthe database attributes • OBJECTIVE of the design is a decomposition D of R into m relation schemas R1, R2, R3, ... Rm where each Ri contains a subset of the attributes of R and every attribute in R should appear in at least one relation scheme Ri • Intuitively, decomposing R means we will store instances of the relation schemes produced by the decomposition, instead of instances of R.

Problems with Decompositions • There are three potential problems to consider: • Some queries become more expensive. • e.g., Queries may now require many JOINS • Given instances of the decomposed relations, we may not be able to reconstruct the corresponding instance of the original relation! • The case of the spurious tuples • Checking some dependencies may require joining the instances of the decomposed relations. • Attributes in dependencies now are in more than one relations • Tradeoff: Must consider these issues vs. redundancy

Dependency Preservation • DEPENDENCY PRESERVATION PROPERTY: The decomposition D should preserve the dependencies; that is, the collection of all dependencies that hold on relations Ri should be equivalent to F (the FDs that hold on R) • Formally: Definition: The projection of F on Ri , denoted by PF(Ri), is the set of FDs X ® Y in F+ such that (X È Y) Í Ri A decomposition D is dependency preserving if: (PF(R1) ÈPF(R2) ... ÈPF(Rm) )+ = F+ • There is an algorithm, called the relational synthesis algorithm, that decomposes R into a dependency preserving decomposition D = {R1, R2, R3, ... Rm} with respect to F, such that each Ri is in 3NF • The algorithm is based on minimal covers and as we said earlier, there are no efficient algorithms to find such covers

Lossless Join Property • LOSSLESS JOIN PROPERTY: This property ensures that no new (spurious) tuples appear when relations in the decomposition are joined Formally, A decomposition D = {R1, R2, R3, ... Rm} of R has the lossless join property with respect to a set of FDs F, if for every relation instance r(R) whose tuples satisfy all the FDs in F, we have: (PR1(r(R)) PR2(r(R)) ... PRm(r(R)) ) = r(R) • This is a very important condition for a decomposition, since it directly impacts how meaningful the queries will be in the decomposed relations.

Spurious Tuples - Example • Consider the COMPANY database and a relation E_P(SSN, PNumber, HoursPW, EName, PName, Location) which is generated by taking the natural joins between EMPLOYEE, PROJECT and WORKS_ON (on instances) and then projecting on the above attributes Convince yourself that the decomposition of E_P into: R1 (EName, Location) R2 (SSN, PNumber, HoursPW, PName, Location) does not have the lossless-join property HINT: Make projections on the instance of E_P (for R1, R2) and then join these projections. You will not get back E_P !

More on Lossless Join • The decomposition of R into X and Y is lossless-join wrt F if and only ifthe closure of F contains: • X Y X, or • X Y Y • In particular, the decomposition of R into UV and R - V is lossless-join if U V holds over R.

Results for Decompositions • There is a simple algorithm for testing whether a decomposition D satisfies the lossless join property with respect to a set F of FDs • There is a simple algorithm for decomposing R into BCNF relations such that the decomposition D satisfies the lossless join property with respect to a set F of FDs on R • There is no algorithm for decomposition into BCNF relations that is dependency preserving • There is an algorithm (modification of the synthesis one) for decompositions into 3NF (not BCNF) relations which guarantees both dependency preservation and lossless join

Decomposition into 3NF ALGORITHM: 1.- Find a minimal set of FDs G equivalent to F 2.- For each X of an FD X ® Y in G Create a relation schema Ri in D with the attributes { X È A1È A2È ..., È Ak} , where each Aj is an attribute appearing in an FD in G with X as left hand side 3.- If any attributes in R are not placed in any Ri then create another relation in D for these attributes 4.- If none of the relations in D contain a key of R, create a relation that contains a key of R and add it to D • The algorithm above is guaranteed to generate relations in 3NF that preserve both properties

Additional Dependencies • Employing FDs we can get as far as 3NF and BCNF • Additional Dependencies are: multivalued dependencies (MVDs),join dependencies (JDs), inclusion dependencies (IDs), ... Such dependencies lead to normal forms beyond 3NF and BCNF (i.e., 4NF and 5NF) • Multivalued Dependencies: (Informal Definition). A set of attributes X multidetermines a set of attributes Y if the value of X determines a set of values for Y (independently of any other attributes)

Multivalued Dependencies • An MVD is written as X ®® Y • There are sound and complete inference rules for MVDs • Every MVD is also an FD (special case) • A relation schema R is in fourth normal form (4NF) with respect to a set of functional and multivalued dependencies F if for any non-trivial multivalued dependency X ®® Y in F+ then X is a candidate key of R • There is an efficient algorithm for decomposing R into 4NF relations such that the decomposition has the lossless join property with respect to a set F of FDs and MVDs on R

Join Dependencies • A join dependency JD(R1, R2, R3, ... Rm) is a constraint on R specifying that every legal instance r(R) should have alossless join decomposition into R1, R2, R3, ... Rm • An MVD is a special case of a JD (where m=2) • A relation schema R is in fifth normal form (5NF) with respect to a set of functional, multivalued and join dependencies F if for any non-trivial JD(R1, R2, ... Rm) in F+ then each Ri is a super key of R • 5NF is sometimes called PJNF

Inclusion Dependencies • FDs, MVDs and JDs are defined within the same relation scheme R (they do not relate attributes from different relation schemes) • There are other constraints, like the inclusion dependencies, that are used to specify referential integrity and class/subclass hierarchiesbetween two relations R and S • An inclusion dependency R.X < S.Y specifies that at any point in time, if r(R) and s(S) are relation instances of R and S, then PX(r(R)) ÍPY(s(S)) • Until now, there are no proposals for normal forms based on the concept of inclusion dependencies

Practical Issues with Normalization • There is a large number of commercial database tools that, given a set of relation schemes and a set of dependencies (usually FDs), automatically generate relation schemes in 3NF (rarely they go for BCNF, 4NF and 5NF) • Another use of such tools is to check a given relation’s level of normalization and as a heuristic for selecting one design instead of another • There are also some practical theory results that let the designer evaluate the design in a very simple manner, like: • If a relation is in 3NF and every candidate key consists of exactly one attribute, then it is also in 5NF (Fagin, 1991)

Comments on Normalization • The Normalization Process has several drawbacks: • It is not constructive -- there is no provision for getting a “good” design (with the criteria listed before) • It is usually applied after we have a schema (to tell us whether this is a good or bad one) • It provides no conceptual design (it focuses on and deals with relations and attributes) • On the other hand, it is a good attempt to formalize some of the things with which we usually work by intuition

Κεφάλαιο 6

Κεφάλαιο 6

Presentation Transcript