Design Theory for Relational Databases

Design Theory for Relational Databases Functional Dependencies Decompositions Normal Forms

Redundancy • Dependencies between attributes cause redundancy • Ex. All addresses in the same town have the same zip code SSN Name Town Zip ……. 1234 Joe Stony Brook 11733 …….. 4321 Mary Stony Brook 11733 5454 Tom Stony Brook 11733 …………………. redundant

Redundancy and Other Problems • Set valued attributes result in multiple rows in corresponding table • Example: Person (SSN, Name, Address, Hobbies) • A person entity with multiple hobbies yields multiple rows in table Person • Hence, Name, Address stored redundantly • SSN should be the key, but instead (SSN, Hobby) is key of corresponding relation • Person can’t describe people without hobbies! SSN Name Address Hobby 1111 Joe 123 Main biking 1111 Joe 123 Main hiking …………….

Anomalies • Redundancy leads to anomalies: • Update anomaly: A change in Address must be made in several places • Deletion anomaly: Suppose a person gives up all hobbies. Do we: • Set Hobby attribute to null (no, since Hobby is part of key) • Delete the entire row (no, since we lose other information in the row) • Insertion anomaly: Hobby value must be supplied for any inserted row (since Hobby is part of key)

Decomposition • Solution: use two relations to store Person information • Person1 (SSN, Name, Address) • Hobbies (SSN, Hobby) • The decomposition is more general: people with hobbies can now be described • No update anomalies: • Name and address stored once • A hobby can be separately supplied or deleted

Normalization Theory • Anomalies: • update anomaly occurs if changing the value of an attribute leads to an inconsistent database state. • insertion anomaly occurs if we cannot insert a tuple due to some design flaw. • deletion anomaly occurs if deleting a tuple results in unexpected loss of information. • Normalization is the systematic process for removing all such anomalies in database design, based on functional dependencies

Functional Dependencies • X ->Y is an assertion about a relation R that whenever two tuples of R agree on all the attributes of X, then they must also agree on all attributes in set Y. • Say “X ->Y holds in R.” • Convention: …, X, Y, Z represent sets of attributes; A, B, C,… represent single attributes. • Convention: no set formers in sets of attributes, just ABC, rather than {A,B,C }.

Splitting Right Sides of FD’s • X->A1A2…An holds for R exactly when each of • X->A1 • X->A2,…, • X->An hold for R. • Example: A->BC is equivalent to A->B and A->C. • There is no splitting rule for left sides. • We’ll generally express FD’s with singleton right sides.

Functional Dependencies • Dependencies for this relation: • A  B • A  D • B,C  E,F • Do they all hold in this instance of the relation R? • Functional dependencies are specified by the database programmer based on the intended meaning of the attributes.

Example: FD’s Drinkers(name, addr, beersLiked, manf, favBeer) • Reasonable FD’s to assert: • name -> addr favBeer • Note this FD is the same as name -> addr and name -> favBeer. • beersLiked -> manf

Because name -> favBeer Because name -> addr Because beersLiked -> manf Example: Possible Data name addr beersLiked manf favBeer Janeway Voyager Bud A.B. WickedAle Janeway Voyager WickedAle Pete’s WickedAle Spock Enterprise Bud A.B. Bud Spock Enterprise Bud-Lite A.B. Bud

Keys of Relations • K is a superkey for relation R if K functionally determines all of R. • K is a key for R if K is a superkey, but no proper subset of K is a superkey (or K is minimal superkey).

Example: Superkey Drinkers(name, addr, beersLiked, manf, favBeer) • {name, beersLiked} is a superkey because together these attributes determine all the other attributes. • name -> addr favBeer • beersLiked -> manf

Example: Key • {name, beersLiked} is a key because neither {name} nor {beersLiked} is a superkey. • name doesn’t -> manf; beersLiked doesn’t -> addr. • There are no other keys, but lots of superkeys. • Any superset of {name, beersLiked}.

Where Do Keys Come From? • Just assert a key K. • The only FD’s are K -> A for all attributes A. • Assert FD’s and deduce the keys by systematic exploration.

More FD’s From Derivation • Example: “no two courses can meet in the same room at the same time” tells us: hour room -> course.

Inferring FD’s • We are given FD’s F={X1 -> A1, X2 -> A2,…, Xn -> An}, and we want to know whether an FD Y -> B must hold in any relation that satisfies the given FD’s. • What if we what all possible FD’s that follow from F? • Example: If A -> B and B -> C hold, surely A -> C holds, even if we don’t say so. • Important for design of good relation schemas.

Closure of FDs • If F is a set of FDs on schema R and f is another FD on R, then F entails f if every instance r of R that satisfies F also satisfies f • Ex: F = {A  B, B C} and f is A  C If Streetaddr Town and Town  Zip then Streetaddr  Zip • The closure of F, denoted F+, is the set of all FDs entailed by F • F and G are equivalent if F entails G and G entails F

Computing the Closure F+: Armstrong’s Axioms of FDs • Reflexivity: If Y  X then X  Y (trivial FD) • Name, Address Name • Augmentation: If X  Y then X Z YZ • If Town  Zip then Town, Name  Zip, Name • Transitivity: If X  Y and Y  Z then X  Z

Other derived rules • Union: If X  Y and X  Z, then X  YZ • Derivation using only Armstrong’s Axioms: • X  YX (augment), YX YZ (augment) • thus X YZ (transitive) • Decomposition: If X  YZ, then X  Y and X  Z • Derivation using only Armstrong’s Axioms: • YZ  Y (reflexive), thus X  Y (transitive) • YZ  Z (reflexive), thus X  Z (transitive)

Generating F+ F AB C AB BCD A D AB BD AB BCDE AB CDE D E BCD  BCDE union aug trans decomp aug Thus, AB BD, AB  BCD, AB  BCDE, and AB  CDE are all elements of F+

Computing the Closure F+ via Attribute Closure(Y+) • Calculating attribute closure is a more efficient way of checking entailment • The attribute closureof a set of attributes Y with respect to a set of functional dependencies, F, (denoted YF+or just Y+) is the set of all attributes, A, such that F entails Y  A • Checking entailment: Given a set of FDs, F, then Y  A if and only if A  Y+

Closure Test • F={X1 -> A1, X2 -> A2,…, Xn -> An} • Test to compute the closure of Y, denoted Y +. • Basis: Y+ = Y. • Induction: Look for an FD’s left side Xi that is a subset of the current Y+. If the FD is Xi -> Ai, add Ai to Y+.

Computation of Attribute Closure Problem: Compute the attribute closure of Y=AB With respect to the set of FDs: AB  C (a) A  D (b) D  E (c) AC  B (d) Initially Y+ = Y = {AB} Solution:

Algorithm for Computing F+ via Attribute Closure • For each subset of attributes X, compute X+. • Add non-trivial FDs: X ->A for all A in X + - X. • Drop non-minimal FDs: XY ->A can be dropped whenever we discover X ->A • Some Tricks: • No need to compute the closure of the empty set or of the set of all attributes. • If we find X + = all attributes, so is the closure of any superset of X.

Example – Computing F+ via Attribute Closure Y Y+ New FDs A {A,D,E} A->E B {B} (trivial) C {C} (trivial) D {D,E} (original FD) E {E} (trivial) AB {A,B,C,D,E} (AB is a key) AB->CDE AC {A,B,C,D,E} (AC is a key) AC->BDE AD {A,D,E} (not minimal) AE {A,D,E} (not minimal) BC {B,C} (trivial) BD {B,D,E} BD->E BE {B,E} (trivial) CD {C,D,E} CD->E CE {C,E} (trivial) DE {D,E} (trivial) … (and so on) F = {AB  C, A  D, D  E, AC  B}

Finding All Implied FD’s in a Projected Schema • Motivation: “normalization,” the process where we break a relation schema into two or more schemas. • Example: ABCD with FD’s AB ->C, C ->D, and D ->A. • Decompose into ABC, AD • What FD’s hold in ABC ? • Not only AB ->C, but also C ->A ! • Same as previous algorithm, but restrict to those FD’s that involve only attributes of the projected schema

Example: Projecting FD’s • ABC with FD’s A ->B and B ->C. Project onto AC. • A +=ABC ; yields A ->B, A ->C. • We do not need to compute AB + or AC +. • B +=BC ; yields B ->C. • C +=C ; yields nothing. • BC +=BC ; yields nothing. • Resulting FD’s: A ->B, A ->C, andB ->C. • Projection onto AC : A ->C. • Only FD that involves a subset of {A,C }.

Normal Forms & Relation Decomposition

Relational Schema Design • Goal of relational schema design is to avoid anomalies and redundancy. • Update anomaly: one occurrence of a fact is changed, but not all occurrences. • Deletion anomaly : valid fact is lost when a tuple is deleted • Insertion anomaly : cannot insert a valid fact when some information is unknown (i.e., extraneous attributes in primary key)

Example of Bad Design Drinkers(name, addr, beersLiked, manf, favBeer) name addr beersLiked manf favBeer Janeway Voyager Bud A.B. WickedAle Janeway ??? WickedAle Pete’s ??? Spock Enterprise Bud ??? Bud Data is redundant, because each of the ???’s can be figured out by using the FD’s name -> addr favBeer and beersLiked -> manf.

This Bad Design AlsoExhibits Anomalies name addr beersLiked manf favBeer Janeway Voyager Bud A.B. WickedAle Janeway Voyager WickedAle Pete’s WickedAle Spock Enterprise Bud A.B. Bud • Update anomaly: if Janeway is transferred to Intrepid, • will we remember to change each of her tuples? • Deletion anomaly: If nobody likes Bud, we lose track • of the fact that Anheuser-Busch manufactures Bud. • Insertion anomaly: Cannot insert Picard if we do not know • what beers he likes

Decompositions • Do we need to decompose a relation? • Several normal forms for relations. If schema in these normal forms certain problems don’t arise • The two commonly used normal forms are third normal form (3NF) and Boyce-Codd normal form (BCNF) • What problems does decomposition cause? • Lossless-join property: get original relation by joining the resulting relations • Dependency-preservation property: enforce constraints on original relation by enforcing some constraints on resulting relations • Queries may require a join of decomposed relations!

Boyce-Codd Normal Form • We say a relation R is in BCNF if whenever X ->Yis a nontrivial FD that holds in R, X is a superkey. • Remember:nontrivial means Y is not contained in X. • Remember, a superkey is any superset of a key (not necessarily a proper superset).

Example Drinkers(name, addr, beersLiked, manf, favBeer) FD’s: name->addr favBeer, beersLiked->manf • Only key is {name, beersLiked}. • In each FD, the left side is not a superkey. • Any one of these FD’s shows Drinkers is not in BCNF

Another Example Beers(name, manf, manfAddr) FD’s: name->manf, manf->manfAddr • Only key is {name} . • name->manf does not violate BCNF, but manf->manfAddr does.

Decomposition into BCNF • Given: relation R with FD’s F. • Look among the given FD’s for a BCNF violation X ->Y. • If any FD following from F violates BCNF, then there will surely be an FD in F itself that violates BCNF. • Compute X+. • Not all attributes, or else X is a superkey.

Decompose R Using X -> Y • Replace R by relations with schemas: • R1 = X+. • R2 = X + (R – X+) • Project given FD’s F onto the two new relations.

Example: BCNF Decomposition Drinkers(name, addr, beersLiked, manf, favBeer) F = name->addr, name -> favBeer, beersLiked->manf • Pick BCNF violation name->addr. • Close the left side: {name}+ = {name, addr, favBeer}. • Decomposed relations: • Drinkers1(name, addr, favBeer) • Drinkers2(name, beersLiked, manf)

Example -- Continued • We are not done; we need to check Drinkers1 and Drinkers2 for BCNF. • Projecting FD’s is easy here. • For Drinkers1(name, addr, favBeer), relevant FD’s are name->addr and name->favBeer. • Thus, {name} is the only key and Drinkers1 is in BCNF.

Example -- Continued • For Drinkers2(name, beersLiked, manf), the only FD is beersLiked->manf, and the only key is {name, beersLiked}. • Violation of BCNF. • beersLiked+ = {beersLiked, manf}, so we decompose Drinkers2 into: • Drinkers3(beersLiked, manf) • Drinkers4(name, beersLiked)

Example -- Concluded • The resulting decomposition of Drinkers : • Drinkers1(name, addr, favBeer) • Drinkers3(beersLiked, manf) • Drinkers4(name, beersLiked) • Notice: Drinkers1 tells us about drinkers, Drinkers3 tells us about beers, and Drinkers4 tells us the relationship between drinkers and the beers they like.

Design Theory for Relational Databases