1.27k likes | 1.58k Views
SQL Unit 15 Normalization. prepared by Kirk Scott. 1. Normal Forms 2. First Normal Form 3. Second Normal Form 4. Third Normal Form 5. Boyce- Codd Normal Form 6. Higher Normal Forms 7. Domains 8. Nulls and Integrity. 1. Normal Forms.
E N D
SQL Unit 15Normalization prepared by Kirk Scott
1. Normal Forms • 2. First Normal Form • 3. Second Normal Form • 4. Third Normal Form • 5. Boyce-Codd Normal Form • 6. Higher Normal Forms • 7. Domains • 8. Nulls and Integrity
1. Normal Forms • The benefits of relational database theory can be summarized as follows: • There is a step-by-step way of arriving at a correct design • There is a way of detecting flaws in a design • The design process has to do with the problem domain, not with computer-related questions
The database designer and user are protected from questions related to the implementation of the dbms and the hardware it’s running on • Finally, if the design is correct, it will be possible to: • Store all desired information in it; • Update the information on an ongoing basis; • Retrieve any/all of the information as needed.
Correct designs are based on what are called normal forms. • This section presents the background information to the design process. • It also discusses and illustrates the use of normal forms.
Identifying Entities • At its most basic level, design of a database depends on determining what you want to store information about. • When deciding what the base tables will be, you are trying to identify entities. • From a language point of view, this involves identifying nouns which do not modify other things.
Identifying Attributes • Identifying the entities leads to identifying their attributes. • Attribute names usually end up being nouns too, but you figure out what they are when you try to describe entities, and the descriptions usually involve adjectives. • One of the key points of database design is that you only store information about the entities and attributes you need to.
There may be many possible entities • All entities may have a long list of potential attributes • But you limit yourself to only those things you will need to retrieve information about in the future.
Identifying Keys • You are familiar with primary keys and foreign keys. • When trying to organize the attributes around entities in the design, the idea is to equate an entity with a primary key field • Then group the attributes with the entities that they describe. • Relationships between tables are captured by embedding the primary keys of one or more tables as foreign keys in other tables.
Functions, Determination, and Dependency • When described in general, the foregoing sounds sensible enough. • That’s why the book claims that if you can model successfully, the result will be a correct design • In practice it can be difficult to do without formal guidelines. • This is what the normal forms provide.
The normal forms are based on and described in terms of an idea taken from math. • One field in a table may functionally determine another. • Stated in reverse order: The other field depends functionally on the one.
This is an example of a mathematical function: • y = f(x), for example, y = x2 • y is a function of x. • x is in the domain and y is in the range. • x functionally determines y • Or, y functionally depends on x.
For a mathematical function, you find the dependent value by doing some sort of computation on the determining value. • The key point underlying a function is the following: • For each value of x, there can only be one corresponding value of y. • x uniquely determines y.
The analogy in database design is the following: • The primary key of a table should functionally determine the values of the other fields in the table. • In other words, the non-key fields should functionally depend on the primary key field. • Just as the primary key uniquely identifies a record, it uniquely determines the values of the fields in the record
Take this small table for example: • This is its schema: • Person(SSN, name, dob)
You don’t find a person’s name or birthdate by doing a computation on their social security number. • However, given any one social security number, there is exactly one corresponding name and exactly one corresponding date of birth. • It is true that different people with different social security numbers may have the same name and the same date of birth, but this is not a problem.
The point of the primary key field is that it is the unique identifier that makes it possible to distinguish between these two people. • This idea came up at the beginning of the course • The point now is that the name and date of birth fields functionally depend on the social security number field.
A new notation can be used to indicate this. • In this notation, the arrows go from the field that functionally determines another field, to the field that is dependent. • This is illustrated on the next overhead.
Normal Forms • Some of the normal forms are identified by number, for example 1st, 2nd, and 3rd normal forms. • Others are identified by name, for example Boyce-Codd normal form, named after the people who discovered it. • These four normal forms are abbreviated 1NF, 2NF, 3NF, and BCNF, respectively. • There are also higher normal forms, 4th, 5th, and domain key normal forms (4NF, 5NF, DKNF).
The normal forms have to do with finding dependencies in tables which spring from fields other than the primary key. • These dependencies are undesirable and may be referred to as stray dependencies. • The normal forms make increasingly strict statements about the kinds of stray dependencies that have to be eliminated from correctly designed tables. • Designs containing stray dependencies are said to violate the normal forms.
Eliminating Dependencies • The design process using normal forms consists of repetitive steps: • Make a design • Identify stray dependencies (normal form violations) • Redesign to eliminate the dependencies • Once you’ve eliminated all occurrences of one type of violation, you will have promoted the design into the next higher normal form • Repeat until you’ve reached the highest normal form
The rule of thumb at every stage is to remove stray dependencies in the following way: • Make any field which determines other fields the primary key of a new table, and move the fields that depend on that field to the new table. • Make sure that the new table is connected to the old table by a primary key, foreign key pair.
Anomalies • Design problems that are based on violations of normal forms lead to what are called anomalies. • The hallmark of a problematic design is that the same information is stored multiple times. • In other words, there is redundancy in the database. • Depending on the nature of the redundancy, this can lead to problems when inserting data, when updating data, and when deleting data.
Justifying Normal Forms • The use of normal forms may seem unnecessarily theoretical at first. • However, they provide a convenient way of identifying problems in designs and then eliminating them. • Normal forms are what justify these claims about relation databases: • There is a step-by-step way of arriving at a correct design • There is a way of detecting flaws in a design
The Plan of Action for the Following Sections • Each of the following sections will present a normal form in this way: • A definition of the normal form will be given. • A scenario for information to be held in a database will be given, with the underlying assumptions given. • An example database design which violates the normal form will be given
The violation will be shown using a diagram with the notation indicating functional dependencies. • The desired functional dependencies from the primary key will be shown using arrows below the field names. • Undesired, stray dependencies, which need to be eliminated in order to correct the design, will be shown using arrows above the field names.
Anomalies resulting from the incorrect design will be discussed. • In general, there will be insert, update, and delete anomalies • Finally, a corrected design will be given.
Basis for Examples • All of the examples will be based on the general topic of cars, salespeople, customers, and car sales. • Some of the field names are abbreviated, and some of the fields clearly belong together in some way. • Here is a little preliminary explanation regarding the fields that will be in the examples. • Not all of the fields will appear in all of the examples.
vin: vehicle identification number. Vehicles have makes, models, and years. • spno, spname: Salesperson number and name. • custno, custname: Customer number and name. • A car sale has a salesprice and a date.
2. First Normal Form • 1NF Definition: • Formally (Watson): • A relation is in first normal form if and only if all columns are single-valued. • Informally: • Data is stored in flat files; there can be no repeating groups in a record. • (This was mentioned in the very first unit.)
The assumptions underlying the design are that a salesperson can sell many cars, but each car can only be sold by one salesperson. • In this design, each car is only sold once, so the design captures information about the sales of new cars. • These assumptions don’t cause the problem. • It is the implementation of them that causes the problem.
Here is the design that violates 1NF: • Carsale(spno, spname, {vin, salesprice}) • The example design uses {} notation to indicate repeating groups of fields. • A diagram with arrows illustrating this design is given on the next overhead
The repeating group alone is a sufficient problem to make this kind of design incorrect. • However, the design does have anomalies • These foreshadow the kinds of anomalies that all violations entail.
Insert: You can’t store information about a car which hasn’t been sold. • Update: To have an update anomaly, the assumptions would have to be changed. • If a car could be sold more than once, information about it would appear in more than one row of the table • An update to the car information (its vin) would require updating multiple rows
Delete: The deletion anomaly is the mirror image of the insertion anomaly. • If the record of a car sale is deleted, then information about the car is lost, as well as information about the sale.
The solution to all basic normal form violations is the same: • Break out the stray dependency out into a separate table • In this case, break out the information contained in the repeating group
As stated in the assumptions, one salesperson can sell many cars, but each car is sold only once, so there is a 1-m relationship between the two tables in the resulting design. • The primary key of the table containing salesperson information will have to be embedded as a foreign key in the table containing car information.
Here is the corrected design: • Salesperson(spno, spname) • Carsale(vin, salesprice, spnof.k.)
3. Second Normal Form • 2NF Definition: • Formally (Watson): • A relation is in second normal form if an only if it is in first normal form, and all nonkey columns are dependent on the key • Informally: • In a table with a concatenated primary key field, there can be no stray dependencies that originate in just part of the primary key field.
The basic idea is that all nonkey fields have to depend on the whole key. • Stating it in this way will lead to a useful mnemonic device which will be given later. • When you lay it out in this way, you begin to realize that 2NF deals with tables that have concatenated key fields, where a dependency from only one field of the key might be possible.
In this example the underlying assumptions are that the same car can come back to the lot and be sold more than once. • It can be sold by the same salesperson more than once, but not on the same day. • It can also be sold by different salespeople at different times. • Although unlikely, the design is made so that two different salespeople could sell the same car on the same date.
The design doesn’t contain any information about customers, but the scenario would be that one customer brought the car back, and a different salesperson sold it again. • It seems unlikely that the same customer would buy the same car twice, whether on the same date or different dates.
In summary, this design works for used car sales and both the date and the salesperson information are needed, along with the car information, to distinguish between different sales.
Here is the design that violates 2NF: • Carsale(vin, spno, date, spname) • A diagram with arrows illustrating this is given on the next overhead
This faulty design has insert, update, and delete anomalies. • Suppose a salesperson has not yet sold a car. • In this case, it is not possible to insert information about that salesperson. • On the other hand, a salesperson may make many sales. • This means that the same information about that salesperson would be stored in more than one record in the table. • This is redundancy.
Not only is the redundancy itself wasteful, it leads to the update anomaly. • Suppose the salesperson’s name changes. • Then it’s necessary to update multiple records to reflect this fact, not just one.
The delete anomaly is related to the insert anomaly. • Suppose that as part of the maintenance of the database, on a yearly basis the sales table is cleared. • When you delete the last record containing a sale by a particular salesperson, you not only get rid of the sales record, you also lose the salesperson’s name.