260 likes | 425 Views
Database Systems 236363. Introduction. Bureaucratic Info (1/2). Lecturer: Prof. Roy Friedman Office: Taub 605 E-mail: roy@cs.technion.ac.il TA in Charge: Omer Katz Office: Taub 603 E-mail: omerkatz@cs.technion.ac.il TA: Ella Bolshinski Office: Taub 315
E N D
Database Systems236363 Introduction
Bureaucratic Info (1/2) • Lecturer: Prof. Roy Friedman • Office: Taub 605 • E-mail: roy@cs.technion.ac.il • TA in Charge: Omer Katz • Office: Taub 603 • E-mail: omerkatz@cs.technion.ac.il • TA: Ella Bolshinski • Office: Taub 315 • E-mail: ellabo@cs.technion.ac.il
Bureaucratic Info (2/2) • Web site: http://webcourse.cs.technion.ac.il/236363/ • Home assignments: 3 dry, 1 wet • Grade structure: • 80% final exam, 20% home assignments (takef) • 8% for the wet assignment, 4% each dry one • Home assignments are mandatory – no final grade without a pass grade in each home assignment • Those who repeat the course must submit all home assignments
Topics • Introduction • Entity Relationship Diagrams (ERD) • Relational Algebra • Relational Calculus • Database Design Theory • Functional Dependencies • Schemas Decomposition and Normal Forms • XML and the Query Language Xpath • Advanced topics or datalog – time permitting • In the recitation: The SQL Query Language In general, this course is about using DBs rather than implementation details
What’s New This Semester • The lecture slides are in English • Based on the slides of Eldar Fischer and Johann Makowsky • Some change in material • We will not study Xquery • We might not study Datalog • Wet home assignment might be in Java • To be determined next week
Let the Fun Begin… • Database • A (persistent) collection of data • Often with some logical structure • Examples include, e.g., bank accounts, students listed to courses and their grades, geographical data used by a map/navigation service, customers of an online web-site, dentist’s patients • A Query Language • Enables querying and manipulating the database • Examples include, e.g., Structured Query Language (SQL), Datalog, Cassandra Query Language (CQL) • Database Management System (DBMS) • The system that manages the database and supports the execution of queries on the database
Why Do We Need a DBMS? • After all, we have operating systems and file systems… • DBMS provides a data oriented abstraction for manipulating the data • Enables direct manipulation of the data without worrying about storage and execution issues • Frees the programmer from worrying about many low level details such as: • Serializing and de-serializing the data to the storage • Organizing the data in the storage and masking storage latencies and inefficiencies • How to enable concurrent access to the database? • What to do when the database is larger than physical memory? • Split/cached database operation in combined mobile/cloud
Data Model • The data model defines the framework for how data is represented • For example, in the relational model, data is represented as tables (relations) • The query language enables extracting data from the database according to the given data model • For example, SQL can express queries on tables and the results are tables as well
DBMS Functionality • Storage management • Query processor • Optimizes query processing for efficient information retrieval and query execution • Concurrency control and recovery • Data integrity • E.g., that an ID number is a unique 9 digit number • Security • Access control, authentication and encryption
Data Representation Independence • To obtain independence between the data model and the physical storage, we separate between three levels • User view, logical level, and physical level View Independence of logical layout beyond this level Data is organized here according to the data model (relational) Logical Independence of physical layout beyond this level Physical
Upper Database Layer • In the end-user layer, each user is provided with a (potentially partial) view that may be different from the actual data layout
Operations on a Database • Database structure definition (following an analysis of the application’s needs): Includes the logical structures for representing the data and their relations • Data Definition Language (DDL) • Query execution to retrieve data from the database • Data manipulation: adding, deleting, and updating • Data Manipulation Language (DML) • Administrative operations: • Defining views, indexes, etc.
Database Administrator (DBA) • Responsible for • Planning the logical database layout and adapting it to the physical layer • Security and access control • Recovery management (after failures) • Performance fine tuning
Data Models (1/2) • Relational data model • Data is represented using tables • Correspondence between tables is obtained by using same values in columns with the same name • The main focus of this course. • Entity Relation • A tool for analyzing the requirements of a database and designing its schema • This model is an abstract one and has no actual direct implementation
Data Models (2/2) • Object Oriented data model • A model in which the data is represented as objects similarly to what is done in OOP • ERD can be mapped to OO • Semi-structured data model • Data is represented as a graph (independent of the physical layout) • Extensible Markup Language (XML) • A specific instance of the semi-structured model in which the graph is a tree
The Relational Data Model • We wish to represent a collection of objects of a given type, each characterized by a fixed set of properties • For example, student’s name, date of birth, department • How would we do this in C or Java? • In the relational data model, we maintain these objects in a table • Each row is used to store one object • Each column represents one of the properties • The table is a logical structure • Might be physically stored in a completely different manner • Each row must be different than each other row in at least one attribute • The table represents a set rather than a multi-set
Relational Model: Terminology Schema (title of table) An attribute (column name) A relation (the entire table) A record (an entire row)
A Formal Definition • For a given set of attributes A1,…,Anand a set of corresponding domains D1,…,Dn(each domain is a set of values) • Denote R(A1,…,An) the relational schema that contains the attributes A1,…,An • A relation r over R is a subset of the domain product r ⊆ D1xD2x…xDn
The Formal Definition - Visualized • Instead of viewing each record as a line in a table, it is viewed as a point in the space of all possible value assignments. The relation is a finite subset of “all possible records”. John Doe, 01/01/1990 Name Jane Roe, 07/07/1992 Birth Date
Keys • A superkey of a relation r is a subset of attributes of r’s schema such that specific values of these attributes identify a single record in r. In other words, there are no two records in r whose values in all attributes of the superkey are the same. • A relation may have several superkeys. A superkey is called minimal if none of its subsets is a superkey. Such a key is also called a candidate key. • One of the superkeys can be selected as the primary key. The primary key is used to identify a row in the implementation of the database. Yet, in PostgreSQL, when no key is defined, a table can include multiple records with the same values in all attributes
Simple Databases • In the simplest case, all objects of interest are of the same “kind”, meaning that they all have the same attributes list – they are only distinguished by their specific attributes values • For example, the list of songs on my computer – each such object is characterized by the name of the song, the format (mp3/wma/…), playing time, and size • In these cases, all objects are organized in a single table
More Involved Databases • Suppose we wish to design a database for the faculty’s administrative assistants. Here, we can identify at least two types of entities: • Students: student name, id, address • Courses: course name, catalogue number, lecturer • If a student is registered to a given course, we should be able to know about it and be able to retrieve the student’s final grade • Hence, each student’s participation in a given course should be recorded somewhere • The question is how to organize the database for this?
Possible Organization • A simple option: Have a single large table – for each student’s registration, we will hold the student’s name, id, address, course name, catalogue number, lecturer, and final grade • Drawbacks: • Redundancy: Why should the student’s address be stored in each course she takes? • Inadequacy: How can we maintain the details of a student that does not take any course? • Difficult to update: If a student changes his address, we will need to update all records of all courses he is registered to. This is both expensive and a source of inconsistency
Another Option • We can maintain one table for students, one for courses, and one for registration • The registration table schema can include a primary key for the students (e.g., student id) and a primary key for the courses (e.g., catalogue number) as well as the final grade • Now each student’s data is independent of each course’s data and vice versa
Another Option Students Courses Registrations
Life is Full of Difficult Choices • What if we wish to retrieve the names of all lecturers who taught a given student? • If in the registrations table we only maintain the course’ catalogue number, this query will require a long time to compute • Further, if the lecturers change over time, the reply will not be accurate • How do we fix this? • Should we add a lecturer attribute to the registrations table? • Should we define a new “lecturers” table and corresponding relations? • The database organizational design choices are not trivial – this is the subject of much of this course