CS 440 Database Management Systems

CS 440 Database Management Systems RDBMS Architecture and Data Storage

Announcements • Normal form and FD practice session on Feb 4th in the class. • Assignment 1 due on Feb 7th • Submission through TEACH. • Project progress report due on Feb 11th • 1 – 2 pages of status report • Submission through TEACH

Database Implementation User Requirements SQL Physical Storage Conceptual Design Schema Data Relational Model Entity Relationship(ER) Model Files and Indexes

The big advantage of RDBMS • It separates logical level (schema) from physical level (implementation). • Physical data independence • Users do not worry about how their data is stored and processes on the physical devices. • It is all SQL! • Their queries work over (almost) all RDBMS deployments.

Issues in logical level • Data models • Relational, XML , … • Query language • Data quality • normalization • Usability • ...

Issues on physical level • Processor: 100 – 1000 MIPS • Main memory: 1μs – 1 ns • Secondary storage: higher capacity and durability • Disk random access : Seek time + rotational latency + transfer time • Seek time: 4 ms - 15 ms! • Rotational latency: 2 ms – 7 ms! • Transfer time: around 1000 Mb/ sec • Read, write in blocks.

Storage capacity versus access time offline tape nearline tape & optical disks 1015 1013 magnetic optical disks 1011 electronic secondary online tape 109 typical capacity (bytes) electronic main 107 from Gray & Reuter updated in 2002 105 cache 103 103 10-9 10-6 10-3 10-0 access time (sec)

Storage cost versus access time from Gray & Reuter 104 cache electronic main online tape 102 electronic secondary magnetic optical disks nearline tape & optical disks dollars/MB 100 10-2 offline tape 10-4 103 10-9 10-6 10-3 10-0 access time (sec)

Gloomy future: Moor’s law • Speed of processors and cost and maximum capacity of storage increase exponentially over time. • But storage (main and secondary) access time grows much more slowly. • This is why managing and analyzing big data is hard.

Issues in physical level Three things are important in the database systems: performance, performance, and performance! ( Bruce Lindsay, co-creator of System R)

Issues in physical level • Other things also matter • Reliability when it comes to transactions. • … • But performance is still a big deal.

Is it easy to achieve good performance? • Let’s build an RDBMS. • It supports core SQL. • No stored procedure for this version!

Storing Data • Store each relation in an ASCII file: Person (SSN, Name, Age) person.txt: 111222333 - John - 24 444222111 - Charles - 43

Storing Data • Store schema information in a catalogue relation: Catalogue (AttrName, Type, RelName, Position) catalogue.txt: SSN - String – Person - 1 Name - String - Person - 2 Age – Integer – Person - 3

SQL Support • SQL compiler • Like any other computer language compiler. SELECT SSN FROM Person; SSN 111222333 444222111

Query Execution: Selection • Find the selection attribute position from the catalogue. • Scan the file that contains the relation. • Show the tuples that satisfy the condition. SELECT * FROM Person WHERE SSN = 111222333;

Query Execution: Join • Read the catalogue to find the info on join attributes. • Read the first relation, for each tuple: • Read the second relation, for each tuple: • Assemble the join tuple • Output if they satisfy the condition. SELECT * FROM Person, PersonAddr WHERE Person.SSN = PersonAddr.SSN and Person.SSN = 111222333;

Performance Issues: Storing Data • Update John to Sheldon • Rewrite the whole file very slow • Type conversion slow • Delete the tuple with SSN of 111222333. Person (SSN, Name, Age) person.txt: 111222333 - John - 24 444222111 - Charles - 43

Performance Issues: Selection • We have to scan the whole relation to select some tuples  very slow • We can use an index to find the tuples much fasters. SELECT * FROM Person WHERE SSN = 111222333;

Performance Issues: Selection • Read tuples one by one • Much faster if we read a whole bunch of them together: caching SELECT * FROM Person WHERE SSN = 111222333;

Performance Issues: Join • Quadratic I/O access • Very slow for large relations SELECT * FROM Person, PersonAddr WHERE Person.SSN = PersonAddr.SSN and Person.SSN = 111222333;

Performance Issues: Query Execution • Two ways of executing the query • First join, then select • First select, then join much faster • Query (execution) optimization. SELECT * FROM Person, PersonAddr WHERE Person.SSN = PersonAddr.SSN and Person.SSN = 111222333;

Reliability • Update the name in person • Power outage  is the operation done? • Disk crash Update Person SET Name = Smith WHERE Person.SSN = 111222333;

Probably not that many people download our RDBMS  • Let’s redesign the components of our RDBMS

Database Implementation Data storage User Requirements Physical Storage Conceptual Design Schema Relational Model Entity Relationship(ER) Model Files and Indexes

Random access versus sequential access • Disk random access : Seek time + rotational latency + transfer time. • Disk sequential access: reading blocks next to each other • No seek time or rotational latency • Much faster than random access

Units of data on physical device • Fields: data items • Records • Blocks • Files

Fields • Fixed size • Integer, Boolean, … • Variable length • Varchar, … • Null terminated • Size at the beginning of the string

Records: Sets of Fields • Schema • Number of fields, types of fields, order, … • Fixed format and length • Record holds only the data items • Variable format and length • Record holds fields and their size, type, … information • Range of formats in between

Record Header • Pointer to the record schema ( record type) • Record size • Timestamp • Other info …

Blocks • Collection of records • Reduces number of I/O access • Different from OS blocks • Why should RDBMS manage its own blocks? • It knows the access pattern better than OS. • Separating records in a block • Fixed size records: no worry! • Markers between records • Keep record size information in records or block header.

Spanned versus un-spanned • Unspanned • Each records belongs to only one block • Spanned • Records may store across multiple records • Saves space • The only way to deal with large records and fields: blob, image, …

Heap versus Sorted Files • Heap files • There is not any order in the file • New blocks (records) are inserted at the end of file. • Sorted files • Order blocks (and records) based on some key. • Physically contiguous or using links to the next blocks.

Average Cost of Data Operations • Insertion • Heap files are more efficient. • Overflow areas for sorted files. • Search for a record • Sorted files are more efficient. • Search for a range of records • Sorted files are more efficient. • Deletion • Heap files are more efficient • Although we find the record faster in the sorted file.

Indirection • The address of a record on the disk • Physical address • Device ID, Cylinder #, Track #, … • Map physical addresses to logical addresses • Flexible in moving records for insertion and deletion • Costly lookup • Many options in between, tradeoff Physical address on disk Logical address

Block Header • Data about block • File, relation, DB IDs • Block ID and type • Record directory • Pointer to free space • Timestamp • Other info …

Row and Column Stores • We have talked about row store • All fields of a record are stored together.

Row and Column Stores • We can store the fields in columns. • We can store SSNs implicitly.

Row versus column store • Column store • Compact storage • Faster reads on data analysis and mining operations • Row store • Faster writes • Faster reads for record access (OLTP) • Further reading • Mike Stonebreaker, et al, “C-Store, a column oriented DBMS”, VLDB’05.

CS 440 Database Management Systems