400 likes | 505 Views
CS 440 Database Management Systems. RDBMS Architecture and Data Storage. Announcements. Normal form and FD practice session on Feb 4 th in the class. Assignment 1 due on Feb 7 th Submission through TEACH. Project progress report due on Feb 11 th 1 – 2 pages of status report
E N D
CS 440 Database Management Systems RDBMS Architecture and Data Storage
Announcements • Normal form and FD practice session on Feb 4th in the class. • Assignment 1 due on Feb 7th • Submission through TEACH. • Project progress report due on Feb 11th • 1 – 2 pages of status report • Submission through TEACH
Database Implementation User Requirements SQL Physical Storage Conceptual Design Schema Data Relational Model Entity Relationship(ER) Model Files and Indexes
The big advantage of RDBMS • It separates logical level (schema) from physical level (implementation). • Physical data independence • Users do not worry about how their data is stored and processes on the physical devices. • It is all SQL! • Their queries work over (almost) all RDBMS deployments.
Issues in logical level • Data models • Relational, XML , … • Query language • Data quality • normalization • Usability • ...
Issues on physical level • Processor: 100 – 1000 MIPS • Main memory: 1μs – 1 ns • Secondary storage: higher capacity and durability • Disk random access : Seek time + rotational latency + transfer time • Seek time: 4 ms - 15 ms! • Rotational latency: 2 ms – 7 ms! • Transfer time: around 1000 Mb/ sec • Read, write in blocks.
Storage capacity versus access time offline tape nearline tape & optical disks 1015 1013 magnetic optical disks 1011 electronic secondary online tape 109 typical capacity (bytes) electronic main 107 from Gray & Reuter updated in 2002 105 cache 103 103 10-9 10-6 10-3 10-0 access time (sec)
Storage cost versus access time from Gray & Reuter 104 cache electronic main online tape 102 electronic secondary magnetic optical disks nearline tape & optical disks dollars/MB 100 10-2 offline tape 10-4 103 10-9 10-6 10-3 10-0 access time (sec)
Gloomy future: Moor’s law • Speed of processors and cost and maximum capacity of storage increase exponentially over time. • But storage (main and secondary) access time grows much more slowly. • This is why managing and analyzing big data is hard.
Issues in physical level Three things are important in the database systems: performance, performance, and performance! ( Bruce Lindsay, co-creator of System R)
Issues in physical level • Other things also matter • Reliability when it comes to transactions. • … • But performance is still a big deal.
Is it easy to achieve good performance? • Let’s build an RDBMS. • It supports core SQL. • No stored procedure for this version!
Storing Data • Store each relation in an ASCII file: Person (SSN, Name, Age) person.txt: 111222333 - John - 24 444222111 - Charles - 43
Storing Data • Store schema information in a catalogue relation: Catalogue (AttrName, Type, RelName, Position) catalogue.txt: SSN - String – Person - 1 Name - String - Person - 2 Age – Integer – Person - 3
SQL Support • SQL compiler • Like any other computer language compiler. SELECT SSN FROM Person; SSN 111222333 444222111
Query Execution: Selection • Find the selection attribute position from the catalogue. • Scan the file that contains the relation. • Show the tuples that satisfy the condition. SELECT * FROM Person WHERE SSN = 111222333;
Query Execution: Join • Read the catalogue to find the info on join attributes. • Read the first relation, for each tuple: • Read the second relation, for each tuple: • Assemble the join tuple • Output if they satisfy the condition. SELECT * FROM Person, PersonAddr WHERE Person.SSN = PersonAddr.SSN and Person.SSN = 111222333;
Performance Issues: Storing Data • Update John to Sheldon • Rewrite the whole file very slow • Type conversion slow • Delete the tuple with SSN of 111222333. Person (SSN, Name, Age) person.txt: 111222333 - John - 24 444222111 - Charles - 43
Performance Issues: Selection • We have to scan the whole relation to select some tuples very slow • We can use an index to find the tuples much fasters. SELECT * FROM Person WHERE SSN = 111222333;
Performance Issues: Selection • Read tuples one by one • Much faster if we read a whole bunch of them together: caching SELECT * FROM Person WHERE SSN = 111222333;
Performance Issues: Join • Quadratic I/O access • Very slow for large relations SELECT * FROM Person, PersonAddr WHERE Person.SSN = PersonAddr.SSN and Person.SSN = 111222333;
Performance Issues: Query Execution • Two ways of executing the query • First join, then select • First select, then join much faster • Query (execution) optimization. SELECT * FROM Person, PersonAddr WHERE Person.SSN = PersonAddr.SSN and Person.SSN = 111222333;
Reliability • Update the name in person • Power outage is the operation done? • Disk crash Update Person SET Name = Smith WHERE Person.SSN = 111222333;
Probably not that many people download our RDBMS • Let’s redesign the components of our RDBMS
Database Implementation Data storage User Requirements Physical Storage Conceptual Design Schema Relational Model Entity Relationship(ER) Model Files and Indexes
Random access versus sequential access • Disk random access : Seek time + rotational latency + transfer time. • Disk sequential access: reading blocks next to each other • No seek time or rotational latency • Much faster than random access
Units of data on physical device • Fields: data items • Records • Blocks • Files
Fields • Fixed size • Integer, Boolean, … • Variable length • Varchar, … • Null terminated • Size at the beginning of the string
Records: Sets of Fields • Schema • Number of fields, types of fields, order, … • Fixed format and length • Record holds only the data items • Variable format and length • Record holds fields and their size, type, … information • Range of formats in between
Record Header • Pointer to the record schema ( record type) • Record size • Timestamp • Other info …
Blocks • Collection of records • Reduces number of I/O access • Different from OS blocks • Why should RDBMS manage its own blocks? • It knows the access pattern better than OS. • Separating records in a block • Fixed size records: no worry! • Markers between records • Keep record size information in records or block header.
Spanned versus un-spanned • Unspanned • Each records belongs to only one block • Spanned • Records may store across multiple records • Saves space • The only way to deal with large records and fields: blob, image, …
Heap versus Sorted Files • Heap files • There is not any order in the file • New blocks (records) are inserted at the end of file. • Sorted files • Order blocks (and records) based on some key. • Physically contiguous or using links to the next blocks.
Average Cost of Data Operations • Insertion • Heap files are more efficient. • Overflow areas for sorted files. • Search for a record • Sorted files are more efficient. • Search for a range of records • Sorted files are more efficient. • Deletion • Heap files are more efficient • Although we find the record faster in the sorted file.
Indirection • The address of a record on the disk • Physical address • Device ID, Cylinder #, Track #, … • Map physical addresses to logical addresses • Flexible in moving records for insertion and deletion • Costly lookup • Many options in between, tradeoff Physical address on disk Logical address
Block Header • Data about block • File, relation, DB IDs • Block ID and type • Record directory • Pointer to free space • Timestamp • Other info …
Row and Column Stores • We have talked about row store • All fields of a record are stored together.
Row and Column Stores • We can store the fields in columns. • We can store SSNs implicitly.
Row versus column store • Column store • Compact storage • Faster reads on data analysis and mining operations • Row store • Faster writes • Faster reads for record access (OLTP) • Further reading • Mike Stonebreaker, et al, “C-Store, a column oriented DBMS”, VLDB’05.