1 / 135

Relational Database Internals

Relational Database Internals. Alex Scotti Bloomberg LP. Outline of talk. History - origins and background Internals - theory and practice Internals - brief discussion of real systems Future - observations, trends, predictions. History.

kezia
Download Presentation

Relational Database Internals

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Relational Database Internals Alex Scotti Bloomberg LP

  2. Outline of talk • History - origins and background • Internals - theory and practice • Internals - brief discussion of real systems • Future - observations, trends, predictions

  3. History • The early database systems differed from the relational ones in 2 main regards • Data model • Transactional semantics • We’ll be more heavily focused on the transactional issues than the data modeling issues in this talk

  4. Pre Relational Systems • Hierarchical Data Models • IMS • Network Data Models • CODASYL

  5. IMS • Each record was typed by a “record type” (think “table”) • Relationships between records are represented as trees (hierarchies) between records linked by their “keys” • Writing a “query” consisted of writing a program to navigate through these links, traversing records until the right one was found.

  6. Data types available were SEQUENTIAL, HASH, TREE • Each acted differently - A program written to use a tree could not have the data structure changed out from under it • lack of PHYSICAL INDEPENDENCE

  7. CODASYL • A more complex “evolution” of the IMS idea, standardized by ANSI, implemented by several vendors • Honeywell, DEC, Univac • Idea is instead of pointers forming a strict hierarchy, they now form an arbitrarily complex “network.” • Able to represent graphs • Even HARDER to program with than IMS

  8. In 1970, Ted Codd wrote the foundational paper “A Relational Model if Data for Large Shared Data Banks” • Codd was primarily a mathematician, not particularly concerned with transaction processing • However, the two problems were incredibly tightly coupled • Work at IBM began on “relational databases” • Including locking, logging, all sorts of things that became the core of an RDBMS

  9. Codd’s insight • A database is nothing more than a “fact store” from which it should be possible to logically infer “new facts.” • Simple but amazingly powerful. • If the goal is to store facts, then there is no benefit from storing the same fact multiple times or in multiple forms. A fact does not become MORE TRUE by repeating it. The basis of “normalization.”

  10. Codd’s insight • If the system knows enough about the data it is storing, you can ASK IT QUESTIONS rather than TELL IT WHAT TO DO. • Declarative vs Procedural programming model • AKA, The nail in the coffin for all the Pre Relational systems • Just a matter of time - If a system is easier to use, and performs fine, why wouldn’t you use it?

  11. Codd’s terminology has (for the most part) become replaced by the SQL terminology, which we’ll be using throughout the rest of this talk • Generalized simplification • “Logically organize your data into tables”

  12. Going further, Codd defined 12 “rules” that he hoped to define what it meant to be “relational” • Key points are • All information is represented in tables • Nulls must be uniformly handled by all datatypes • Physical representation must be abstracted from logical representation • Physical location and distribution of data must be invisible to users

  13. Key points are • “Set based” operations for insert / update / delete • Integrity constraints must be enforceable by the database system • There must be no way AROUND the set of enforced constraints • There must be support for at least 1 “relational language”

  14. Codd’s work became the basis for a “next generation” database product at IBM called System R. • System R was treated as a “production proof of concept.” At the end of the project there were several commercial customers. • Around the same time, work was going on at UC Berkeley on the “Ingres” system, also based on Codd’s idea.

  15. Neither system was successful at commercializing a general purpose database. • The award goes to Oracle. • Oracle shipped a working commercial RDBMS to anyone who would pay before IBM. • Based also on Codd’s work. • No common code between System R, Ingres, and Oracle - 3 unique lineages all based on the same idea

  16. IBM evolved the System R “prototype” into their second system : DB2. • Ingres went on to be the basis of numerous successful commercial products • Sybase was based on Ingres code • Informix contains Ingres code (through Illustra) • MSSQL contained Ingres code (through Sybase)

  17. Newer systems - all inspired by the same ideas and following the same principles, but without direct code sharing • Mysql • SQLite • PostgreSQL

  18. Internals • Buffer Pool • Log • Concurrency Control • Btrees • Relational Layer

  19. Buffer Pool • Often known as “the cache” • A page/block oriented data structure • A page in the pool conceptually “maps” to a block on a disk. (not really always true) • Needs to interface with the systems BELOW and the systems ABOVE. • Below - Disks, File systems • Above - Btrees

  20. Both page/block oriented interfaces above and below. • Conceptually, very similar to the VM subsystem of any modern UNIX • “demand paging” • Eviction policy based on LRU approximations, often with more “smarts” than VM. • Higher levels of the system often can pass down “hints” about intended access patterns all the way to the buffer pool.

  21. Buffer Pool - Why? • Whats the story? it’s a cache, we get it. • Much more than that going on here!

  22. Basics of transaction management begins with the buffer pool and the policies and protocols enforced there • Terminology • “pinned” - a page that cannot be evicted • “dirty” - a page that contains data that DOES NOT match the data on the disk • “clean” - the opposite

  23. A dirty page BECOMES a clean page when the data in that page is DURABLY written to the disk • Can we really just write a page to the disk? Not really, it usually involves logging protocols - wait for the next section!

  24. More terminology • “forcing” - when a transaction commits, it’s dirty pages are FORCED to durable storage before considering the commit complete • “stealing” - A dirty page which is a part of an UNCOMMITTED transaction can be written to the disk in an effort to produce usable space in the buffer pool

  25. What is the simplest? • FORCE / NOSTEAL • What is the highest performing and most powerful? • NOFORCE / STEAL • Not surprisingly, most real world systems today implement a NOFORCE / STEAL buffer pool policy • Support for this policy requires logging

  26. More terminology • OVERWRITE / NO OVERWRITE • Whether or not the buffer pool will write changes to a page ON TOP of an existing page, or leave the existing page alone and write to a NEW page. • OVERWRITE systems are higher performing • most real world systems implement an OVERWRITE buffer pool. • NO OVERWRITE example: System R, shadow paging

  27. How does data actually get written to the disk? • The “clients” of the buffer pool (the layers above) never concern themselves with writing data. They work at a layer of abstraction where they “get buffer” and “dirty” them. • Pages get written out (cleaned!) as part of a background process. • Goal is to keep some portion of the buffer pool clean.

  28. Why are we trying to keep writing out these pages to disk in the background? • To make the system more reliable? • NO! Completely unrelated. Reliability ensured through other means • To make sure that a READ doesn’t become a WRITE! • Need a page? Cant get one, all dirty. • You get to “clean one” (write it) now!

  29. Logging • Basic Idea behind logging • Before you do something, write down what it is you intend to do. • Sounds slow. Why bother with this, just DO IT! • Nope - The opposite is true. Logging can make things quicker

  30. The highest performing buffer pool policy of NOFORCE/STEAL actually REQUIRES logging • Without logging you would compromise with a lower performing policy • Logging has the capacity to perform “magic” • Converts RANDOM (slow) I/O into SEQUENTIAL (fast) I/O! • We’ll come back to this idea

  31. Expanding on the basic idea of logging • Theres really two distinct things that you are “writing down” here • Write down what it is you are about to do: REDO logging - can “do it over” • Write down the procedure to follow to make it as if what you did NEVER HAPPENED: UNDO logging • Many times both of these pieces of information are embedded into single “log record” Or not. Conceptually 2 things.

  32. Mechanics of logging - What’s the data structure? • In it’s basic form, a log is a simple sequential file. Conceptually it’s not unlike a tape drive. • Each “record” in the log is identified by a unique identifier, which is typically just the physical location of the record in the file. • Call this the Log Sequence Number (LSN)

  33. “Log Buffer” - exactly what it sounds like - a buffer of memory in front of the log. • An obvious and common optimization to make it less expensive to “write to the log” • Recoverability is endangered unless the log exposes an interface to FLUSH THE BUFFER. (and it gets called at the right places) • All real systems work this way

  34. Subsystems are said to “generate log records” (calling APIs provided by log subsystem) • Buffer pool may need to log the allocation of a new page • Btree may need to log a page split • Relational layer may log an INSERT statement • Customers of this subsystem all over the database

  35. 2 approaches to logging • “Physical logging” • “Logical logging”

  36. Physical logging • Log entire page images • “redo record” : “log what the page is GOING TO look like” • “undo record” : “log what the page LOOKS LIKE NOW” • Problems? • Inefficient, expensive • Poor concurrency

  37. Problems • Inefficiency mess • Why log 2 copies of a page when I only changed a few bytes? • Concurrency mess • Systems with concurrency control at a finer granularity than the page cannot log this way. We’ll come back to that.

  38. On the other hand • Physical logging is appealing because it is simple, and it works because of a nice property of being “testable” • We can look at a log record ABOUT a page, then look at the page, and determine which state it’s in because we RECORDED the two possible states • This turns out to be an essential property of reovery

  39. Logical logging • Log the high level operations only • SQL • INSERT INTO TBL(A) VALUES(1) • REDO • “INSERT INTO TBL(A) VALUES(1)” • UNDO • “DELETE FROM TBL WHERE A = 1”

  40. Elegant! • Simple! • Compact! • but it doesn’t WORK! • That SQL INSERT could decompose into dozens of page writes. • Some may have been done, then crash. You can’t look at thee pages and tell which ones were done (UNDO THEM) and which ones weren’t

  41. NO FORCE allows us to mark a transaction “committed” WITHOUT writing all of the pages. • Some may have been written, then crash • We can’t tell which one WERE NOT written (REDO them) and which ones WERE (leave them alone) • It’s often UNSAFE to perform actions multiple times

  42. Making logical logging work - “Physiological Logging” • “Physical ABOUT pages, logical ABOUT the contents INSIDE the page” • The idea is to keep the logging centered on the idea of pages, which works well • But log less information than a physical scheme would require

  43. Example Physiological Operation • “Add item X to page N” • Push down the logical concept into the page level - logical INSIDE the page • SQL INSERT statement will decompose into several independent physiological operations • Each one is INDEPENDENTLY TESTABLE / UNDOABLE / REDOABLE • AKA, “it works”

  44. Logging for purposes of recovery • Key technique is based on something called the “pagelsn” • Intertwining of the buffer pool and the logger • Each time you modify a page, store the LSN of the log record describing that modification ON THE PAGE ITSELF • Testability • Look at the pagelsn to determine state

  45. Write Ahead Logging (WAL) Protocol • Tightly integrated with buffer pool • Before a dirty page is written to disk, the UNDO information for that page must be durable • Before a transaction is considered committed, the REDO information for that transaction’s pages must be durable • And that’s how a NO FORCE / STEAL system can convert random I/O into sequential

  46. Basic idea behind recovery after crash • REDO all COMMITTED transactions • Some pages MAY NOT be written • as allowed by NO FORCE • UNDO all UNCOMMITTED transactions • Some pages MAY HAVE BEEN written • as allowed by STEAL

  47. Concurrency Control • Lets talk about ACID now (finally?) • We’ll use Chris Date’s definition

More Related