
Chapter 5 – Managing Files of Records


hila


  1. Chapter 5 – Managing Files of Records

  2. What’s Up for This Chapter? • This Chapter’s Material • Accessing records in files • Record structures for access • File access methods vs. file organizations • Some real-world examples of file structures • File portability issues

  3. The Central Problem • Locating Stored Data • Once the data has been stored into a file, how do you find it to retrieve it? • What does “find the data” even mean? • How do you decide what you want to find? • How do you look for it? • What if it’s not there? • What if something very much like it is there? • What if there are lots of “it” there? • And, of course, there are efficiency considerations • How fast is your search algorithm? • What would you have to do to the file to use a faster one? • Which will you do more often, add records or find them? • Bringing you back to the design of the file itself

  4. Record Keys • What Is a Key? • Data stored in a record by which you look for the record • Can be one field or a set of fields • Examples – { name } or { last_name + first_name } • Two Types of Keys • Primary key • Key value, unique in entire file, by which an individual record can be located or determined to be absent • Secondary key • Key value by which one or more records can be located

  5. Primary Keys • Required Characteristics • Unique across the entire file • Can never have 2 records with same primary key • Error to try to add record with duplicate primary key • In “canonical” form • Format precisely known, so search candidates can be brought into that same format before the search • Example – words (names, etc.) in all upper-case • Not often used any more: rather, program the system to do the search independently of case • Unchanging • Value for given record should never change • Given primary key value should always identify same record • Example – Texas Driver’s License number stays with you, even if you move away from Texas, then come back
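The canonical-form and uniqueness rules above can be sketched in a few lines of C++. This is only an illustration: the function names `canonical_key` and `add_primary_key` are hypothetical, upper-casing plus whitespace trimming is just one possible canonical form, and a `std::set` stands in for whatever structure really tracks the keys already in the file.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Bring a key into one canonical form: surrounding whitespace removed,
// letters converted to upper-case. (Assumed form, chosen for illustration.)
std::string canonical_key(const std::string& raw) {
    const char* whitespace = " \t\r\n";
    const std::size_t first = raw.find_first_not_of(whitespace);
    if (first == std::string::npos) return "";  // empty or all whitespace
    const std::size_t last = raw.find_last_not_of(whitespace);
    std::string key = raw.substr(first, last - first + 1);
    std::transform(key.begin(), key.end(), key.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return key;
}

// Hypothetical in-memory stand-in for the primary keys already in the file.
// Returns false (and stores nothing) if the canonical key is already present.
bool add_primary_key(std::set<std::string>& keys, const std::string& raw) {
    return keys.insert(canonical_key(raw)).second;
}
```

With this sketch, adding "ames" and then " AMES " stores one key and rejects the second as a duplicate, because both search candidates are brought into the same form before they are compared.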

  6. Primary Keys, cont’d. • Implication on File Design • Don’t use possibly non-unique field(s) as primary key • Bad – name, birth date, etc. • Don’t use anything that can possibly change • Bad – name, address, etc. • What can we use? • Best – artificial identifier • Student number • Driver’s license number • Other artificially created unique value

  7. Secondary Keys • Not Such Stringent Rules • Duplicates allowed • Still have to define what “find” means if duplicates allowed • Usually real data, as opposed to primary keys • The kinds of thing you’d want to search for in real life • Not used to impose any order on the file • Can return results based on secondary key(s) • Selected by secondary key value(s) • Sorted on secondary key value(s)

  8. Searching • From 2325 – Two Major Methods • Sequential • Start at beginning, look until you find what you’re after • Choices: • Non-unique keys allowed? • Return first match or all of them? • Binary • Start in middle, remove half the list each time through • Requires: • Primary key values unique across file • File sorted on primary • Records directly accessible • There are others, but …
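A binary search over a file can be sketched as follows. The layout is an assumption made for the example, not something the slides specify: fixed-length records of `kRecordSize` bytes, the key in the first `kKeySize` bytes, keys unique, and the file sorted ascending on that key. Those assumptions correspond to the three requirements listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a real file
#include <string>

constexpr std::size_t kKeySize = 4;        // assumed key width in bytes
constexpr std::streamoff kRecordSize = 8;  // assumed fixed record length

// Returns the record number of the record whose key equals `key`, or -1.
// Each probe is one seek plus one small read, so at most about log2(N) probes.
long binary_search_file(std::istream& in, std::size_t record_count,
                        const std::string& key) {
    assert(key.size() == kKeySize);
    long low = 0;
    long high = static_cast<long>(record_count) - 1;
    char buffer[kKeySize];
    while (low <= high) {
        const long mid = low + (high - low) / 2;
        in.clear();                                // forget any earlier EOF
        in.seekg(mid * kRecordSize, std::ios::beg);
        if (!in.read(buffer, kKeySize)) return -1; // file shorter than claimed
        const std::string candidate(buffer, kKeySize);
        if (candidate == key) return mid;
        if (candidate < key) {
            low = mid + 1;   // target lies in the upper half
        } else {
            high = mid - 1;  // target lies in the lower half
        }
    }
    return -1;
}
```

The record count is passed in by the caller here; in practice it might come from the file size or from a header record.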

  9. Sequential Searching • Performance • It might take 1 try; it might take N tries • Average number of tries ≈ N / 2 (exactly (N + 1) / 2 when the key is present and every position is equally likely) if: • Searching on a unique key • Returning first match • Number of tries = N if: • Returning all matches (every record has to be examined)
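The two cases above can be made concrete with a small in-memory sketch that counts comparisons. The `Record` type and the function names are hypothetical; a `std::vector` stands in for the file so that only the number of tries is being illustrated, not disk behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Record {
    std::string key;
    std::string data;
};

// First-match search: stops at the first hit, so `tries` ranges from 1 to N.
// Returns the index of the match, or -1 if the key is absent.
long find_first(const std::vector<Record>& file, const std::string& key,
                std::size_t& tries) {
    tries = 0;
    for (std::size_t i = 0; i < file.size(); ++i) {
        ++tries;
        if (file[i].key == key) return static_cast<long>(i);
    }
    return -1;
}

// All-matches search: cannot stop early, so `tries` is always N.
std::vector<std::size_t> find_all(const std::vector<Record>& file,
                                  const std::string& key, std::size_t& tries) {
    tries = 0;
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < file.size(); ++i) {
        ++tries;
        if (file[i].key == key) hits.push_back(i);
    }
    return hits;
}
```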

  10. Sequential Searching • Performance • Big factor in disk access • Worst case: • File fragmented around the disk • Each program read takes one physical read • Best case: • File fairly contiguous on disk • I/O System buffers things so very few (1?) actual reads are done • In multi-user OSs, this seldom happens • However: • If read/write head didn’t move between accesses • Rotational latency & transfer times small compared to seek time • Multiple physical reads wouldn’t have as much of an impact • However, most OSs are multi-tasking now • Can’t rely on read/write head’s being where you left it • Must assume N physical reads take N full disk accesses

  11. Improving Sequential Searches • Reduce Number of Physical Reads • We can’t do anything about: • File fragmentation • If file’s clusters scattered around disk, multiple seeks are necessary • Multi-tasking environment • Have to assume each program read causes a physical read • (May not be true, if I/O System has good internal caching) • So what do we do? • Increase the number of records pulled in by each physical read • Saw this with magnetic tape – group the records into blocks • Similar to way we collected fields into records, but … • Grouping fields into records is dependent on data characteristics • Grouping records into blocks is dependent on I/O system & disk • Block size should be: • Multiple of disk sector size • Compatible with I/O System’s ability to read
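A rough sketch of record blocking: each call to `read` pulls in a whole block of records, and the search then runs over the block in memory. The sizes are assumptions chosen so that one block is 512 bytes (a common sector size), and the key is assumed to sit at the start of each fixed-length record. Whether one program read really maps to one physical read depends on the I/O system, as noted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a real file
#include <string>
#include <vector>

constexpr std::size_t kRecordSize = 8;        // assumed fixed record length
constexpr std::size_t kRecordsPerBlock = 64;  // 64 * 8 = 512 bytes per block

// Sequential search that reads one block per read call.
// Returns the record number of the first match (or -1) and reports how many
// read calls were needed in `reads`.
long blocked_search(std::istream& in, const std::string& key,
                    std::size_t& reads) {
    assert(key.size() <= kRecordSize);
    std::vector<char> block(kRecordSize * kRecordsPerBlock);
    reads = 0;
    long rrn = 0;
    while (in) {
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;  // nothing left in the file
        ++reads;
        // Scan only the complete records that were actually read.
        for (std::size_t offset = 0; offset + kRecordSize <= got;
             offset += kRecordSize) {
            if (std::memcmp(block.data() + offset, key.data(), key.size()) == 0) {
                return rrn;
            }
            ++rrn;
        }
    }
    return -1;
}
```

For a 100-record file this needs 2 read calls to scan everything, where a record-at-a-time loop would issue 100.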

  12. When to Use Sequential Searching • Sequential Searching is Good for: • Text files where you’re looking for a pattern • Unix ‘grep’ (“global regular expression print”) command • Small files • Like you use in labs here • Files that are searched very infrequently • Not worth the effort to sort to make binary search work • When you expect a large number of matches • Example – searching on a secondary key • It’s Not so Good for: • Binary files • Sorted files • Big files

  13. Unix Tools for Sequential Access • cat • Seen this one – concatenate files • cat F1 F2 >F3 • wc • Word count (also character & line count) • wc article.txt • grep • Search file for occurrences of regular expression pattern • grep "Ames" personlist.txt • od • Octal dump – or hex, or … • od -ch list.dat

  14. Direct Access • What is it? • Go straight to the record you want in the file • No searching • No unnecessary disk accesses • What’s its “order”? • Time to find a record is independent of number of records • However, it can be harder to do

  15. Direct Access • How to Do It? • At I/O System level, seek to record • C++ seek operations go to relative byte address (RBA) in file • Variants: • Seek with “get” pointer vs. seek with “put” pointer • Relative to start, current position, or end of file (default: start) • But that still doesn’t answer the question • How do we know what RBA a particular record starts at? • We’ve talked about index files – but that’s for later • We could move the problem up one level • Use relative record number (RRN) • But that’s no real help • Still need some kind of index – way to find record’s RRN • Also requires use of fixed-length records: RBA = RRN * Record_Size (assuming, of course, that the first RRN is 0)
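The RBA = RRN * Record_Size rule maps directly onto the C++ "get" pointer. A minimal sketch, assuming 16-byte fixed-length records and a first RRN of 0; `rba_of` and `read_record` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a real file
#include <string>

constexpr std::streamoff kRecordSize = 16;  // assumed fixed record length

// Relative byte address of a record, with the first record at RRN 0.
std::streamoff rba_of(long rrn) {
    return static_cast<std::streamoff>(rrn) * kRecordSize;
}

// Seeks the "get" pointer straight to the record and reads it.
// Returns false if the record lies beyond the end of the file.
bool read_record(std::istream& in, long rrn, std::string& out) {
    assert(rrn >= 0);
    std::string buffer(static_cast<std::size_t>(kRecordSize), '\0');
    in.clear();                            // forget any earlier EOF
    in.seekg(rba_of(rrn), std::ios::beg);  // offset measured from file start
    if (!in.read(&buffer[0], kRecordSize)) return false;
    out = buffer;
    return true;
}
```

The cost of `read_record` does not depend on how many records the file holds, which is the point of direct access. It still leaves open how the caller learns the RRN in the first place.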

  16. Building a File of Records • Like Building a Record of Fields • Same problem, up one level • Fixed-length or specified-length records? • How to directly access records? • But wait – there’s more: • Want to require software to know as few details about file as possible • To do that, those details need to be stored with (in) the file • File header records • Store file-specific information at start of file • Header record format • Constant across all file types within one system • Why?

  17. File Header Records • Things a Header Record Might Contain • File structure • Type of record structure • Number of data records • Length of records (if fixed-length) • Record delimiter (if delimited) • Record structure (if records have consistent structure) • Number of fields • Length of each field or delimiter between each field • Format of each field • Key information – if needed • Primary key field • Secondary key field(s), if any • Date/time of most recent access • Date/time of most recent update

  18. File Header Records, continued • Header Record Format • Binary or character? • Depends – is it important for people to read it? • Here’s a place where HTML-style format might work • Lets files of different formats have different headers (in some ways) • Incurs the parsing overhead only once per file
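As one possible shape for a binary header record, the sketch below stores three of the items listed earlier (number of data records, record length, number of fields) at the start of the file. The field choice, widths, and names are assumptions for illustration. The values are written in the host machine's byte order, so a file written this way is only readable on a machine with the same byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>  // std::stringstream, handy for trying this without a real file

// Hypothetical fixed-size header: 4 + 2 + 2 = 8 bytes on disk.
struct FileHeader {
    std::uint32_t record_count;   // number of data records
    std::uint16_t record_length;  // bytes per record (fixed-length file)
    std::uint16_t field_count;    // fields per record
};

// Writes the header field by field (avoids struct padding in the file).
// Assumes the stream is positioned at the start of the file.
void write_header(std::ostream& out, const FileHeader& h) {
    out.write(reinterpret_cast<const char*>(&h.record_count), sizeof h.record_count);
    out.write(reinterpret_cast<const char*>(&h.record_length), sizeof h.record_length);
    out.write(reinterpret_cast<const char*>(&h.field_count), sizeof h.field_count);
}

// Reads the header back; returns false if the file is too short.
// Assumes the stream is positioned at the start of the file.
bool read_header(std::istream& in, FileHeader& h) {
    in.read(reinterpret_cast<char*>(&h.record_count), sizeof h.record_count);
    in.read(reinterpret_cast<char*>(&h.record_length), sizeof h.record_length);
    in.read(reinterpret_cast<char*>(&h.field_count), sizeof h.field_count);
    return static_cast<bool>(in);
}
```

A program that reads this header first needs to know only the header layout; the record length and count come from the file itself.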

  19. What’s the Difference? • File Organization • Format of the file itself • Fixed-length, specified-length, or delimited records • ASCII or binary character encoding • File Access Method • Way(s) software can get at contents of file • Sequential vs. direct • Indexed sequential

  20. Designing a File • Access Affects Organization • If sequential access is all we need • Pretty much any organization is OK • Subject, of course, to application needs • If we need direct access • Need fixed-length records • Can also use indexed files, but that’s for later on • But Organization Also Affects Access • What if data to be stored in a record is wildly variable? • Fixed-length records would be extremely wasteful • But if we use specified-length records, how to do direct access? • Just about have to use indexing then

  21. Metadata • Data About Data • Usually in the form of a file header • Example in text • Astronomy image storage format • HTML format (name = value) • But look on page 177: coding style makes a BIG difference • Parsing this kind of data • Read field name; read field value • Convert ASCII value to type required for storage & use • Store converted value into right variable • Why use this type of header?
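The parsing steps above (read field name, read field value, convert, store) can be sketched for a "name = value" text header. The rule that a blank line ends the header and the field names used in the usage note are assumptions for the example, not the textbook's format.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>  // std::istringstream, handy for trying this without a real file
#include <string>

// Removes leading and trailing whitespace (including a stray '\r').
std::string trim(const std::string& s) {
    const char* whitespace = " \t\r\n";
    const std::size_t first = s.find_first_not_of(whitespace);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}

// Reads "name = value" lines until a blank line or end of file.
// Values stay as text here; the caller converts each to the type it needs.
std::map<std::string, std::string> parse_header(std::istream& in) {
    std::map<std::string, std::string> fields;
    std::string line;
    while (std::getline(in, line) && !trim(line).empty()) {
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;  // skip lines without '='
        fields[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return fields;
}
```

The conversion step is then something like `int width = std::stoi(fields.at("width"));`, where `width` is a hypothetical field name. Because the header is read once per file, the cost of parsing text is paid once, not once per record.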

  22. More Metadata • PC Graphics Storage Formats • Data • Color values for each pixel in image • Data compression often used (GIF, JPG) • Different color “depth” possibilities • Metadata • Height, width • Number of bits per pixel (color depth) • If not true color (24 bits / pixel) • Color look-up table • Normally 256 entries • Indexed by values stored for each pixel (normally 1 byte) • Contains R/G/B values for color combination • Formatted to be loaded directly into PC graphics RAM

  23. Mixing Data Objects in a File • Objective • Store different types of data in the same file • Textbook example – mix of astronomy data • “File” header (HTML-style) • “File” of notes – lines of ASCII text • “File” of image data – in whatever format • So our data file becomes a file of files • Each individual “file” (header, notes, or image) looks like a record in this new “mega-file” • These “mega-records” are of varying length • How do we store the “records” in the “mega-records”? • Could use another level of specified-length record software • Or, …

  24. Our “Mega-File” • Organization • Mega-file: Mega-file header, Notes sub-file, Image sub-file, Notes sub-file, Image sub-file, … • Notes sub-file: Notes header, Text line, Text line, …, Text line, Terminator line • Image sub-file: Image header, Image data

  25. More on Our Mega-File • Access • Can we just read it sequentially? • Why or why not? • What if we wanted to skip a notes sub-file? • What if some image didn’t even have a notes sub-file? • Can we access it directly? • What would the header have to include to allow that? • An index of the “records” in the file • We call the entries in that index “tags” • Each tag in the tag list has: • Type of sub-file referred to • Special-case type: end of file • RBA of sub-file in mega-file • Length of sub-file (not necessary, but helpful) • Key information, if any, for sub-file

  26. More on Our Mega-File • Access, continued • So how do we access the mega-file now? • Read and process the header • Get whole-file information • Build in-memory tag table for sub-files • Sequential access • Same as before • May be able to program in some speed-ups from tag table • Direct access • Locate sub-file in tag table • Go right to it
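A sketch of the in-memory tag table and the "locate sub-file, go right to it" step. The `Tag` fields follow the list given for each tag (type, RBA, length, key); the type names, the field widths, and the two helper functions are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a real file
#include <string>
#include <vector>

enum class SubFileType { Notes, Image, EndOfFile };

struct Tag {
    SubFileType type;      // kind of sub-file the tag refers to
    std::uint64_t rba;     // where the sub-file starts in the mega-file
    std::uint64_t length;  // sub-file length in bytes
    std::string key;       // key information, if any
};

// Looks up a sub-file in the tag table; returns nullptr if there is none.
const Tag* find_tag(const std::vector<Tag>& table, SubFileType type,
                    const std::string& key) {
    for (const Tag& tag : table) {
        if (tag.type == SubFileType::EndOfFile) break;  // special end marker
        if (tag.type == type && tag.key == key) return &tag;
    }
    return nullptr;
}

// Direct access: seek to the sub-file's RBA and read exactly its bytes.
// The mega-file layer treats them as opaque; a sub-file processor gives
// them meaning.
std::string read_sub_file(std::istream& in, const Tag& tag) {
    std::string bytes(static_cast<std::size_t>(tag.length), '\0');
    in.clear();
    in.seekg(static_cast<std::streamoff>(tag.rba), std::ios::beg);
    in.read(&bytes[0], static_cast<std::streamsize>(tag.length));
    bytes.resize(static_cast<std::size_t>(in.gcount()));  // short file: keep what was read
    return bytes;
}
```

Storing the length in each tag is what lets `read_sub_file` stop at the right byte without understanding the sub-file's own format.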

  27. Extensibility • Look at Our “Mega-File” Format Again • Header tells us things about the sub-files: • What kinds of files they are • Where to find them • Files themselves • To the mega-file processor, just random bytes • To the sub-file processor, meaningful information • What if we need a new type of sub-file? • Define a new type of header entry • Extend header processor to understand that entry • Write (or borrow or buy) code to handle new sub-file • Cardinal Rule: • Everything changes – file types, data types, ...

  28. Factors Affecting Portability - 1 • Operating System Differences • Example – text lines • End with line-feed character • End with carriage-return and line-feed • Prefixed by a count of characters in the line • Natural Language Differences • Example – character coding • Single-byte coding – ASCII, EBCDIC • Multi-byte coding – Unicode (for example, 2-byte code units in UTF-16) • Programming Language Differences • Pascal can’t directly process varying-length records • Different C++ compilers use different byte lengths for the standard data types
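For the text-line example, a reader can hide two of the three conventions from the rest of the program. The sketch below accepts lines ending in line-feed or in carriage-return plus line-feed; it does not handle count-prefixed lines, which would need a different reader. The function name is hypothetical.

```cpp
#include <cassert>
#include <istream>
#include <sstream>  // std::istringstream, handy for trying this without a real file
#include <string>
#include <vector>

// Reads every line of a text stream, accepting both LF and CR+LF endings.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {  // getline stops at (and drops) '\n'
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();          // drop the CR of a CR+LF pair
        }
        lines.push_back(line);
    }
    return lines;
}
```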

  29. Factors Affecting Portability - 2 • Computer Architecture Differences • Byte order in 16-bit and 32-bit integer values • Big-endian – leftmost byte is most significant • Little-endian – rightmost byte is most significant • Example – the two bytes 0x15 0x32 • Big-endian interpretation: 0x1532 • Little-endian interpretation: 0x3215 • Storage of data in memory • Some architectures require values that are N bytes long to start at a byte whose address is divisible by N
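The two interpretations of the bytes 0x15 0x32 can be written out explicitly. The function names are hypothetical; the shifts make the byte order part of the code, so the result does not depend on the byte order of the machine running it.

```cpp
#include <cassert>
#include <cstdint>

// Big-endian: the first (leftmost) byte is the most significant.
std::uint16_t from_big_endian(const unsigned char bytes[2]) {
    return static_cast<std::uint16_t>((bytes[0] << 8) | bytes[1]);
}

// Little-endian: the last (rightmost) byte is the most significant.
std::uint16_t from_little_endian(const unsigned char bytes[2]) {
    return static_cast<std::uint16_t>((bytes[1] << 8) | bytes[0]);
}
```

For the byte pair { 0x15, 0x32 }, `from_big_endian` returns 0x1532 and `from_little_endian` returns 0x3215.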

  30. How to Port Files • Define Your Format C*A*R*E*F*U*L*L*Y • Once a format is defined, never change it • If you need a new format, add it so as not to invalidate the existing formats • If you need to change a format, add a new one instead, and let programs that need the new version use it • Decide on a standard format for data elements • Text lines • ASCII, EBCDIC, or Unicode? • Which character(s) to end lines? • Binary • Tightly packed or multiple-of-N addressing? • Which “endian”? • You can always write code to convert to & from the standard format on a new language, computer, etc.
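Converting to and from a standard format might look like the sketch below for one data element, a 32-bit integer. Big-endian is picked as the file's standard purely for illustration; the slides leave the choice open, and the function names are hypothetical. What matters is that the file's byte order is fixed by the format, not by whichever machine wrote it.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>  // std::stringstream, handy for trying this without a real file
#include <string>

// Writes a 32-bit value as 4 tightly packed bytes, most significant first,
// regardless of the host machine's own byte order.
void write_u32_be(std::ostream& out, std::uint32_t value) {
    const unsigned char bytes[4] = {
        static_cast<unsigned char>((value >> 24) & 0xFF),
        static_cast<unsigned char>((value >> 16) & 0xFF),
        static_cast<unsigned char>((value >> 8) & 0xFF),
        static_cast<unsigned char>(value & 0xFF),
    };
    out.write(reinterpret_cast<const char*>(bytes), sizeof bytes);
}

// Reads 4 bytes and reassembles them, most significant first.
// Returns false if fewer than 4 bytes are available.
bool read_u32_be(std::istream& in, std::uint32_t& value) {
    unsigned char bytes[4];
    if (!in.read(reinterpret_cast<char*>(bytes), sizeof bytes)) return false;
    value = (static_cast<std::uint32_t>(bytes[0]) << 24) |
            (static_cast<std::uint32_t>(bytes[1]) << 16) |
            (static_cast<std::uint32_t>(bytes[2]) << 8) |
            static_cast<std::uint32_t>(bytes[3]);
    return true;
}
```

Porting to a new machine then means recompiling these two functions, not redefining the file format.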

  31. The Conversion Problem • Few Environments – can do directly • Example: IBM ↔ VAX • Many Env’ts. – need intermediate form • Each environment (IBM, VAX, IA-32, IA-64, ...) converts to and from one standard format: XML (or some other standard format)
