
April 29th, 2013 Prof. John Kubiatowicz http://inst.eecs.berkeley.edu/~cs194-24





Presentation Transcript


  1. CS194-24 Advanced Operating Systems Structures and Implementation Lecture 23: Application-Specific File Systems, Deep Archival Storage, Security and Protection. April 29th, 2013, Prof. John Kubiatowicz, http://inst.eecs.berkeley.edu/~cs194-24

  2. Goals for Today • Application-specific File Systems • Dynamo, Haystack • Deep Archival Storage • OceanStore • Security and Protection Interactive is important! Ask Questions! Note: Some slides and/or pictures in the following are adapted from Bovet, “Understanding the Linux Kernel”, 3rd edition, 2005

  3. Recall: VFS Common File Model • Four primary object types for VFS: • superblock object: represents a specific mounted filesystem • inode object: represents a specific file • dentry object: represents a directory entry • file object: represents open file associated with process • There is no specific directory object (VFS treats directories as files) • May need to fit the model by faking it • Example: make it look like directories are files • Example: make it look like you have inodes, superblocks, etc.

  4. Recall: Data-based Caching (Data “De-Duplication”) • Use a sliding-window hash function to break files into chunks • Rabin Fingerprint: randomized function of data window • Pick sensitivity: e.g. 48 bytes at a time, lower 13 bits = 0 → probability 2^-13 per position, expected chunk size 2^13 = 8192 bytes • Need minimum and maximum chunk sizes • Now – if data stays same, chunk stays the same • Blocks named by cryptographic hashes such as SHA-256
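The boundary rule above can be sketched in C. This is a minimal illustration, not a real Rabin fingerprint: a polynomial rolling hash over the last 48 bytes stands in for it, and while the window size and 13-bit mask follow the slide, the base `B` and the `MIN_CHUNK`/`MAX_CHUNK` bounds are illustrative values.

```c
#include <stdint.h>
#include <stddef.h>

/* Content-defined chunking sketch. A chunk boundary is declared where
 * the low 13 bits of the window hash are zero, which happens with
 * probability 2^-13 per position, for an expected chunk size of 8192
 * bytes, clamped by minimum and maximum chunk sizes. */
#define WINDOW    48
#define MASK      ((1u << 13) - 1)           /* low 13 bits */
#define B         257u                       /* hash base (illustrative) */
#define MIN_CHUNK 2048                       /* illustrative bounds */
#define MAX_CHUNK 65536

/* Returns the length of the chunk starting at data[0]. */
size_t next_chunk(const uint8_t *data, size_t len)
{
    uint32_t h = 0, bw = 1;
    for (int i = 0; i < WINDOW; i++)
        bw *= B;                             /* B^WINDOW mod 2^32 */

    for (size_t i = 0; i < len; i++) {
        h = h * B + data[i];                 /* shift new byte in */
        if (i >= WINDOW)
            h -= bw * data[i - WINDOW];      /* drop byte leaving window */
        if (i + 1 >= MIN_CHUNK && (h & MASK) == 0)
            return i + 1;                    /* content-defined boundary */
        if (i + 1 >= MAX_CHUNK)
            return i + 1;                    /* forced boundary */
    }
    return len;                              /* final (partial) chunk */
}
```

Because a boundary depends only on the 48 bytes currently in the window, an insertion early in a file shifts nearby boundaries but leaves later chunks, and therefore their SHA-256 names, unchanged; that locality is what makes de-duplication effective.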

  5. Recall: Peer-to-Peer: Fully equivalent components • Peer-to-Peer has many interacting components • View system as a set of equivalent nodes • “All nodes are created equal” • Any structure on system must be self-organizing • Not based on physical characteristics, location, or ownership

  6. Recall: Lookup with Leaf Set (Chord) [Figure: a lookup routed from a source node through nodes with IDs 111…, 110…, 10…, 0… to the responding node] • Assign IDs to nodes • Map hash values to node with closest ID • Leaf set is successors and predecessors • All that’s needed for correctness • Routing table matches successively longer prefixes • Allows efficient lookups • Data Replication: • On leaf set

  7. Advantages/Disadvantages of Consistent Hashing • Advantages: • Automatically adapts data partitioning as node membership changes • Node given random key value automatically “knows” how to participate in routing and data management • Random key assignment gives approximation to load balance • Disadvantages • Uneven distribution of key storage is a natural consequence of random node names → leads to uneven query load • Key management can be expensive when nodes transiently fail • Assuming that we immediately respond to node failure, must transfer state to new node set • Then when node returns, must transfer state back • Can be a significant cost if transient failure common • Disadvantages of “Scalable” routing algorithms • More than one hop to find data → O(log N) or worse • Number of hops unpredictable and almost always > 1 • Node failure, randomness, etc.
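The key-to-node mapping behind consistent hashing can be sketched as follows. This is a simplification under stated assumptions: node IDs are kept in a sorted array and scanned linearly, standing in for the O(log N) prefix routing a real overlay like Chord would use; the function name `successor` is illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Consistent-hashing lookup sketch: node IDs and key hashes share one
 * circular 2^32 ID space. A key is stored at its successor -- the
 * first node whose ID is >= the key's hash, wrapping around to the
 * smallest ID. node_ids must be sorted ascending. */
uint32_t successor(const uint32_t *node_ids, size_t n, uint32_t key_hash)
{
    for (size_t i = 0; i < n; i++)
        if (node_ids[i] >= key_hash)
            return node_ids[i];              /* first ID at or past key */
    return node_ids[0];                      /* wrap around the ring */
}
```

When a node joins or leaves, only the keys on the arc between it and its predecessor change owners: that is the "automatically adapts" advantage, while the uneven arc lengths between random IDs are exactly the uneven-load disadvantage from the slide.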

  8. Dynamo Assumptions • Query Model – Simple interface exposed to application level • Get(), Put() • No Delete() • No transactions, no complex queries • ACID properties (Atomicity, Consistency, Isolation, Durability) weakened: • Operations either succeed or fail, no middle ground • System will be eventually consistent, no sacrifice of availability to assure consistency • Conflicts can occur while updates propagate through system • System can still function while entire sections of network are down • Efficiency – Measure system by the 99.9th percentile • Important with millions of users, 0.1% can be in the 10,000s • Non-Hostile Environment • No need to authenticate query, no malicious queries • Behind web services, not in front of them

  9. Service Level Agreements (SLA) • Application can deliver its functionality in a bounded time: • Every dependency in the platform needs to deliver its functionality with even tighter bounds. • Example: service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second • Contrast to services which focus on mean response time [Figure: Service-oriented architecture of Amazon’s platform]

  10. Replication • Each data item is replicated at N hosts • “preference list”: The list of nodes responsible for storing a particular key • Successive nodes not guaranteed to be on different physical nodes • Thus preference list includes physically distinct nodes • Sloppy Quorum • R (or W) is the minimum number of nodes that must participate in a successful read (or write) operation. • Setting R + W > N yields a quorum-like system. • Latency of a get (or put) is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency. • Replicas synchronized via anti-entropy protocol • Use of Merkle tree for each unique range • Nodes exchange root of trees for shared key range
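The R + W > N condition above can be checked mechanically; a minimal sketch (the function name is illustrative, not Dynamo's API):

```c
#include <stdbool.h>

/* Quorum-overlap check: with N replicas, a read quorum of R nodes and
 * a write quorum of W nodes chosen as disjointly as possible still
 * share r + w - n nodes, so every read is guaranteed to reach at least
 * one replica holding the latest write exactly when R + W > N. */
bool quorum_overlaps(int n, int r, int w)
{
    return r + w - n > 0;    /* worst-case intersection is non-empty */
}
```

Under this check, a common configuration like (N=3, R=2, W=2) gives the quorum guarantee, while a write-optimized setting such as W=1 with R=1 does not and must rely on the anti-entropy protocol to converge.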

  11. Administrivia • Get moving on Lab 4 • Will require you to read a bunch of code to digest the VFS layer • Design due this Thursday! • So that Palmer can have design reviews on Friday • Focus on behavioral aspects • Mounting, File operations, Etc • Don’t forget final Lecture during RRR • Monday 5/6 • Send me final topics

  12. Data Versioning • A put() call may return to its caller before the update has been applied at all the replicas • A get() call may return many versions of the same object. • Challenge: an object having distinct version sub-histories, which the system will need to reconcile in the future. • Solution: uses vector clocks in order to capture causality between different versions of the same object • A vector clock is a list of (node, counter) pairs • Every version of every object is associated with one vector clock • If the counters on the first object’s clock are less than or equal to the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten.
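The ancestry test in the last bullet is mechanical. A sketch follows, with simplifying assumptions: clocks are fixed-size arrays indexed by node ID rather than lists of (node, counter) pairs, and `NODES`, `descends`, and `conflict` are illustrative names.

```c
#include <stdbool.h>
#include <stddef.h>

#define NODES 4   /* illustrative: one counter per node in the system */

/* Version A is an ancestor of version B (and can be forgotten) iff
 * every counter in A is <= the corresponding counter in B. */
bool descends(const int a[NODES], const int b[NODES])
{
    for (size_t i = 0; i < NODES; i++)
        if (a[i] > b[i])
            return false;
    return true;              /* a <= b component-wise */
}

/* If neither clock descends from the other, the versions are
 * concurrent and the application must reconcile them on read. */
bool conflict(const int a[NODES], const int b[NODES])
{
    return !descends(a, b) && !descends(b, a);
}
```

This is exactly why a get() can return multiple versions: two coordinators that each bump their own counter produce clocks like [2,0,0,0] and [1,1,0,0], which neither dominates the other.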

  13. Vector clock example

  14. Conflicts (multiversion data) • Client must resolve conflicts • Only resolve conflicts on reads • Different resolution options: • Use vector clocks to decide based on history • Use timestamps to pick latest version • Examples given in paper: • For shopping cart, simply merge different versions • For customer’s session information, use latest version • Stale versions returned on reads are updated (“read repair”) • Vary N, R, W to match requirements of applications • High performance reads: R=1, W=N • Fast writes with possible inconsistency: W=1 • Common configuration: N=3, R=2, W=2 • When do branches occur? • Branches uncommon: 0.06% of requests saw > 1 version over 24 hours • Divergence occurs because of high write rate (more coordinators), not necessarily because of failure

  15. Haystack File System • Does it ever make sense to adapt a file system to a particular usage pattern? • Perhaps • Good example: Facebook’s “Haystack” filesystem • Specific application (Photo Sharing) • Large files!, Many files! • 260 Billion images, 20 PetaBytes (1 PB = 10^15 bytes!) • One billion new photos a week (60 TeraBytes) • Presence of Content Delivery Network (CDN) • Distributed caching and distribution network • Facebook web servers return special URLs that encode requests to CDN • Pay for service by bandwidth • Specific usage patterns: • New photos accessed a lot (caching works well) • Old photos accessed little, but likely to be requested at any time → the “NEEDLES” [Figure: number of photos requested in a day vs. photo age]

  16. Old Solution: NFS • Issues with this design? • Long Tail → Caching does not work for most photos • Every access to back-end storage must be fast without benefit of caching! • Linear directory scheme works badly for many photos/directory • Many disk operations to find even a single photo • Directory’s block map too big to cache in memory • “Fixed” by reducing directory size, however still not great • Meta-Data (FFS) requires ≥ 3 disk accesses per lookup • Caching all iNodes in memory might help, but iNodes are big • Fundamentally, Photo Storage different from other storage: • Normal file systems fine for developers, databases, etc

  17. New Solution: Haystack • Finding a needle (old photo) in Haystack • Differentiate between old and new photos • How? By looking at “Writeable” vs “Read-only” volumes • New Photos go to Writeable volumes • Directory: Help locate photos • Name (URL) of photo has embedded volume and photo ID • Let CDN or Haystack Cache serve new photos • rather than forwarding them to Writeable volumes • Haystack Store: Multiple “Physical Volumes” • Physical volume is large file (100 GB) which stores millions of photos • Data Accessed by Volume ID with offset into file • Since Physical Volumes are large files, use XFS which is optimized for large files

  18. Haystack Details • Each physical volume is stored as single file in XFS • Superblock: General information about the volume • Each photo (a “needle”) stored by appending to file • Needles stored sequentially in file • Naming: [Volume ID, Key, Alternate Key, Cookie] • Cookie: random value to avoid guessing attacks • Key: Unique 64-bit photo ID • Alternate Key: four different sizes, ‘n’, ‘a’, ‘s’, ‘t’ • Deleted needle: simply marked as “deleted” • Overwritten needle: new version appended at end
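The needle naming scheme and the Store's in-memory index can be sketched as below. The field names, sizes, and struct layout are illustrative, not Facebook's exact on-disk format, and a real store would use a hash table rather than a linear scan.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative needle header: each photo is appended to the volume
 * file as a self-describing record. */
struct needle {
    uint32_t magic;      /* marks the start of a needle */
    uint64_t cookie;     /* random value to defeat URL guessing */
    uint64_t key;        /* unique 64-bit photo ID */
    uint8_t  alt_key;    /* size variant: 'n', 'a', 's', 't' */
    uint8_t  flags;      /* e.g. deleted bit: delete = mark, not erase */
    uint32_t size;       /* length of the photo data that follows */
    /* photo bytes, then a checksum, follow on disk */
};

/* In-memory index entry: [Key, Alternate Key] -> byte offset. */
struct index_entry { uint64_t key; uint8_t alt_key; uint64_t offset; };

/* Look up a needle's offset in the volume file; -1 if absent.
 * With the index in memory, a read costs at most one disk seek. */
int64_t needle_offset(const struct index_entry *idx, size_t n,
                      uint64_t key, uint8_t alt_key)
{
    for (size_t i = 0; i < n; i++)
        if (idx[i].key == key && idx[i].alt_key == alt_key)
            return (int64_t)idx[i].offset;
    return -1;
}
```

The point of the design is visible here: the metadata per photo is a few tens of bytes, so millions of photos per volume fit in RAM, avoiding the ≥ 3 disk accesses per lookup of the NFS design.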

  19. Haystack Details (Con’t) • Replication for reliability and performance: • Multiple physical volumes combined into logical volume • Factor of 3 • Four different sizes • Thumbnails, Small, Medium, Large • Lookup • User requests Webpage • Webserver returns URL of form: • http://<CDN>/<Cache>/<Machine id>/<Logical volume,photo> • Possibly reference cache only if old image • CDN will strip off CDN reference if missing, forward to cache • Cache will strip off cache reference and forward to Store • In-memory index on Store for each volume maps: [Key, Alternate Key] → Offset

  20. What about Protection? • Start by asking some high-level questions… • What do we expect of our systems? • Won’t leak our information • Won’t lose our information • Will always work when we need them • Won’t launch attacks against other people • How can we prevent systems from misbehaving? • Never connect them to the network? • Always authenticate users? • Never use them? • Protection: use of one or more mechanisms for controlling the access of programs, processes, or users to resources • Page Table Mechanism • File Access Mechanism • On-disk encryption • Can use lots of Protection but still have an insecure system! • Bugs, back doors, viruses, poorly defined policy, inside man • Denial of service, …

  21. Protection vs Security • Security is a very complex topic: see, e.g., CS161 • Security is about Policy, i.e. what human-centered properties do we want from our system • Usually with reference to an attack model • Security is achieved through a series of Mechanisms, i.e. individual elements of the system combined together to achieve a security policy • Security: use of protection mechanisms to prevent misuse of resources • Misuse defined with respect to policy • E.g.: prevent exposure of certain sensitive information • E.g.: prevent unauthorized modification/deletion of data • Requires consideration of the external environment within which the system operates • Even the most well-constructed system cannot protect information if a user accidentally reveals their password

  22. Preventing Misuse • Types of Misuse: • Accidental: • If I delete shell, can’t log in to fix it! • Could make it more difficult by asking: “do you really want to delete the shell?” • Intentional: • Some high school brat who can’t get a date, so instead he transfers $3 billion from B to A. • Doesn’t help to ask if they want to do it (of course!) • Three Pieces to Security • Authentication: who the user actually is • Authorization: who is allowed to do what • Enforcement: make sure people do only what they are supposed to do • Loopholes in any carefully constructed system: • Log in as superuser and you’ve circumvented authentication • Log in as self and can do anything with your resources; for instance: run program that erases all of your files • Can you trust software to correctly enforce Authentication and Authorization?????

  23. Authentication: Identifying Users • How to identify users to the system? • Passwords • Shared secret between two parties • Since only user knows password, someone types correct password → must be user typing it • Very common technique • Smart Cards • Electronics embedded in card capable of providing long passwords or satisfying challenge-response queries • May have display to allow reading of password • Or can be plugged in directly; several credit cards now in this category • Biometrics • Use of one or more intrinsic physical or behavioral traits to identify someone • Examples: fingerprint reader, palm reader, retinal scan • Becoming quite a bit more common • What else? • Consider the “Swarm” and “Un-pad” views

  24. Timing Attacks: Tenex Password Checking • Tenex – early 70’s, BBN • Most popular system at universities before UNIX • Thought to be very secure, gave “red team” all the source code and documentation (want code to be publicly available, as in UNIX) • In 48 hours, they figured out how to get every password in the system • Here’s the code for the password check: for (i = 0; i < 8; i++) if (userPasswd[i] != realPasswd[i]) goto error; • How many combinations of passwords? • 256^8? • Wrong!

  25. Defeating Password Checking • Tenex used VM, and it interacts badly with the above code • Key idea: force page faults at inopportune times to break passwords quickly • Arrange 1st char in string to be last char in pg, rest on next pg • Then arrange for pg with 1st char to be in memory, and rest to be on disk (e.g., ref lots of other pgs, then ref 1st page) • Layout: “a” on a page in memory | “aaaaaaa” on a page on disk • Time password check to determine if first character is correct! • If fast, 1st char is wrong • If slow, 1st char is right, pg fault, one of the others wrong • So try all first characters, until one is slow • Repeat with first two characters in memory, rest on disk • Only 256 * 8 attempts to crack passwords • Fix is easy: don’t stop until you look at all the characters
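The fix in the last bullet amounts to a constant-time comparison. A sketch (not the actual Tenex code; the function name is illustrative):

```c
#include <stddef.h>

/* Timing-safe password check: the Tenex bug was returning on the first
 * mismatch, which leaks the position of the failing character through
 * timing (and page faults). Here every byte is examined
 * unconditionally and mismatches are accumulated, so running time does
 * not depend on where the first difference occurs. */
int check_password(const char *user, const char *real, size_t len)
{
    unsigned char diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= (unsigned char)(user[i] ^ real[i]);   /* never early-exit */
    return diff == 0;     /* 1 iff all characters matched */
}
```

With no early exit, the attacker's page-fault trick no longer distinguishes "first character right" from "first character wrong", restoring the full 256^8 search space.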

  26. Recall: Authorization: Who Can Do What? • How do we decide who is authorized to do actions in the system? • Access Control Matrix: contains all permissions in the system • Resources across top • Files, Devices, etc… • Domains down the rows • A domain might be a user or a group of permissions • E.g. above: User D3 can read F2 or execute F3 • In practice, table would be huge and sparse! • Two approaches to implementation • Access Control Lists: store permissions with each object • Still might be lots of users! • UNIX limits each file to: r,w,x for owner, group, world • More recent systems allow definition of groups of users and permissions for each group • Capability List: each process tracks which objects it has permission to touch • Popular in the past, idea out of favor today • Consider page table: Each process has list of pages it has access to, not each page has list of processes …
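The ACL approach above can be sketched as follows. The types and names (`acl_entry`, `allowed`, the `R`/`W`/`X` bits) are illustrative, not any real kernel's API: each object stores one column of the sparse access matrix, with empty cells simply omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { X = 1, W = 2, R = 4 };    /* UNIX-style rights bitmask */

/* One ACL entry: a domain (user or group) and its rights bits. */
struct acl_entry { const char *domain; int rights; };

/* Check whether `domain` holds every right in `wanted` on the object
 * whose ACL is given. No entry means no access (default deny). */
bool allowed(const struct acl_entry *acl, size_t n,
             const char *domain, int wanted)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(acl[i].domain, domain) == 0)
            return (acl[i].rights & wanted) == wanted;
    return false;
}
```

The contrast with capabilities falls out of the data layout: here the object owns the list and revocation is easy (edit one ACL), whereas enumerating everything one domain can touch requires scanning every object's list.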

  27. Authorization Continued • Principle of least privilege: programs, users, and systems should get only enough privileges to perform their tasks • Very hard to do in practice • How do you figure out the minimum set of privileges needed to run your programs? • People often run at higher privilege than necessary • Such as the “administrator” privilege under Windows • One solution: Signed Software • Only use software from sources that you trust, thereby dealing with the problem by means of authentication • Fine for big, established firms such as Microsoft, since they can make their signing keys well known and people trust them • Actually, not always fine: recently, one of Microsoft’s signing keys was compromised, leading to malicious software that looked valid • What about new startups? • Who “validates” them? • How easy is it to fool them?

  28. Mandatory Access Control (MAC) • Mandatory Access Control (MAC) • “A type of access control by which the operating system constrains the ability of a subject or initiator to access or generally perform some sort of operation on an object or target.” (from Wikipedia) • Subject: a process or thread • Object: files, directories, TCP/UDP ports, etc • Security policy is centrally controlled by a security policy administrator: users not allowed to operate outside the policy • Examples: SELinux, HiStar, etc. • Contrast: Discretionary Access Control (DAC) • Access restricted based on the identity of subjects and/or groups to which they belong • Controls are discretionary – a subject with a certain access permission is capable of passing that permission on to any other subject • Standard UNIX model

  29. Data Centric Access Control (DCAC?) • Problem with many current models: • If you break into OS  data is compromised • In reality, it is the data that matters – hardware is somewhat irrelevant (and ubiquitous) • Data-Centric Access Control (DCAC) • I just made this term up, but you get the idea • Protect data at all costs, assume that software might be compromised • Requires encryption and sandboxing techniques • If hardware (or virtual machine) has the right cryptographic keys, then data is released • All of the previous authorization and enforcement mechanisms reduce to key distribution and protection • Never let decrypted data or keys outside sandbox • Examples: Use of TPM, virtual machine mechanisms

  30. Enforcement • Enforcer checks passwords, ACLs, etc • Makes sure that only authorized actions take place • Bugs in enforcer → things for malicious users to exploit • Normally, in UNIX, superuser can do anything • Because of coarse-grained access control, lots of stuff has to run as superuser in order to work • If there is a bug in any one of these programs, you lose! • Paradox • Bullet-proof enforcer • Only known way is to make enforcer as small as possible • Easier to make correct, but simple-minded protection model • Fancy protection • Tries to adhere to principle of least privilege • Really hard to get right • Same argument for Java or C++: What do you make private vs public? • Hard to make sure that code is usable but only necessary modules are public • Pick something in middle? Get bugs and weak protection!

  31. Summary • Peer-to-Peer: • Use of 100s or 1000s of nodes to achieve higher performance or greater availability • May need to relax consistency for better performance • Application-Specific File Systems (e.g. Haystack): • Optimize system for particular usage pattern • Security: use of protection mechanisms to prevent misuse of resources • Represents Human-Centered Policy as opposed to mechanism • Three Pieces to Security • Authentication: who the user actually is • Authorization: who is allowed to do what • Enforcement: make sure people do only what they are supposed to do • Principle of least privilege: programs, users, and systems should get only enough privileges to perform their tasks
