
COMS E6125 Web-enHanced Information Management (WHIM)

Presentation Transcript


  1. COMS E6125 Web-enHanced Information Management (WHIM) Prof. Gail Kaiser Spring 2012 Kaiser: COMS E6125

  2. Today’s Topic • Basic Web Mechanics • URI • HTTP • Client/Server Intermediaries

  3. What is a “URI”? • Uniform Resource Identifier • Compact string of characters for identifying an abstract or physical resource • Conforms to a simple and extensible format • Example: http://www.psl.cs.columbia.edu/courses/whim/

  4. What is a “Resource”? • Some piece of information that can be identified by a URI • The most common kind of resource is a file • But may also be a dynamically-generated query result, the output of a script, a document available in several languages or formats, etc.

  5. Uniform Resource Identifier • Uniform: aka Universal - same string can be used with same semantic interpretation, even when mechanisms used to access the resource differ • Resource: Conceptual mapping to an entity or set of entities - not necessarily the entity that corresponds to that mapping at any particular instance in time • Identifier: An object that can act as a reference to something that has identity

  6. Key Requirement: Transcribability • May be transcribed from non-network source • Often needs to be remembered by people • Should consist of characters that are most likely to be able to be typed into a computer, within the constraints imposed by keyboards (and related input devices) across languages and locales

  7. Why do we usually say URL rather than URI? • A Uniform Resource Locator (URL) refers to the subset of URIs that identify resources via a representation of their primary access mechanism (i.e., their network “location”) • Most popular form of URI

  8. What’s a URI that’s not a URL? • URN = Uniform Resource Name • Subset of URIs that denote a resource independent of its current location, the name by which it is known, or the mechanism by which it is accessed • Required to remain globally unique and persistent even when the resource ceases to exist or becomes unavailable • Thus not necessarily “retrievable”

  9. URN vs. URL Example • Assume a published book (the resource) • The ISBN (International Standard Book Number) is a 10-digit number (13 digits since 2007) that uniquely identifies books and book-like products published internationally - this is the URN • The entire contents of the book might be placed on a Web server at http://www.xyz.com/book.gz and an FTP server at ftp://ftp.xyz.com/book.gz - both of these are URLs • All of these are URIs

  10. URI Syntax • <scheme>:<scheme-specific-part> • For a URL, the scheme indicates the protocol employed for retrieval (http, ftp, file, mailto, etc.) • More generally, a scheme is a specification for defining the syntax and semantics of the rest of the URI • Extensible because new schemes can be defined, with their own scheme-specific format after the colon (:)

  11. URL Notation • <scheme>://<authority><path>?<query> • Authority: typically, an Internet domain name • Path: specific to the authority, identifies the resource within the scope of the scheme and authority • Query: a string of information to be interpreted by the resource
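
A minimal sketch of the notation above, using Python's standard urllib.parse to split a URL into scheme, authority, path, and query. The URL is a hypothetical one assembled from hosts and paths that appear elsewhere in these slides; it is an assumption for illustration, not an address from the course.

```python
from urllib.parse import urlsplit

def url_components(url):
    """Split a URL into the four parts of <scheme>://<authority><path>?<query>."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        "query": parts.query,
    }

# Hypothetical URL, for illustration only.
example = "http://www.example.com:2012/courses/whim/doc.html?name1=val1&name2=val2"
for name, value in url_components(example).items():
    print(f"{name:>9}: {value}")
```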

  12. What’s a “domain name”? • Domain Name System (DNS) • Maps domain names to IP addresses and vice versa • Hierarchy of DNS servers for top level domains (.com, .edu, .uk, etc.), second level domains (columbia.edu, ibm.com, etc.), and so on • Eventually finds IP address for individual host (e.g., bank.cs.columbia.edu) • DNS servers cache responses based on TTL = Time to Live • Originated ~1982, e.g., for email (gk60@CMUA -> gk60@CMUA.arpa -> gk60@a.cs.cmu.edu)

  13. Relative URLs • Allow document trees to be independent of their location and scheme • A single set of hypertext documents can be simultaneously traversable via each of the ftp, http and file schemes • Such document trees can be moved, as a whole, without changing any of the relative references • Resolved to full (absolute) URLs using a base URL

  14. Example Relative URLs • http://somehost/absolute/URL/with/absolute/path/to/resource.txt • /relative/URI/with/absolute/path/to/resource.txt • relative/path/to/resource.txt • ../../../resource.txt • resource.txt • /resource.txt#frag01 • #frag01 • [empty string]
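
As a rough illustration of resolving these references against a base URL, the sketch below uses urljoin from Python's standard library. The base URL is an assumption (a hypothetical document four directories deep on the same host); the references are the ones listed on the slide.

```python
from urllib.parse import urljoin

# Hypothetical base URL: the document in which the references below appear.
BASE = "http://somehost/a/b/c/d/page.html"

REFERENCES = [
    "http://somehost/absolute/URL/with/absolute/path/to/resource.txt",
    "/relative/URI/with/absolute/path/to/resource.txt",
    "relative/path/to/resource.txt",
    "../../../resource.txt",
    "resource.txt",
    "/resource.txt#frag01",
    "#frag01",
    "",  # the empty string refers to the base document itself
]

def resolve_all(base, references):
    """Map each reference to the absolute URL it denotes relative to base."""
    return {ref: urljoin(base, ref) for ref in references}

for ref, absolute in resolve_all(BASE, REFERENCES).items():
    print(f"{ref!r} -> {absolute}")
```

Moving the whole tree only changes the base, so every relative reference keeps working without being edited.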

  15. URI “Standard” • URI is an Internet protocol element defined currently in RFC 3986 (2005) • Originally RFC 1630 (1994)

  16. What is an “RFC”? • Request for Comments • One of a series, begun in 1969, of numbered informational documents and standards followed by commercial software and freeware in the Internet and Unix communities • All Internet standards are recorded in RFCs

  17. Who keeps track of RFCs? • IETF = Internet Engineering Task Force • Open, all-volunteer organization, with no formal membership or membership requirements • Organized into a large number of working groups, each dealing with a specific topic • April 1st RFCs, see http://www.apps.ietf.org/rfc/apr1list.html

  18. What is “W3C”? • World Wide Web Consortium defines data formats and usage conventions as well as Internet protocols relevant to the Web • Members pay fees depending on country, revenues and non-profit/for-profit status • Otherwise organized similarly to IETF, but writes “Recommendations” instead of “Requests for Comments” • http://www.w3.org/

  19. Back to URLs • Most Web documents use the “http” scheme (or “https” = http over TLS/SSL) • What is “http” (HyperText Transfer Protocol)?

  20. HTTP = HyperText Transfer Protocol • Most Web documents are accessed using the “http” scheme, the default Internet protocol used to deliver data on the WWW • Usually through TCP/IP sockets on port 80, but can use any port and can be implemented on top of any reliable networking protocol • A Web browser (HTTP client) sends requests to a Web server (HTTP server), which sends responses back to the client

  21. What’s “TCP/IP”? • IP = Internet Protocol • Delivers individual packets from one host to another, based on their IP address (in IPv4, four 8-bit octets as in 128.59.11.100) • Network routers direct traffic of IP packets • Analogous to telephone numbers (area code plus exchange plus 4 digits plus extension) and postal address (zip code plus street name plus building number plus apartment number)
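
To make the “four 8-bit octets” concrete, here is a small sketch using Python's standard ipaddress module on the address from the slide. It only shows how the dotted notation corresponds to one 32-bit number; the helper name ipv4_octets is made up for this example.

```python
import ipaddress

def ipv4_octets(text):
    """Return the four octets of a dotted IPv4 address and its 32-bit integer value."""
    address = ipaddress.IPv4Address(text)
    return list(address.packed), int(address)

octets, as_integer = ipv4_octets("128.59.11.100")
print(octets)       # one integer in 0..255 per octet
print(as_integer)   # the same address viewed as a single 32-bit number
```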

  22. What’s “TCP/IP”? • TCP = Transmission Control Protocol • Provides an abstraction of reliable, bidirectional connections for the delivery of IP packets to a particular port at a given IP address • The so-called well known ports (< 1024) are reserved for specific protocols (telnet, ftp, smtp, pop3, imap, etc.) • By default, HTTP uses port 80; this can be changed in the URL • http://www.example.com:2012/doc.html • Main alternative is UDP = User Datagram Protocol, no connection, no reliable delivery (used by DNS)
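
A short sketch of how a client might pick the port to connect to: use the port given in the URL if there is one, otherwise fall back to the scheme's default. The DEFAULT_PORTS table and the function name effective_port are assumptions made for this example.

```python
from urllib.parse import urlsplit

# Well-known default ports for a few schemes (a small, illustrative table).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def effective_port(url):
    """Return the explicit port from the URL, or the scheme's default if none is given."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)

print(effective_port("http://www.example.com:2012/doc.html"))
print(effective_port("http://www.example.com/doc.html"))
```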

  23. HTTP History • HTTP/0.9 (1990) - simple protocol for raw data transfer • HTTP/1.0 (1996) - allows MIME-like messages, containing meta-data about the resources transferred and modifiers on the request/response semantics • HTTP/1.1 (1999) - lots of practical improvements, e.g., caching policies, chunked encoding, persistent connections • W3C closed activity but IETF still has a working group to revise

  24. What is “MIME”? • Multipurpose Internet Mail Extensions • Standard representation for “complex” message bodies (numerous RFCs since 1993) • Examples include messages with embedded graphics or audio clips, messages with file attachments, messages in Japanese or Russian, signed messages

  25. HTTP Properties • Uses URLs for identifying Web resources • Request-response – always initiated by client to server, the server responds with results • Stateless – each request-response pair independent from every other, so any state information (login credentials, shopping carts, etc.) needs to be encoded somehow

  26. HTTP Request/Response • [Diagram: HTTP client (other port), HTTP request, server (port 80), processing, response] • Web server processes HTTP requests, generally over TCP Port 80 • The request specifies a resource URL • The server parses the URL and processes the request, either returning a document with its type information or invoking a program or script and returning its output • The output (including metadata) is sent back to the client as a response message

  27. HTTP Requests • Small number of request types (GET, POST, HEAD, etc.) • Request may contain additional information, e.g. client info, parameters for forms, cookies, etc. • Consists of a start-line, zero or more headers (one per line), an empty line (CRLF) indicating the end of the header fields, and possibly a message-body
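
The message layout just described (start-line, headers, empty line, optional body) can be sketched as a small function that assembles the bytes a client would send. The function name build_request and its parameters are assumptions for this example; the host is the one used in the sample exchange later in the slides.

```python
CRLF = "\r\n"

def build_request(method, target, headers, body=b"", version="HTTP/1.1"):
    """Assemble start-line, one header per line, an empty line, then the body."""
    start_line = f"{method} {target} {version}"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # The extra CRLF produces the empty line that ends the header fields.
    head = CRLF.join([start_line, *header_lines]) + CRLF + CRLF
    return head.encode("ascii") + body

request = build_request(
    "GET", "/", {"Host": "psl.cs.columbia.edu", "Connection": "close"}
)
print(request.decode("ascii"))
```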

  28. HTTP Responses • Larger number of response codes (200 OK, 404 NOT FOUND) • Message body only allowed with certain response status codes • Includes MIME metadata as well as “payload” (data)

  29. Start Line • HTTP Version (0.9, 1.0, 1.1) • URI • Method (request) or Status Code (response)

  30. Sample HTTP Exchange • To retrieve the file at the URL http://psl.cs.columbia.edu • First open a socket to the host psl.cs.columbia.edu, port 80 (use the default port because none is specified in the URL) • Connect to 128.59.19.127 on port 80 ... ok

  31. Sample • Then, send something like the following through the socket:
      GET / HTTP/1.1[CRLF]
      Host: psl.cs.columbia.edu[CRLF]
      Connection: close[CRLF]
      User-Agent: Web-sniffer/1.0.37 (+http://web-sniffer.net/)[CRLF]
      Accept-Encoding: gzip[CRLF]
      Accept-Charset: ISO-8859-1,UTF-8;q=0.7,*;q=0.7[CRLF]
      Cache-Control: no-cache[CRLF]
      Accept-Language: de,en;q=0.7,en-us;q=0.3[CRLF]
      Referer: http://web-sniffer.net/[CRLF]
      [CRLF]

  32. Sample • The server should respond with something like the following, starting with the status line:
      HTTP/1.1 403 Forbidden[CRLF]
      Content-Length: 218[CRLF]
      Content-Type: text/html[CRLF]
      Server: Microsoft-IIS/6.0[CRLF]
      X-Powered-By: ASP.NET[CRLF]
      Date: Sat, 22 Jan 2011 14:024:22 GMT[CRLF]
      Connection: close[CRLF]
      [CRLF]
      <html><head><title>Error</title></head><body><head><title>Directory Listing Denied</title></head>[LF]
      <body><h1>Directory Listing Denied</h1>This Virtual Directory does not allow contents to be listed.</body></body></html>
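
A minimal sketch of how a client could take such a response apart into status line, headers, and body. The function name parse_response is an assumption, and the raw message is a trimmed, hypothetical version of the sample above (fewer headers, shortened body); real clients also have to handle details such as repeated headers and chunked bodies, which are ignored here.

```python
def parse_response(raw):
    """Split a raw HTTP response into (version, status, reason, headers, body)."""
    # The first empty line (CRLF CRLF) separates the header section from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    version, status, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return version, int(status), reason, headers, body

# Trimmed, hypothetical response modeled on the sample exchange.
raw = (
    b"HTTP/1.1 403 Forbidden\r\n"
    b"Content-Type: text/html\r\n"
    b"Connection: close\r\n"
    b"\r\n"
    b"<html><body><h1>Directory Listing Denied</h1></body></html>"
)
version, status, reason, headers, body = parse_response(raw)
print(version, status, reason)
print(headers)
```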

  33. Some Request Headers • User-Agent: identifies the program that's making the request, in the form "Program-name/x.xx", where x.xx is the alphanumeric version of the program (e.g., browser) • User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9) Gecko/2008052906 Firefox/3.0 • Referer: the URL of the previous webpage from which a link was followed • Referer: http://web-sniffer.net/

  34. Some Response Headers • Server: analogous to User-Agent:, identifies the server software in the form "Program-name/x.xx" • Server: Apache/2.2.8 (Ubuntu) • Last-Modified: gives the modification date of the resource that's being returned, e.g., for use in caching • Uses Greenwich Mean Time, in the format Last-Modified: Sat, 22 Jan 2011 14:46:32 GMT

  35. HTTP URIs • Servers may accept URIs up to some bounded length (often 255) or of “unbounded” length; status code 414 (Request-URI Too Long) reports a URI longer than the server will handle • Equivalence comparison: the following three URIs are equivalent • http://abc.com:80/~smith/home.html • http://ABC.com/%7Esmith/home.html • http://ABC.com:/%7esmith/home.html
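
One way to see why the three URIs above compare as equivalent is to normalize each before comparing: lowercase the scheme and host, fill in the default port when it is missing or empty, and decode percent-escapes of unreserved characters such as “~”. The sketch below is an illustration under those assumptions, not a complete URI comparison; the names are made up for this example, and the simple port handling ignores user info and IPv6 literals.

```python
import re
import string
from urllib.parse import urlsplit

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
DEFAULT_PORTS = {"http": 80, "https": 443}

def _normalize_escape(match):
    """Decode %XX if it encodes an unreserved character, else uppercase the hex digits."""
    char = chr(int(match.group(1), 16))
    return char if char in UNRESERVED else "%" + match.group(1).upper()

def normalize_http_url(url):
    """Return a tuple that is equal for URLs considered equivalent in this sketch."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Text after the last ":" in the authority is the port; empty or absent means default.
    _, colon, port_text = parts.netloc.rpartition(":")
    port = int(port_text) if colon and port_text else DEFAULT_PORTS.get(scheme)
    # An empty path is treated as "/".
    path = re.sub(r"%([0-9A-Fa-f]{2})", _normalize_escape, parts.path) or "/"
    return scheme, host, port, path, parts.query

urls = [
    "http://abc.com:80/~smith/home.html",
    "http://ABC.com/%7Esmith/home.html",
    "http://ABC.com:/%7esmith/home.html",
]
print({normalize_http_url(u) for u in urls})  # a single element if all are equivalent
```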

  36. Request Messages • Method SP Request-URI SP HTTP-Version CRLF • GET http://www.gailkaiser.org • Equivalent to client making TCP connection to bank.cs.columbia.edu on port 80, then sending “GET /” followed by the header “Host: www.gailkaiser.org” • Host field allows for virtual hosts

  37. What is a “virtual host”? • Enables the same machine to host multiple domain names, sometimes at the same IP address (name-based virtual hosting) • Important for website hosting (e.g., www.foo.com maps to /www/foo/site1 and www.bar.com maps to /www/bar/site2), but usually there can be only one secure https website per IP address/port
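
Name-based virtual hosting can be pictured as a lookup keyed on the Host header. The sketch below uses the two mappings from the slide; the function name document_root and the fallback directory /www/default are assumptions added for illustration.

```python
# Host name -> document root, using the example mappings from the slide.
VIRTUAL_HOSTS = {
    "www.foo.com": "/www/foo/site1",
    "www.bar.com": "/www/bar/site2",
}

def document_root(host_header, default="/www/default"):
    """Pick the document root for a request based on its Host header."""
    # Drop an optional ":port" suffix and compare host names case-insensitively.
    host = host_header.split(":", 1)[0].strip().lower()
    return VIRTUAL_HOSTS.get(host, default)

print(document_root("www.foo.com"))
print(document_root("WWW.BAR.COM:80"))
```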

  38. GET • Retrieve whatever information (in the form of an entity) is identified by the URL • If the URL refers to a data-producing process, it is the produced data (given the input parameters after the “?”, if any) that is returned as the entity in the response - not the source text of the process (unless that text happens to be the output of the process) • Example: http://foo.com/run.cgi?name1=val1&name2=val2
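
A brief sketch of how the input parameters after the “?” are encoded by a client and decoded again by the receiving process, using Python's standard urllib.parse. The URL and parameter names are the ones from the slide; the helper names are made up for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def build_get_url(base, params):
    """Append URL-encoded name=value pairs after the '?'."""
    return base + "?" + urlencode(params)

def read_params(url):
    """Recover the parameters on the receiving side (each name maps to a list of values)."""
    return parse_qs(urlsplit(url).query)

url = build_get_url("http://foo.com/run.cgi", {"name1": "val1", "name2": "val2"})
print(url)
print(read_params(url))
```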

  39. Conditional and Partial GET • Conditional if the request message includes an If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range header field • Partial if the request message includes a Range header field • Don’t retrieve data the client doesn’t need (e.g., at least the part already up to date in cache)
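
A simplified sketch of the server-side decision for one kind of conditional GET, If-Modified-Since: answer 304 (Not Modified) when the resource has not changed since the date the client supplied, otherwise 200 with the full entity. The function name and the sample timestamps are assumptions; the other conditional headers and Range are not handled here.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def conditional_get_status(last_modified, if_modified_since=None):
    """Return 304 if the client's cached copy is still current, else 200."""
    if if_modified_since is None:
        return 200  # unconditional GET: always send the entity
    client_time = parsedate_to_datetime(if_modified_since)
    # HTTP dates have one-second resolution, so ignore microseconds.
    if last_modified.replace(microsecond=0) <= client_time:
        return 304
    return 200

# Hypothetical modification time of the resource on the server.
last_modified = datetime(2011, 1, 22, 14, 46, 32, tzinfo=timezone.utc)
print("Last-Modified:", format_datetime(last_modified, usegmt=True))
print(conditional_get_status(last_modified, "Sat, 22 Jan 2011 14:46:32 GMT"))
print(conditional_get_status(last_modified, "Fri, 21 Jan 2011 14:46:32 GMT"))
```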

  40. HEAD • Identical to GET except that the server must not return a message-body in the response - only returns headers • Often used for testing hypertext links for validity and modification • Can mark cache entries as stale if certain header information changes (e.g., length, last-modified)

  41. POST • Used to request that the server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI in the Request-Line • Actual function performed by the POST method is determined by the server, usually dependent on the Request-URI

  42. POST supports several functions • Annotation of an existing resource • Posting a message to a bulletin board, newsgroup, mailing list, or similar group of articles • Providing a block of data, such as the result of submitting a form, to a data-handling process • Extending a database through an append operation

  43. POST vs. GET • GET can only be used to send relatively small amounts of data to a server, with the data following the ? character • The rest of the request-URI (before the ?) refers to some kind of processing program • Example: GET /run.cgi?name1=val1&name2=val2 HTTP/1.0 • POST instead carries its data in the message-body, so it is not subject to the same size constraint
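
For contrast with the GET request line above, here is a sketch of the same two parameters sent as a form submission with POST, where the data travels in the message-body rather than after the “?”. The function name form_post and the host foo.com (borrowed from the GET slide) are assumptions for this example.

```python
from urllib.parse import urlencode

def form_post(path, host, fields):
    """Build a POST request whose body holds the URL-encoded form fields."""
    body = urlencode(fields).encode("ascii")
    head = (
        f"POST {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"  # tells the server how many body bytes follow
        "\r\n"
    )
    return head.encode("ascii") + body

message = form_post("/run.cgi", "foo.com", {"name1": "val1", "name2": "val2"})
print(message.decode("ascii"))
```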

  44. PUT and DELETE • Often unsupported (501 Not Implemented) • PUT requests that the enclosed entity be stored under the supplied Request-URI • May create a new resource at a new URI, or modify an existing resource already at that URI • DELETE requests that the origin server delete the resource identified by the Request-URI • May be overridden, e.g., by human intervention, even if status code indicates successfully completed • Effectively supplanted by WebDAV

  45. OPTIONS and TRACE • OPTIONS allows the client to determine the requirements associated with a resource, or the capabilities of a server (OPTIONS *), without implying a resource action or initiating a resource retrieval • TRACE used to invoke application-layer loop-back of the request message, allowing the client to see what is being received at the other end of the request chain for testing or diagnostic information

  46. HTTP Responses • HTTP-Version SP Status-Code SP Reason-Phrase CRLF • Example: HTTP/1.0 404 Not Found • Status code: 3-digit integer result code of the attempt to understand and satisfy the request • Reason phrase: short textual description of the Status-Code

  47. Response Messages • Larger number of response codes (200 OK, 404 NOT FOUND) • Message body only allowed with certain response status codes • Includes MIME metadata as well as “payload” (data)

  48. Status Codes • Applications need only understand first digit, treat others as equivalent to x00 • 1xx: Informational - Request received, continuing process ("100" : Continue, relevant to persistent connections in HTTP 1.1) • 2xx: Success - The action was successfully received, understood and accepted ("200" : OK) • 3xx: Redirection - Further action must be taken in order to complete the request ("300" : Multiple Choices) • 4xx: Client Error - The request contains bad syntax or cannot be fulfilled ("400" : Bad Request) • 5xx: Server Error - The server failed to fulfill an apparently valid request ("500" : Internal Server Error)
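
The “first digit” rule can be sketched in a few lines: the class of a status code comes from its hundreds digit, and an unrecognized code can be treated like the x00 code of its class. The table and function names below are made up for this example.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_class(code):
    """Name the class of a status code from its first digit."""
    return STATUS_CLASSES[code // 100]

def fallback_code(code):
    """The x00 code that an unrecognized status code may be treated as."""
    return (code // 100) * 100

print(status_class(404), fallback_code(404))
print(status_class(200), fallback_code(200))
```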

  49. HTTP is “Stateless” • Server doesn’t remember anything about client between connections • Not even between requests during the same persistent connection, except TCP data • So how does HTTP support “remembering” the user during a session or across sessions? • Some state can be encoded in complex URLs or otherwise in the web page itself (e.g., query strings added to links, hidden form fields) • Or saved on client in “cookies”

  50. Cookies • String associated with a name/domain/path, stored at the browser • Series of name-value pairs, interpreted by the web application • Created in HTTP response with “Set-Cookie:” (or “Set-Cookie2:”) • In all subsequent requests to this site, until cookie’s expiration, the client sends the HTTP header “Cookie:” (or “Cookie2:”) • Often have an expiration (otherwise expire when browser closed) • Various technical, privacy and security issues (e.g., inconsistent state after using “back” button, third-party cookies, cross-site scripting)
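
A small sketch of both halves of the cookie round trip, using Python's standard http.cookies module: the server side builds a Set-Cookie header, and on a later request the application reads the name-value pairs back from the Cookie header. The cookie names and values (session, theme) are hypothetical.

```python
from http.cookies import SimpleCookie

def make_set_cookie(name, value, path="/", max_age=3600):
    """Server side: build the Set-Cookie header line for one name-value pair."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = path
    cookie[name]["max-age"] = max_age  # expiration; without one the cookie ends with the browser session
    return cookie.output()

def read_cookie_header(header_value):
    """Server side, later request: recover name-value pairs from the Cookie header."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {name: morsel.value for name, morsel in cookie.items()}

print(make_set_cookie("session", "abc123"))
print(read_cookie_header("session=abc123; theme=dark"))
```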
