FEUP, João Neves 2013

World Wide Web
[email protected]

Before WWW
Major search tools: Gopher and Archie
Archie
• Searches indexes of FTP archives
• Filename-based queries
Gopher
• Friendly interface
• Menu-driven queries

Web Born
Tim Berners-Lee et al. at CERN in 1991
HyperText Transfer Protocol (HTTP)
Hypertext: links embedded in text that lead to another text document (hyperlinks)
RFC 1945, May 1996, HTTP/1.0
RFC 2068, obsoleted by RFC 2616, June 1999, HTTP/1.1

Internet Evolution
Year    Hosts (*)
1983    562
1984    1024
1985    1961
1986    2308
1987    5089
1988    28174
1989    80000
1990    290000
1991    500000
1992    727000
1993    1200000
1994    2217000
1995    4852000

Total Sites Across All Domains, August 1995 - March 2008
[Figure: chart of total sites over time]
Source: http://news.netcraft.com/archives/web_server_survey.html

Layering
Applications:       HyperText Transfer Protocol, Simple Network Management, Telnet, Dynamic Host Configuration
Transport:          Transmission Control Protocol (TCP), User Datagram Protocol (UDP)
Internet:           Internet Protocol (IP)
Network interface:  Ethernet, Wi-Fi, SONET

HTTP
Standard protocol for web transfer
Request-response interaction between client and server
The server holds resources such as HTML files and images
Request methods: GET, HEAD, PUT, POST, DELETE, …
Response: status line + additional information (e.g., a web page)

Introduction to HTTP
It has been in use by the World-Wide Web global information initiative since 1990
Its first version (referred to as HTTP/0.9) was a simple protocol for raw data transfer across the Internet
HTTP/1.0 improved the protocol by allowing messages to be in the format of MIME-like messages:
• containing metainformation about the data transferred and
• modifiers on the request/response semantics

HTTP Transaction
[Figure: the client and the server exchange HTTP messages; the server holds WebRoot, dir, file.html]
HTTP client: web browser
HTTP server: web server
Standard port: 80
Suggested alternate ports: 81, 8080, 8081
HTTP is used to transmit resources
• Files/documents
• Image files
• Query results
• Outputs from CGI scripts
• Anything that can be identified by a URL

Web Clients
Lynx 2.0 (1993, character-based interface)
NCSA Mosaic (1993, first with graphical interface)
Marc Andreessen (author of Mosaic) moved to Netscape
Microsoft Internet Explorer (“new name for Mosaic…”)
Mozilla Firefox
Opera
Safari
Chrome
…

The Browser
The browser
1. fetches the requested page
2. interprets the text and formatting commands that it contains
3. displays the page properly formatted on the screen
The page contains strings of text that are links to other pages, called hyperlinks
• On the screen the hyperlinks are highlighted, either by underlining, displaying them in a special color, or both

Web Servers
NCSA HTTPd                                     non-commercial free
Apache HTTP Server                             freeware
Apache Tomcat                                  freeware
lighttpd                                       freeware
Microsoft Internet Information Services (IIS)  payware
Zeus Web Server                                payware
Zope                                           freeware
...

Server Share
Server Share amongst the Million Busiest Sites, March 2009
[Figure: server share chart]
Source: http://news.netcraft.com/archives/web_server_survey.html

Markup Languages
HTML
SHTML
SGML
XML
…

Markup
“Markup” refers to codes inserted into text documents to manage formatting, printing or other processing.
Descriptive markup indicates the nature, function, or content of the data in a file.
Procedural markup defines what processing is to be carried out at particular points in the document.
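As a rough illustration of how the browser's "interpret" step separates markup codes from the text they annotate, the sketch below uses Python's standard html.parser module to split a small page into its tags, its text, and its hyperlinks. The class name LinkCollector and the sample page are made up for illustration; a real browser does far more (layout, styling, error recovery).

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: records tags, text, and hyperlink targets."""

    def __init__(self):
        super().__init__()
        self.tags = []     # markup codes seen, in document order
        self.text = []     # text content found between the markup
        self.links = []    # (target URL, link text) pairs
        self._href = None  # target of the <a> element currently open, if any

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lowercase.
        self.tags.append(tag)
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return  # ignore whitespace between tags
        self.text.append(stripped)
        if self._href is not None:
            self.links.append((self._href, stripped))


# Made-up sample page, used only for this sketch.
SAMPLE_PAGE = """
<html><head><title>Welcome</title></head>
<body><b>Great News!</b>
<a href="http://www.xptoo.org/">I'm the One</a></body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_PAGE)
collector.close()
print(collector.tags)   # the markup
print(collector.text)   # the text
print(collector.links)  # the hyperlinks a browser would highlight
```

The tags tell the program what each piece of text is (a title, bold text, a link), which is the sense in which the markup is descriptive; how to render each piece is left to the program reading it.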
HyperText Markup Language
Language in which web pages are written
Contains formatting commands
Tells the browser what to display and how to display it
Examples:
<TITLE> Welcome to My Great Site </TITLE>
• The title of this page is “Welcome to My Great Site”
<B>Great News!</B>
• Set “Great News!” in boldface
<A HREF="http://www.xptoo.org/">I’m the One</A>
• A link pointing to the web page http://www.xptoo.org/index.html with the text “I’m the One” displayed

Sample HTML Tags
<A> </A>          Anchor link or name
<BODY> </BODY>    Document contents
<BR>              Break
<FORM> </FORM>    Input form
<H1> </H1>        Heading level 1
<HEAD> </HEAD>    Header of a document
<HR>              Horizontal rule
<HTML> </HTML>    The doc type is HTML
<LI>              List item
<OL> </OL>        Ordered list
<P> </P>          Paragraph break
<PRE> </PRE>      Preformatted text
<TITLE> </TITLE>  Document title
<UL> </UL>        Unnumbered list

Uniform Resource Identifiers
RFC 2396, August 1998
A URI is an identifier for some resource, and a Uniform Resource Locator (URL) gives you specific information on how to obtain that resource
HTTP is also used as a generic protocol for communication between user agents and proxies/gateways to other Internet systems, including those supported by the following protocols:
• SMTP, NNTP, FTP
In this way, HTTP allows basic hypermedia access to resources available from diverse applications

Uniform Resource Identifiers
The following examples illustrate URLs that are in common use:
Name    Utility                                                          Example
ftp     ftp scheme for File Transfer Protocol services                   ftp://ftp.is.co.za/rfc/rfc1808.txt
http    http scheme for Hypertext Transfer Protocol services             http://www.math.uio.no/faq/compressionfaq/part1.html
file    Local file                                                       file:/usr/local/etc/ntp.conf
news    news scheme for USENET news groups and articles                  news:comp.infosystems.www.servers.unix
telnet  telnet scheme for interactive services via the TELNET Protocol   telnet://melvyl.ucop.edu/
mailto  mailto scheme for electronic mail addresses                      mailto:[email protected]
gopher  gopher scheme for Gopher and Gopher+ Protocol services           gopher://stap.umn.edu/00/Weather/Ca/Los%20Angeles

Uniform Resource Locator
<scheme>://[userinfo@]hostname[:port]/path[;parameters][?query]
Some URL schemes use the format "user:password" in the userinfo field. This practice is NOT RECOMMENDED, because the passing of authentication information in clear text (such as URI) has proven to be a security risk in almost every case where it has been used. [RFC2396]

HTTP
HyperText Transfer Protocol
A very simple, stateless protocol for sessionless exchanges
• The browser creates a new connection each time it wants to make a new request (for a page, image, etc.)
Exceptions:
• HTTP 1.1 added support for persistent connections and pipelining
• Clients and servers might keep state information
• Cookies provide a way of recording state

The http protocol: more
http: TCP transport service
• client initiates TCP connection (creates socket) to server, port 80
• server accepts TCP connection from client
• http messages (application-layer protocol messages) exchanged between browser (http client) and Web server (http server)
• TCP connection closed
http is “stateless”
• server maintains no information about past client requests

HTTP
GET /path/to/file/index.html HTTP/1.0
• GET: the HTTP method
• /path/to/file/index.html: the path, i.e. the part of the URL after the hostname (the request URI)
• HTTP/1.0: the HTTP version

HTTP Session
jneves@bart(1)$ telnet www.inescporto.pt 80
[...]
GET /~jneves/index.html HTTP/1.0
From: [email protected]
User-Agent: Camachina/5.0

HTTP/1.1 200 OK
Date: Tue, 26 May 2009 18:06:13 GMT
Server: Apache/2.30 (Unix) PHP/5.5 DAV/2 mod_perl/2.9 Perl/v5.20
Last-Modified: Fri, 04 May 2007 18:41:20 GMT
Accept-Ranges: bytes
Content-Length: 91
Connection: close
Content-Type: text/html

<html>
<head>
<meta HTTP-EQUIV="REFRESH" content="0; url=./index.shtml">
</head>
</html>
Connection closed by foreign host.

HTTP Request Headers
Header              Description
From                RFC822 e-mail address of the user
User-Agent          Client software
Accept              File types that the client will accept, e.g., text/plain, text/html
Accept-Encoding     Compression methods, e.g., x-compress; x-zip
Accept-Language     Language(s) used
Referer (optional)  URL of the document (or element within the document) from which the URL in the request was obtained
If-Modified-Since   Return the document if modified since the specified date
Content-Length      Length in octets of the data to follow
Content-Type        Type of the item
Pragma: no-cache    Directive understood by a proxy server; when present, the proxy should not return a document from the cache

HTTP Response Headers
Header          Description
Server          Server software
Date            Current date
Last-Modified   Modification date of the document
Expires         Document expiration date
Location        The location of the document in redirection responses
Pragma          A hint, e.g. no cache
MIME-Version
Link            URL of the document’s parent
Content-Length  Length in octets
Allow           Requests that the user can issue, e.g., GET

HTTP Status Codes
Code  Text
2xx   Success
3xx   Redirection
301   Moved Permanently
302   Found
4xx   Client Errors
400   Bad Request
401   Unauthorized
404   Not Found
5xx   Server Errors
500   Internal Server Error
502   Bad Gateway
503   Service Unavailable (e.g., server overload)

HTTP over TLS
bash-4.0# openssl s_client -connect secure.xptoo.org:443 -showcerts
CONNECTED(00000004)
[…]
---
GET / HTTP/1.0

HTTP/1.1 200 OK
[…]

HTTP 1.1 Features
Persistent TCP connections: remain open for multiple requests
Partial document transfers: clients can specify start and stop positions
Conditional fetch: several additional conditions
Better content negotiation
More flexible authentication

Static vs. Dynamic Pages
HTML pages vs. database
Personalized
Context-aware services
Browsing
Device-dependent

HTTP Proxy
An intermediary program which acts as both a server and a client for the purpose of making requests on behalf of other clients;
Requests are serviced internally or by passing them on, with possible translation, to other servers;
A proxy must implement both the client and server requirements of this specification;
The client makes a request to the proxy server using the complete URL;
The proxy server connects to the remote server and requests the resource relative to that server (no protocol and hostname in the URL).

HTTP Proxy
Client to proxy:  GET http://hostname/path/to/file.html HTTP/1.0
Proxy to server:  GET /path/to/file.html HTTP/1.0
Server to proxy:  HTTP/1.0 200 Document ....
Proxy to client:  HTTP/1.0 200 Document ....
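The rewriting step a proxy performs on the request line can be sketched as follows. This is a minimal illustration, not a working proxy: the function name to_origin_request is made up, and the sketch assumes a plain http URL and ignores headers, caching, and error handling beyond a basic check.

```python
from urllib.parse import urlsplit


def to_origin_request(proxy_request_line):
    """Hypothetical helper: turn the request line a client sends to a proxy
    into the server to contact and the request line to send to it.

    Returns a (host, port, request_line) tuple.
    """
    method, target, version = proxy_request_line.split()
    parts = urlsplit(target)
    if parts.scheme != "http" or not parts.hostname:
        raise ValueError("expected a complete http URL, got %r" % target)

    # Keep only the part of the URL after the hostname.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    port = parts.port or 80  # 80 is the standard HTTP port
    return parts.hostname, port, "%s %s %s" % (method, path, version)


host, port, line = to_origin_request(
    "GET http://hostname/path/to/file.html HTTP/1.0"
)
print(host, port)  # where the proxy opens its TCP connection
print(line)        # what the proxy sends over that connection
```

The scheme and hostname are dropped from the request line because the proxy has already used them to decide where to connect; only the path and query remain meaningful to the remote server.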
HTTP Proxy + Cache
Client to proxy:  GET http://hostname/path/to/file.html HTTP/1.0
Proxy to server:  GET /path/to/file.html HTTP/1.0
Server to proxy:  HTTP/1.0 200 Document ....
Proxy to client:  HTTP/1.0 200 Document ....
[Figure labels: the proxy holds a Cache; the server holds WebRoot, dir, file.html]

HTTP Proxy
Transparent
Configured (http://proxy.xptoo.org:3128/)
Automatic (Web Proxy AutoDiscovery)
Advantages vs. disadvantages

Why Web Caching (Proxies)?
Assume: cache is “close” to client (e.g., in same network)
• Smaller response time: the cache is “closer” to the client
• Decreased traffic to distant servers
• The link out of the institutional/local ISP network is often the bottleneck
[Figure: origin servers on the Internet; 1.5 Mb/s access link (bottleneck…); institutional network with a 10 Mb/s LAN and an institutional cache]

Web Load Handling
Thousands of clients
Load sharing
DNS Round Robin
Web Switching (L4, L7): load balancing devices
• Nortel Alteon
• A10 Networks
• Cisco Content Switching
• ...
Akamai

Bibliography
Comer, Douglas E. Internetworking with TCP/IP (Vol. I). Prentice Hall, 5th Ed. (2006). ISBN 0-13-187671-6
Tanenbaum, Andrew S. Computer Networks. Prentice Hall International Editions, 4th Ed. (2003). ISBN 0-13-038488-7