The Internet is an international network of addressable machines that communicate using the Internet protocol suite, commonly known as 'TCP/IP'. The World Wide Web (Web) is a set of interlinked data and documents hosted on the Internet. Links between data and documents are followed by making connections between the machines on the Internet that host and serve them.
This section of the course provides some details of how the Web works...
A 'network socket' connects machines so that data can be sent and received between them across a network. The normal operation is for a client to contact a server to open the socket for data transfer.
In Python, to open a network socket and send some data, the following code can be used:
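import socket
# Create a TCP (stream) socket for IPv4 addresses
socket_1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
socket_1.connect(("localhost", 5555)) # Address tuple: (host, port)
# sendall keeps sending until all of the bytes have been sent
socket_1.sendall(bytes("hello world", encoding="UTF-8"))
socket_1.close()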
In this code: the connection is to 'localhost' (the local machine); the 'port number' for the connection is 5555; and the String "hello world" is encoded as UTF-8 bytes, which are then sent.
The following code sets up a server socket to listen on the same port. The server code is set to receive up to 30 bytes of data and then print this after converting the bytes to a String:
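import socket
serversocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
serversocket.bind(('localhost', 5555))
serversocket.listen()
# accept() waits until a client connects, then returns a new socket for that connection
(socket_2, address) = serversocket.accept()
b = socket_2.recv(30) # Receive up to 30 bytes
print(b.decode("UTF-8")) # Convert the bytes to a String before printing
socket_2.close()
serversocket.close()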
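The client and server listings above need to be run as two separate programs, server first. As a rough sketch for experimenting in a single script, the two ends can be combined by running the server in a thread. The function names 'run_server' and 'loopback_demo' are made up for this illustration, and port 0 is used (rather than 5555) so that the operating system picks any free port:

```python
import socket
import threading


def run_server(server_socket, received):
    """Accept one client, read until it disconnects, and store the decoded text."""
    connection, address = server_socket.accept()
    with connection:
        chunks = []
        while True:
            chunk = connection.recv(30)  # Receive up to 30 bytes at a time
            if not chunk:  # An empty result means the client closed the connection
                break
            chunks.append(chunk)
    received.append(b"".join(chunks).decode("UTF-8"))


def loopback_demo(message="hello world"):
    """Send message from a client socket to a server socket on the local machine."""
    received = []
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_socket:
        server_socket.bind(("localhost", 0))  # Port 0: let the system choose a free port
        server_socket.listen()
        port = server_socket.getsockname()[1]
        server_thread = threading.Thread(target=run_server, args=(server_socket, received))
        server_thread.start()
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client_socket:
            client_socket.connect(("localhost", port))
            client_socket.sendall(message.encode("UTF-8"))
        # Leaving the inner "with" block closes the client socket, which ends the server loop
        server_thread.join()
    return received[0]


print(loopback_demo())
```

Reading in a loop until the client disconnects means the message can be longer than 30 bytes without being cut short.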
Web applications typically open multiple network sockets, and data transfers are normally initiated following a communication protocol.
Most computers use TCP/IP when communicating on the Internet. The Transmission Control Protocol (TCP) splits data into small chunks, ensures they all arrive, and reassembles them in the right order at the destination. The Internet Protocol (IP) wraps these chunks in "packets", addresses them to a specific destination computer, and routes them across the network.
Ports are numbers that individual software applications associate themselves with. The computer directs input arriving on a port to the software associated with that port. The first 1024 ports (0 to 1023) are allocated to specific purposes and protocols.
IP addresses are codes that identify each networked machine. The Domain Name System (DNS) maps between these codes and host names (the part of a Web address that names the machine), which are often easier to remember.
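As a small illustration of looking up the IP address for a host name, the sketch below uses the Python standard library. The function name 'lookup' is made up for this example, and 'localhost' is used so that no external network access is needed; the address returned for other host names will depend on the DNS settings of the machine running the code.

```python
import ipaddress
import socket


def lookup(host_name):
    """Return an IPv4 address (as a String) that host_name resolves to."""
    return socket.gethostbyname(host_name)


address = lookup("localhost")
print(address)  # Typically a loopback address such as 127.0.0.1
# The ipaddress module can parse the String and report properties of the address
print(ipaddress.ip_address(address).is_loopback)
```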
To set up client/server software using sockets, it is best to avoid specifying ports already in use and to liaise with your local IT team, who are likely monitoring network activity for suspicious behaviour.
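One way to check whether a port is already in use is to try binding a socket to it and see whether that fails. The following is a minimal sketch of that idea; the function names 'port_is_free' and 'find_free_port' are made up for this illustration, and the check only tells you about the moment it is run, since another program could take the port afterwards.

```python
import socket


def port_is_free(port, host="localhost"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as test_socket:
        try:
            test_socket.bind((host, port))
        except OSError:  # Raised, for example, when the address is already in use
            return False
    return True


def find_free_port(host="localhost"):
    """Ask the operating system for a free port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as test_socket:
        test_socket.bind((host, 0))
        return test_socket.getsockname()[1]


print(port_is_free(5555))
print(find_free_port())
```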
The World Wide Web (Web) is a client-server system using the hyper-text communication protocols HTTP and HTTPS. When a server gets a request it is usually to send out a Web page: a file stored in a directory on the server and referred to via a URL. The URL comprises: a protocol identifier, e.g. "http", "https", "ftp"; a host name, which is not case sensitive, e.g. "www.w3.org"; a path to the file on that server, which is often case sensitive, e.g. "/People/Berners-Lee/Overview.html"; and, sometimes, a port number to connect to (by default, HTTP connects to port 80 and HTTPS connects to port 443). Different delimiters are used to separate parts of the URL. A complete example marked up as a hyperlink is: https://www.w3.org:443/People/Berners-Lee/Overview.html
Web pages consist of text that is displayed and tags that are not. The tags are formatting details and references to other files like images or scripts that can provide style information, record user interaction and/or make the page dynamic and interactive. The language of the tags is the HyperText Markup Language (HTML). HTML files are text files typically saved with the filename suffix ".html" (and sometimes ".htm"). If the filename is missing from the URL, by default, many servers will send a file named "index.html" if it exists.
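The parts of a URL can be separated in Python using the standard library 'urllib.parse' module. The sketch below splits the example URL above; the function name 'describe_url' and the 'DEFAULT_PORTS' dictionary are made up for this illustration, and only the HTTP and HTTPS default ports mentioned above are included.

```python
from urllib.parse import urlsplit

# Default ports for the two Web protocols
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Split a URL into its protocol, host name, port and path."""
    parts = urlsplit(url)
    port = parts.port
    if port is None:  # No port in the URL, so fall back to the protocol default
        port = DEFAULT_PORTS.get(parts.scheme)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": parts.path,
    }


print(describe_url("https://www.w3.org:443/People/Berners-Lee/Overview.html"))
```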
A basic webpage:
<!DOCTYPE html>
<HTML lang="en-GB">
<HEAD>
<TITLE>Title for top of browser</TITLE>
</HEAD>
<BODY>
<!-- Page content goes here; this is a comment -->
</BODY>
</HTML>
All the elements are marked up using tags. Each tag starts with the symbol "<" and ends with the symbol ">". Most tags have a paired start and end tag, with the end tag name being the same, but preceded by the symbol "/". What is between the start and end tags is content.
HTML tags can be nested, so HTML can be regarded as having a tree structure which is called the 'Document Object Model (DOM)', where each element is a child of some parent, and each document has a root.
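To see the tree structure, the nesting of a page can be printed as an indented outline. The sketch below uses the 'html.parser' module from the Python standard library and the basic webpage from above. The class name 'OutlineParser' is made up for this illustration, and the code assumes every element has an explicit end tag, apart from a few common elements that never have one.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<HTML lang="en-GB">
<HEAD>
<TITLE>Title for top of browser</TITLE>
</HEAD>
<BODY>
<!-- Page content goes here; this is a comment -->
</BODY>
</HTML>"""


class OutlineParser(HTMLParser):
    """Record each start tag, indented by how deeply it is nested."""

    # Some elements that have no end tag (not a complete list)
    NO_END_TAG = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lower case
        self.outline.append("  " * self.depth + tag)
        if tag not in self.NO_END_TAG:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.NO_END_TAG:
            self.depth -= 1


parser = OutlineParser()
parser.feed(PAGE)
print("\n".join(parser.outline))
```

In the outline, 'html' is the root, 'head' and 'body' are its children, and 'title' is a child of 'head'.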
Data within a Web page is sometimes encoded in tables, that is, between a start table tag '<TABLE>' and an end table tag '</TABLE>'. Within a table, values are marked up using other tags for rows ('<TR>') and for the cells within each row ('<TD>', or '<TH>' for header cells).
In the next ABM practical you are guided through the process of parsing HTML and extracting some data from an HTML TABLE.
HTML elements may be given classes (generic groupings) and IDs (names specific to themselves) as attributes. These are declared in the start tag for the element, for example:
<TABLE class="data" id="table_1">
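Attributes like these make it easier to pick out one element from a page. As a rough sketch, the code below finds the table with a given ID and collects the text in its cells, using only the 'html.parser' module from the Python standard library. The class name 'TableExtractor' and the table contents are made up for this illustration, and the code assumes a simple, well-formed table with no tables nested inside it.

```python
from html.parser import HTMLParser

# Hypothetical example data
PAGE = """<TABLE class="data" id="table_1">
<TR><TH>x</TH><TH>y</TH></TR>
<TR><TD>1</TD><TD>2</TD></TR>
<TR><TD>3</TD><TD>4</TD></TR>
</TABLE>"""


class TableExtractor(HTMLParser):
    """Collect the cell text, row by row, from the table with a given ID."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs from the start tag
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.rows.append([])  # Start a new row
        elif self.in_table and tag in ("td", "th"):
            self.in_cell = True
            self.rows[-1].append("")  # Start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_table and self.in_cell:
            self.rows[-1][-1] += data.strip()


extractor = TableExtractor("table_1")
extractor.feed(PAGE)
print(extractor.rows)
```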
In general it is good practice to separate the content of a Web page from information about how to style it. This is typically done by storing the style information in a separate file called a Cascading Style Sheet (CSS). These are linked to the HTML in the 'HEAD' section with the following tag:
<link rel="stylesheet" href="url_to_css_file">
The CSS file can be located relative to the page in the directory structure by replacing the URL with the relative file path.
The focus of design should be usability. A key part of this is accessibility. If you are working for a public organisation, accessibility should be a major design driver.
Web pages can be retrieved by issuing HTTP requests. In Python, a good option for this is the Requests library, which comes with Anaconda.
Once the content of a Web page has been retrieved, a helpful library for parsing the HTML is the Beautiful Soup library.
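To try out an HTTP request without relying on an external Web site, the sketch below starts a small Web server on the local machine and then requests a page from it. It uses only the Python standard library ('http.server' and 'urllib.request') as a stand-in for the Requests library; with Requests, the equivalent request would be 'requests.get(url)', with the results available as 'response.status_code' and 'response.text'. The names 'PageHandler' and 'fetch_local_page' and the page content are made up for this illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

# Hypothetical page content served by the local test server
PAGE = b"<!DOCTYPE html><html><head><title>Demo</title></head><body>Hello Web</body></html>"


class PageHandler(BaseHTTPRequestHandler):
    """Answer every GET request with the same small Web page."""

    def do_GET(self):
        self.send_response(200)  # 200 is the HTTP status code for success
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # Keep the demonstration quiet by not logging each request


def fetch_local_page():
    """Start a local server, request its page, and return status, type and HTML."""
    server = HTTPServer(("127.0.0.1", 0), PageHandler)  # Port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Ignore any system proxy settings so the request goes straight to the local server
    opener = build_opener(ProxyHandler({}))
    try:
        with opener.open(f"http://127.0.0.1:{port}/", timeout=5) as response:
            html = response.read().decode("utf-8")
            return response.status, response.headers["Content-Type"], html
    finally:
        server.shutdown()
        server.server_close()


status, content_type, html = fetch_local_page()
print(status, content_type)
print(html)
```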