To understand web scraping, we need to understand how web pages work. It is time to take a closer look at the four-letter acronym behind them: HTML. Let’s go.
Table of contents
- What is HTML?
- How does HTML work?
- How does HTML structure web pages?
What is HTML?
HTML stands for HyperText Markup Language and is the main technology behind web pages. HTML is not a programming language like Python. Instead, it is a so-called markup language. Markup languages are used to explicitly tell a computer how to format or annotate a document. For example, a markup language can tell the computer that a certain piece of text should be red, or that a featured image should be placed next to the title. The markup is hidden from the user; the user only sees the formatted content. For the computer, however, the markup is the input that brings the plain document to life.
How does HTML work?
HTML structures a document into logical elements, such as text, tables, and images. To create these elements, HTML relies on tags.
For example, if we want to create a text element, then we would use a so-called <p> element:
<p>This is text.</p>
The tag <p> is the HTML codeword for a paragraph element. <p> is the start tag: it tells the browser that a new paragraph of text begins. </p> is the end tag: it tells the browser that the paragraph is over.
Everything placed between the start tag <p> and the end tag </p> is interpreted by the browser as the text of that paragraph. HTML expects elements to be closed with an end tag. Browsers are forgiving and will try to guess where an unclosed element ends, but their guess may not match what you intended, so it is good practice to always write the end tag.
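To make the start tag, text, and end tag visible, here is a minimal sketch that feeds the paragraph above to HTMLParser from Python's standard library. The class name TagLogger and the helper log_events are made up for this illustration; this is only one possible way to look at tags, not necessarily the tool used later for scraping.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records each start tag, text chunk, and end tag in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called when the parser meets a start tag such as <p>
        self.events.append(("start", tag))

    def handle_data(self, data):
        # Called for the text between tags; whitespace-only chunks are skipped
        if data.strip():
            self.events.append(("text", data))

    def handle_endtag(self, tag):
        # Called when the parser meets an end tag such as </p>
        self.events.append(("end", tag))


def log_events(html):
    """Hypothetical helper: parse `html` and return the recorded events."""
    parser = TagLogger()
    parser.feed(html)
    parser.close()
    return parser.events


print(log_events("<p>This is text.</p>"))
# [('start', 'p'), ('text', 'This is text.'), ('end', 'p')]
```

The three events mirror the three parts of the element: the start tag, the content, and the end tag.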
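Elements can have attributes. Attributes specify, for instance, the font size or a unique identifier of an element. For example, to center the text, we would add the align attribute to the start tag and set it to center:
<p align="center">This is text.</p>
In modern HTML, alignment is usually handled with CSS rather than the align attribute, but the name="value" syntax is the same for every attribute.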
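Attributes are written inside the start tag as name="value" pairs, and a parser hands them over in exactly that form. The following sketch, again based on Python's built-in HTMLParser, collects them into a dictionary. The class AttributeCollector, the helper collect_attributes, and the id value "intro" are assumptions added for illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects the attributes of every start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.found.append((tag, dict(attrs)))


def collect_attributes(html):
    """Hypothetical helper: return (tag, attributes) for each start tag."""
    parser = AttributeCollector()
    parser.feed(html)
    parser.close()
    return parser.found


# The id "intro" is an invented identifier, used only for this example
print(collect_attributes('<p align="center" id="intro">This is text.</p>'))
# [('p', {'align': 'center', 'id': 'intro'})]
```

Identifiers such as id are the attributes scrapers care about most, because they let us point at one specific element.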
Basic Web Page Structure
All HTML web pages follow the same structure. It looks like this:
<html>
  <head>
    <title>This is an example web page</title>
  </head>
  <body>
    <p>An exemplary text paragraph.</p>
  </body>
</html>
Now here’s what this means:
<html>: This is the root element. It tells the browser to interpret everything inside it as HTML code.
<head>: This element contains meta-information about the document. For example, the head element can specify keywords that search engines use to understand the content of your web page. In addition, the head element can contain links to external files, like scripts or more advanced formatting libraries. Most of what is listed in the head element is not visible to the user. As scrapers, we are rarely interested in it.
<body>: This element contains the content of the web page. All text, images, and videos that are to be displayed are placed here. In the example above, the body element contains a paragraph.
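The split between head and body can be checked programmatically. The sketch below uses Python's standard HTMLParser to sort the text of the example page into a "head" bucket and a "body" bucket. The names PAGE, HeadBodySplitter, and split_head_body are invented for this illustration, and the approach assumes a page with exactly one head and one body, as in the example.

```python
from html.parser import HTMLParser

# The example page from above
PAGE = """<html>
  <head>
    <title>This is an example web page</title>
  </head>
  <body>
    <p>An exemplary text paragraph.</p>
  </body>
</html>"""


class HeadBodySplitter(HTMLParser):
    """Sorts text into 'head' or 'body' depending on where it appears."""

    def __init__(self):
        super().__init__()
        self.section = None  # "head", "body", or None when outside both
        self.text = {"head": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.text:  # tag is "head" or "body"
            self.section = tag

    def handle_endtag(self, tag):
        if tag == self.section:
            self.section = None

    def handle_data(self, data):
        # Ignore the whitespace used for indentation between tags
        if self.section and data.strip():
            self.text[self.section].append(data.strip())


def split_head_body(html):
    """Hypothetical helper: return the text found in head and in body."""
    parser = HeadBodySplitter()
    parser.feed(html)
    parser.close()
    return parser.text


print(split_head_body(PAGE))
# {'head': ['This is an example web page'], 'body': ['An exemplary text paragraph.']}
```

The title ends up in the head bucket and the paragraph in the body bucket, which matches the description of the two elements above.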
The HTML Tree
Generally, each HTML document follows a tree structure. As in a family tree, HTML documents nest elements inside one another to define a structure. There are parent elements, sibling elements, and child elements.
It is important to understand the document tree because it helps us later on to get the data we need. Have a look at the following real-world example.
<html>
  <head>
    <title>App Details</title>
  </head>
  <body>
    <div id="stats">
      <ul>
        <li>App: Tower Madness 2019</li>
        <li>Price:2.00 USD</li>
        <li>Rating: 4 Stars</li>
        <li>Downloads: 2,000</li>
      </ul>
    </div>
  </body>
</html>
The web page contains the following new elements:
<div>: This is a container. It is used as a wrapper around text, lists, or images to group them and to help with their layout.
<ul>: This is an unordered list (i.e., a bulleted list).
<li>: This is a list item.
What we see from the example is that HTML documents typically follow a tree structure for the content. First there is a container (<div>); nested inside it is an unordered list (<ul>); and nested inside that are several list items (<li>). All HTML documents work in a similar way.
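The nesting can be made visible by printing each tag indented by its depth in the tree. This is a rough sketch using Python's built-in HTMLParser; the names APP_PAGE, TreeOutliner, and outline are made up here. It assumes that every element has an explicit end tag, as in the example, and would need extra care for elements without one (for instance <br> or <img>).

```python
from html.parser import HTMLParser

# The app example from above, written compactly
APP_PAGE = (
    "<html><head><title>App Details</title></head>"
    '<body><div id="stats"><ul>'
    "<li>App: Tower Madness 2019</li>"
    "<li>Price:2.00 USD</li>"
    "<li>Rating: 4 Stars</li>"
    "<li>Downloads: 2,000</li>"
    "</ul></div></body></html>"
)


class TreeOutliner(HTMLParser):
    """Builds an indented outline: one line per element, two spaces per level."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # children of this element sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # back up to the parent's level


def outline(html):
    """Hypothetical helper: return the outline lines for `html`."""
    parser = TreeOutliner()
    parser.feed(html)
    parser.close()
    return parser.lines


print("\n".join(outline(APP_PAGE)))
```

In the printed outline, html is the root, head and body are siblings one level below it, and the four li items are siblings that share ul as their parent.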
Why should we care about the tree structure? Well, the tree structure helps us traverse the document and extract the data we need. Continuing the example above, we can easily tell a program to go to the container with the ID "stats" and extract the price, rating, and downloads from the list nested within it. But more about this later.
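As a small preview, the following sketch does exactly that with Python's standard library alone: it waits for the container whose id is "stats" and collects the list items nested inside it. The names APP_PAGE, StatsExtractor, and extract_stats are assumptions made for this illustration, and dedicated scraping libraries make the same task much shorter.

```python
from html.parser import HTMLParser

# The app example from above, written compactly
APP_PAGE = (
    "<html><head><title>App Details</title></head>"
    '<body><div id="stats"><ul>'
    "<li>App: Tower Madness 2019</li>"
    "<li>Price:2.00 USD</li>"
    "<li>Rating: 4 Stars</li>"
    "<li>Downloads: 2,000</li>"
    "</ul></div></body></html>"
)


class StatsExtractor(HTMLParser):
    """Collects the text of every <li> nested inside one specific <div>."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.div_depth = 0    # greater than 0 while inside the target container
        self.in_item = False  # True while inside an <li> of that container
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1  # a nested div inside the container
            elif dict(attrs).get("id") == self.container_id:
                self.div_depth = 1   # entered the target container
        elif tag == "li" and self.div_depth > 0:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


def extract_stats(html, container_id="stats"):
    """Hypothetical helper: return the 'Key: value' list items as a dict."""
    parser = StatsExtractor(container_id)
    parser.feed(html)
    parser.close()
    stats = {}
    for item in parser.items:
        key, _, value = item.partition(":")  # split at the first colon
        stats[key.strip()] = value.strip()
    return stats


print(extract_stats(APP_PAGE))
# {'App': 'Tower Madness 2019', 'Price': '2.00 USD', 'Rating': '4 Stars', 'Downloads': '2,000'}
```

The extractor never looks at the page as a whole. It only follows the path container, list, list item, which is why knowing the tree structure matters.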