We will begin by building our first crawler. The goal is for the crawler to take the URL of a static web page as input and store the page on our computer.
Hands-on: Retrieving your first web page
Here is our template code for requesting data from APIs. Have a look at it, and then we will go through it block by block. In this example, we will be working with the Internet Archive API, which lets us look up archived, historical versions of a web page. It is a great way of obtaining historical data; more about this in data sources.
#import the packages we need
import requests

#setup
address = "https://en.wikipedia.org/wiki/Main_Page"
url = ('http://web.archive.org/cdx/search/cdx?url=' + address + '&output=json')
save_folder = "downloads\\"
save_name = "wikifrontpage.html"

#do the request
r = requests.get(url)

#download the web page
file = open(save_folder + save_name, "w+")
file.write(r.text)
file.close()
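The query URL above is built by plain string concatenation. As a side note, the requests library can assemble the same URL for us via its params argument; the following sketch (same endpoint and parameters as above, nothing is actually sent) shows how the final URL can be inspected before making the request:

```python
import requests

#the same CDX query as above, but with parameters passed as a dict
address = "https://en.wikipedia.org/wiki/Main_Page"
params = {"url": address, "output": "json"}

#prepare() builds the final URL without sending anything over the network
prepared = requests.Request(
    "GET", "http://web.archive.org/cdx/search/cdx", params=params
).prepare()
print(prepared.url)
```

Letting requests encode the parameters avoids subtle bugs with special characters in the address, which concatenation would pass through unescaped.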
First, we import the requests library in a new Python file. The requests library provides functions for sending requests to web servers and retrieving their responses. That’s all we need for now.
Next, we will call up the web page we want to obtain. To retrieve data from a web page, we can use a GET request. The GET method retrieves data from a URL that you specify. To make a GET request, invoke requests.get(). Let’s make a request.
#do the request
r = requests.get(url)
Stored in the variable r is the response. The response essentially contains the web page behind the URL that we specified; in particular, it contains the complete source code of the page.
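Before storing a response, it can be useful to check whether the request actually succeeded. A small sketch of such a sanity check, using example.com as a stand-in URL and assuming network access is available:

```python
import requests

#example.com is a stand-in URL for this illustration
r = requests.get("https://example.com")

print(r.status_code)  #200 means the request succeeded
print(r.ok)           #True for any status code below 400
print(len(r.text))    #size of the page's source code in characters
```

A status code of 404, by contrast, would tell us the page was not found, and saving r.text would store an error page rather than the content we wanted.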
In a final step, we take the response and store it as a file on our computer. There it can rest until later analysis.
#download the web page
file = open(save_folder + save_name, "w+")
file.write(r.text)
file.close()
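A common variant of this save step uses a with block, which closes the file automatically even if an error occurs mid-write. The sketch below uses a temporary folder and a placeholder string in place of the downloaded page; the utf-8 encoding is an assumption that matches most modern pages:

```python
import os
import tempfile

#placeholder standing in for r.text from the request above
page_text = "<html><body>Hello</body></html>"

save_folder = tempfile.mkdtemp()  #stand-in for the downloads folder
save_name = "wikifrontpage.html"
path = os.path.join(save_folder, save_name)

#"with" closes the file automatically, even if writing fails
with open(path, "w", encoding="utf-8") as file:
    file.write(page_text)

#read the file back to confirm the page was stored
with open(path, encoding="utf-8") as file:
    print(file.read() == page_text)  # → True
```

Using os.path.join instead of concatenating "downloads\\" also keeps the path working on both Windows and Unix-like systems.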
And that is all there is to say about the basic structure of a crawler.