
Crawling 104: Retrieving data from APIs

Next up, we will deal with what is probably the most convenient way of retrieving data for our research projects: APIs. Here’s the plan for this section:

Table of contents

  1. What is an API?
  2. Why should we prefer APIs for data collection?
  3. Retrieving data from APIs
  4. Advanced data retrieval from APIs

What is an API?

API stands for application programming interface. You have probably come across this term before. An API is nothing other than an interface to a database that helps you retrieve the data you’re interested in. For example, the Wikipedia API lets us automatically grab Wikipedia articles and their histories. Instead of writing a complex Wikipedia crawler, we can tell the API what data we need and get it delivered right away.

Here’s how APIs work. The API is like a bouncer for a database: it regulates access to the database. The API knows who is allowed to access the data and which part of the data can be accessed by whom. But the API can do even more. We can tell the API exactly what we need, and the API will then extract it from the database, in a much faster and less complex way than if we had to interact with the database ourselves.

In practice, the typical interaction with an API is that we send a “data request” to the API. The API then checks whether we’re allowed to access the data. If so, the API compiles what we requested and, after some time, sends us a response containing the data.

Why should we prefer APIs for data collection?

When searching for data sources, we should always look out for APIs first. APIs usually provide the simplest way to get the data we need, and the data typically arrives already cleaned up, so we can start the analysis soon afterwards. In addition, APIs are typically provided to keep crawlers and scrapers from putting pressure on the main web service, so using them is also the friendliest possible way of getting the data.

Hands-on: Retrieving data from APIs with Python

Here’s our template code for requesting data from APIs. Have a look at it, and then we will go through it block by block. In this example, we will be working with the Internet Archive API, which allows us to look up historical versions of a web page. It is a great way of obtaining historical data; more about this in data sources.

#import the packages we need
import requests
import json

#setup
address="https://en.wikipedia.org/wiki/Main_Page"
url = ('http://web.archive.org/cdx/search/cdx?url=' + address + '&output=json')

#do the request
r = requests.get(url)

#print the response
print(r.content)

First, we import the requests library in a new Python file. The requests library provides functions for sending requests to web pages and APIs and for retrieving their responses. We also import the json package, because the API we deal with responds with JSON-formatted data.

import requests
import json
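The setup step of the template builds the request URL by gluing strings together. As a minimal alternative sketch, the same URL can be assembled with urlencode from Python’s standard library, which also escapes special characters in the address. The helper name build_cdx_url and the constant CDX_ENDPOINT are made up for this illustration.

```python
from urllib.parse import urlencode

# Endpoint of the Internet Archive API, as used in the template above
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(address):
    """Return the request URL for the archived copies of `address`.

    build_cdx_url is a hypothetical helper name chosen for this sketch.
    """
    # urlencode percent-encodes the values, so characters such as ':' and '/'
    # inside `address` cannot be mistaken for parts of the query string
    query = urlencode({"url": address, "output": "json"})
    return CDX_ENDPOINT + "?" + query


url = build_cdx_url("https://en.wikipedia.org/wiki/Main_Page")
print(url)
```

The requests library can do the same encoding for us if we hand it the parameters as a dictionary via params=, so this helper is mostly useful when we want to inspect or log the final URL before sending anything.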

Next, we send the actual request to the API. As discussed above, we send a GET request to the API and receive a response in return. Let’s make a request and save what the API returns.

#do the request
r = requests.get(url)

Before we can access the data the API sent us, we need to make it readable. This is what happens in the final lines of code. The template above simply prints the raw response with r.content; here we go one step further and parse the JSON data right away with r.json().

#parse the JSON response and print it
response = r.json()
print(response)

Now, let’s take a look at the data. The request returns all copies of the Wikipedia front page in the Internet Archive. Since Wikipedia is an important website, it has been stored in the Internet Archive frequently. Here’s an extract of the data:

[["urlkey","timestamp","original","mimetype","statuscode","digest","length"],
["org,wikipedia,en)/wiki/main_page", "20031202224749", "http://en.wikipedia.org:80/wiki/Main_Page", "text/html", "302", "UWVMGXHQNWGCSOGUQXZDBPFHXBUVZAL3", "410"],
["org,wikipedia,en)/wiki/main_page", "20040209222712", "http://en.wikipedia.org:80/wiki/Main_Page", "text/html", "200", "VRQGWT3AN7XANBMZMS3RPLGDY5NGI5MY", "6592"],
["org,wikipedia,en)/wiki/main_page", "20040209222712", "http://en.wikipedia.org:80/wiki/Main_Page", "text/html", "200", "VRQGWT3AN7XANBMZMS3RPLGDY5NGI5MY", "6592"],
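The extract shows that the first row holds the column names and every following row is one archived copy. A minimal sketch of how such a response could be turned into a list of dictionaries, so that fields can be accessed by name, is shown here. The function name rows_to_records is made up for this illustration, and the sample rows are copied from the extract above.

```python
from datetime import datetime


def rows_to_records(rows):
    """Turn [header, row, row, ...] into a list of dictionaries."""
    if not rows:
        return []
    header, *data_rows = rows
    # Pair every value with the column name at the same position
    return [dict(zip(header, row)) for row in data_rows]


# Two rows from the extract above, preceded by the header row
sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["org,wikipedia,en)/wiki/main_page", "20031202224749",
     "http://en.wikipedia.org:80/wiki/Main_Page", "text/html", "302",
     "UWVMGXHQNWGCSOGUQXZDBPFHXBUVZAL3", "410"],
    ["org,wikipedia,en)/wiki/main_page", "20040209222712",
     "http://en.wikipedia.org:80/wiki/Main_Page", "text/html", "200",
     "VRQGWT3AN7XANBMZMS3RPLGDY5NGI5MY", "6592"],
]

records = rows_to_records(sample)

# The timestamp is a 14-digit string: year, month, day, hour, minute, second
first_capture = datetime.strptime(records[0]["timestamp"], "%Y%m%d%H%M%S")
print(first_capture)
```

With a real response, the call would be rows_to_records(r.json()). Note that all values, including the status code and the length, arrive as strings.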

That’s all there is to say about the basic structure for using Python to retrieve data from APIs. If you want to learn how to cleanse the obtained data and make it usable for your statistics software, continue here.

Advanced Tips

1. Checking response status

Along with the data, we also receive a response status from the API. For example, if a request returns the status code 200, everything is OK. Let’s print the status code of the request above.

print(r.status_code)
200

200 means everything is OK. But it could also be that our response is erroneous. Here’s a list of what the response numbers mean:

Code  Status                  Cause
200   OK                      – The request succeeded
400   Bad Request             – You requested data that is not available
                              – Your request didn’t follow the guidelines of the API
401   Unauthorized            – You forgot to log in before requesting the data
403   Forbidden               – You are not authorized to access the requested data
                              – You have been blocked from the API because you requested data too often or you exceeded your capacity limits
404   Not Found               – API or data does not exist anymore (maybe the web page has moved to another address)
429   Too Many Requests       – You have sent too many requests to the API
500   Internal Server Error   – Something went wrong on the API’s server while processing your request
503   Service Unavailable     – The API is temporarily overloaded or down for maintenance
                              – Some APIs also answer with this code when you requested data too often or exceeded your capacity limits

Therefore, I usually recommend checking for a 200 status before reading the API response. This keeps us from running into errors because something went wrong with our API request.

if r.status_code == 200:
    data = r.json()
    ...
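When a request fails, it helps to log what the status code means instead of the bare number. The following is a minimal sketch that looks up the standard name of a status code with HTTPStatus from Python’s standard library. The function names describe_status and is_ok are made up for this illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found' for a status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes outside the standard list, which some APIs do send
        return f"{code} (non-standard status code)"


def is_ok(code):
    """True only for status 200, the check recommended above."""
    return code == HTTPStatus.OK


for code in (200, 404, 429, 503):
    print(describe_status(code))
```

With the requests library, this would be called as describe_status(r.status_code). The library also offers r.raise_for_status(), which raises an exception for 4xx and 5xx responses, if you prefer to stop the script right away.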

2. Authenticate or Login

Some APIs require that you authenticate, say with a user/password combination or with an API key. How exactly to do this usually depends on the API you are dealing with, and your API provider has the most precise information. However, here is how it usually works. There are typically two ways that APIs handle authentication. The first is a classical user/password combination. You can send it right in your request statement by passing the API the additional authentication information it needs.

#setup
address="https://en.wikipedia.org/wiki/Main_Page"
url = ('http://web.archive.org/cdx/search/cdx?url=' + address + '&output=json')
payload = {'user_name': 'admin', 'password': 'password'}

r = requests.get(url, params=payload)

Second, some APIs give you an API key that you need to pass to the API in a “header”. In that case, you put the API key into the payload and pass the payload to the “headers” argument of the request.

#setup
address="https://en.wikipedia.org/wiki/Main_Page"
url = ('http://web.archive.org/cdx/search/cdx?url=' + address + '&output=json')
payload={'API-Key': 'key'}

r = requests.get(url, headers=payload)
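Writing the key directly into the script is fine for a quick test, but it is easy to share the script later and leak the key with it. One common alternative, sketched here under assumptions, is to read the key from an environment variable. The variable name MY_API_KEY and the helper name build_auth_headers are made up for this illustration, and the header name 'API-Key' is the placeholder from the example above; your API provider prescribes the real one.

```python
import os


def build_auth_headers(env=os.environ, variable="MY_API_KEY"):
    """Build the headers dictionary from an environment variable.

    MY_API_KEY and 'API-Key' are placeholders for this sketch; use the
    names your API provider prescribes.
    """
    key = env.get(variable)
    if not key:
        raise RuntimeError(f"Please set the {variable} environment variable first")
    return {"API-Key": key}


# Demonstration with a stand-in environment instead of the real one
payload = build_auth_headers(env={"MY_API_KEY": "key"})
print(payload)
```

In a real script, you would call build_auth_headers() without arguments and pass the result to the headers argument of the request as before.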

3. Dealing with API request limits

Many APIs throttle your access to the data to prevent the database from being flooded with requests. Daily API requests are often limited to a few thousand. What are best practices for living within these limits?

First of all, I recommend carefully monitoring your API usage. Especially while programming your API crawler, take care not to use up your API limit with the tests you run. For example, if you test code that sends multiple data requests to an API, add a break point to your code that exits after a few requests.
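Such a break point can be as simple as a counter in the loop. Here is a minimal sketch: the function collect and its parameter max_requests are made up for this illustration, and fake_fetch stands in for a real API call so that the example runs without using any quota.

```python
def collect(addresses, fetch, max_requests=3):
    """Call fetch(address) for each address, but stop after max_requests calls."""
    results = []
    for count, address in enumerate(addresses, start=1):
        if count > max_requests:
            print(f"Stopping after {max_requests} requests (test mode)")
            break
        results.append(fetch(address))
    return results


def fake_fetch(address):
    """Stand-in for a real API call; it sends nothing over the network."""
    return {"address": address, "status": 200}


# Ten example addresses, of which only the first three are requested
pages = [f"https://example.org/page{i}" for i in range(10)]
results = collect(pages, fake_fetch, max_requests=3)
print(len(results))
```

Once the crawler works as intended, the limit can be raised or the real request function can be passed in instead of fake_fetch.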

It also helps to take a step back and carefully plan what data is really necessary. Moreover, it is important to understand what the provider counts as an API request; maybe there is a way to combine requests.

Still, even if we manage our requests properly, limits may be too strict for the data we need. If we exceed the limits, the API usually answers our request with a status like “429” or “503”, and our data collection is halted for at least another day.
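For short-term limits (per minute rather than per day), a crawler can simply pause and try again instead of crashing. The following is a minimal sketch under assumptions: it assumes the API may send the standard Retry-After header with a number of seconds, which not every API does, and the function name get_with_patience and its parameters are made up for this illustration.

```python
import time


def get_with_patience(get, url, max_tries=3, default_wait=60, sleep=time.sleep):
    """Call get(url); on a 429 or 503 status, wait and try again.

    `get` is any function returning an object with `status_code` and
    `headers`, for example requests.get. The last response is returned
    even if it is still a 429 or 503.
    """
    for attempt in range(1, max_tries + 1):
        response = get(url)
        if response.status_code not in (429, 503):
            return response
        if attempt == max_tries:
            break
        # Retry-After may hold a number of seconds; fall back to default_wait
        # if the header is missing or holds something else (e.g. a date)
        wait = response.headers.get("Retry-After", "")
        seconds = int(wait) if str(wait).isdigit() else default_wait
        sleep(seconds)
    return response
```

With the requests library, the call would look like r = get_with_patience(requests.get, url). Waiting as the API asks is not the same as dodging the limit: the crawler still sends no more requests than the provider allows.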

What can we do if we have exceeded the API limits? It is usually a bad idea to try to dodge them. The limits are there for a purpose, and by sending you a 429 response, the API is kindly asking you to obey the rules. Spoofing an API is against the rules. In my experience, it is best to talk to the data provider and tell them about your research project. Most of the providers I have talked to were happy to grant me higher limits, because they knew that I would only be using the API for some days or weeks and would then stop my crawlers.
