
Crawling 102: Collecting web data with Selenium and Python

Selenium is a powerful tool for collecting web data. With Selenium, we can collect data from pages that load content dynamically or that keep their data behind login screens. In this blog post, we show you how to harness Selenium's power with just 20 (!) lines of code.

Table of contents

  1. What is Selenium?
  2. Why is Selenium so powerful for data collection?
  3. Selenium setup
  4. Selenium crawler: basic structure

1. What is Selenium?

Selenium was not originally intended for collecting web data. It is a framework for testing web pages: developers use it to simulate the interaction between web browsers and web pages. In particular, they can use Selenium to test whether their web pages run into errors under different browsers and simulated user behaviors.

2. Why is Selenium so powerful for data collection?

There are two main hurdles for collecting web-based data. First, some web pages are dynamic. On dynamic web pages, content is displayed piece by piece: further content appears only as you scroll down or after you click a separate button. For example, the New York Times shows article previews and, if you want to read the entire article, you need to click an extra button "show full article" to load the content. If we told a standard crawler to fetch articles from the New York Times, it would collect only the article previews, not the entire articles. More generally, dynamic content is difficult because we need to tell the crawler to search for such buttons and press them, which is hard for a crawler that only downloads the static page source.

Second, some of the data that we are interested in may be gated (e.g., behind login screens or captchas). Even when we are authorized to access the data, we need to tell our crawler how to log in and get past the login screen.

We can use Selenium to overcome both hurdles. Selenium has built-in procedures for loading dynamic content, pressing buttons, entering data into login screens and more. At the same time, we can use Selenium very easily within our existing Python environment.
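To make this more concrete, here is a minimal sketch of what pressing buttons and filling in a login form can look like. It is not taken from a real site: the CSS selector and the form field names "username" and "password" are hypothetical and have to be adapted to the page at hand. The driver argument is assumed to be a Selenium driver object such as the one we create in section 4.

```python
# Locator strategies are passed as plain strings here; they are the values
# behind Selenium's By.CSS_SELECTOR and By.NAME constants.
CSS = "css selector"
NAME = "name"


def click_all(driver, css_selector):
    """Click every element matching css_selector; return how many were clicked."""
    buttons = driver.find_elements(CSS, css_selector)
    for button in buttons:
        button.click()
    return len(buttons)


def log_in(driver, username, password):
    """Fill in and submit a login form.

    The field names "username" and "password" are assumptions for this
    sketch; inspect the real login page to find the actual names.
    """
    driver.find_element(NAME, "username").send_keys(username)
    password_field = driver.find_element(NAME, "password")
    password_field.send_keys(password)
    password_field.submit()  # submits the form that contains the field
```

With a running driver, a call could look like click_all(driver, "button.show-full-article"), where the selector is again only a placeholder. On real pages a click may reload parts of the page, so the list of buttons may have to be fetched again after each click.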

3. Selenium setup

Before getting started, we need to do some groundwork. First, we install Selenium on our computer.

1. Install Selenium

Open a command prompt (not the Python interpreter itself) and type:

pip install -U selenium
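To check that the installation worked, we can ask Python which version of the package it sees. The following is a small sketch using only the standard library (Python 3.8 or newer); the helper name installed_version is our own and not part of Selenium.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if Selenium is installed, otherwise None
print("selenium version:", installed_version("selenium"))
```

If this prints None, pip installed the package into a different Python installation than the one running the script.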

2. Get Chromedriver

Next, we need a driver that lets Selenium control a browser. We will use "Chromedriver", which controls Google's Chrome browser. Other browsers can be used as well, each with its own driver. Let's get the Chromedriver.

Choose the Chromedriver version that matches your operating system and your installed Chrome version, and download it to your computer.

https://sites.google.com/a/chromium.org/chromedriver/downloads 

The downloaded archive contains a file called "chromedriver.exe". Copy chromedriver.exe to the following folder (the folder will not exist yet, so create it first):

C:\Program Files (x86)\Chromedriver

3. Configure Chromedriver

Last, we need to tell Windows where Selenium can find Chromedriver once we start using it. This is the most important step, so proceed carefully.

Press the Windows key and "R" to open the Run dialog. Type "sysdm.cpl" and hit Enter.

In the window, click on the tab "Advanced". (Now you know that it is getting serious!)

On the advanced tab, click the button "Environment Variables".

In the lower part of the window, there's a list of system variables. Select the "Path" variable and click the button "Edit".

In the screen that opens, press "New". Paste the following: "C:\Program Files (x86)\Chromedriver". Note that the entry is the folder that contains chromedriver.exe, not the file itself. Press "OK" to leave the screen.

Last, press "OK" and close the remaining windows. We're done!
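Since this step is easy to get wrong, it can help to verify it from Python. The sketch below uses shutil.which from the standard library to look for chromedriver on the path; the helper name find_chromedriver is our own. Open a new command prompt before running it, because consoles that were already open do not see the changed environment variables.

```python
import shutil


def find_chromedriver(search_path=None):
    """Return the full path of chromedriver if it can be found, else None.

    search_path=None means that the PATH environment variable is searched.
    On Windows, shutil.which also tries the usual extensions such as .exe.
    """
    return shutil.which("chromedriver", path=search_path)


location = find_chromedriver()
if location is None:
    print("chromedriver not found; re-check the Environment Variables step")
else:
    print("chromedriver found at:", location)
```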

4. Selenium crawler: basic structure

We can set up a Selenium crawler in Python with only 20 lines of code. Here is the basic structure of the crawler; afterwards, we go through it step by step.

#housekeeping
import os
import sys
from selenium import webdriver
import codecs

#set our project folder (and create it if it does not exist yet)
path = os.path.join(os.path.expanduser("~"), "Desktop\\example\\")
os.makedirs(path, exist_ok=True)

#start Selenium with Chrome (chromedriver.exe is found via the Windows path set up in section 3)
driver = webdriver.Chrome()

#open Wikipedia front page in Selenium
driver.get("https://en.wikipedia.org/wiki/Main_Page")

#get the source code of Wikipedia front page
html = driver.page_source

#save the Wikipedia front page on our computer
with codecs.open(path+"wiki.html", "w", "utf-8") as file_object:
    file_object.write(html)

#close the browser
driver.quit()

Import

First, we import the packages that we need. Business as usual. For the Selenium crawler, we need the OS package, the SYS package, the SELENIUM package (of course) and the CODECS package. The OS and SYS packages help us with storing and accessing files on our computer. The SELENIUM package contains the tools and functions we need today, namely for crawling dynamic web pages. The CODECS package lets us write the web page to a file with an explicit encoding (here UTF-8).

#housekeeping
import os
import sys
from selenium import webdriver
import codecs

Set project folder

Next up, we will set our project folder. This is where we will store the crawled web pages.

#set our project folder (and create it if it does not exist yet)
path = os.path.join(os.path.expanduser("~"), "Desktop\\example\\")
os.makedirs(path, exist_ok=True)

Start Chrome in Selenium

Now, we run Chrome in Selenium. This happens by creating a driver object. Because the Chromedriver folder is on the Windows path (see the setup in section 3), Selenium finds chromedriver.exe without us passing its location.

#start Selenium with Chrome
driver = webdriver.Chrome()

Open the Wikipedia front page

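We want to store the Wikipedia front page using Selenium. Similar to the GET request that we used to crawl static web pages, the Selenium driver also has a get function. We use it to tell Selenium to navigate to the specified URL. Once the page has loaded, the driver's page_source attribute holds the source code of the page, which we write to the file wiki.html in our project folder.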

#open Wikipedia front page in Selenium
driver.get("https://en.wikipedia.org/wiki/Main_Page")

#get the source code of Wikipedia front page
html = driver.page_source

#save the Wikipedia front page on our computer
with codecs.open(path+"wiki.html", "w", "utf-8") as file_object:
    file_object.write(html)

#close the browser
driver.quit()
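If we want to store more than one page, the last steps can be wrapped in a small function. The following is only a sketch: the name save_page is our own, and it uses Python's built-in open with an encoding argument instead of codecs.open. The driver argument is assumed to be the driver object created above.

```python
def save_page(driver, url, file_path):
    """Open url in the browser and write the page source to file_path.

    Returns the number of characters written.
    """
    driver.get(url)
    html = driver.page_source
    with open(file_path, "w", encoding="utf-8") as file_object:
        file_object.write(html)
    return len(html)
```

For the example above, the call would be save_page(driver, "https://en.wikipedia.org/wiki/Main_Page", path+"wiki.html").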
