data collection data processing python

Parsing 101: Extracting data from web pages

Over the past two weeks, I made great progress in collecting data for a new research project of mine. I had to deal with substantial amounts of web content and had to parse it in order to use it for some analyses. I typically rely on Python and its library Beautiful Soup for such jobs and the more I use it, the more I appreciate the little things. Here are the top three new hacks:

Getting the text between tags

I had to extract raw text from web content I scraped. The content I wanted was hidden in a complete mess of HTML tags like this:

</span></div><br><div class=“comment“>
<span class=“commtext c00″>&gt; &quot;the models are 100% explainable&quot;<p>In my experience this is largely illusory. People think they understand what the model is saying but forget that everything is based on assuming the model is a correct description of reality.<p>

Getting the „real“ text out of it can be tricky. One way is to use regular expressions, but this can become unmanageable given the variety of HTML tags.

Here comes a little hack: use BeautifulSoup’s built-in text extraction function.


from bs4 import BeautifulSoup

soup = BeautifulSoup(webcontent, "html.parser")

comment = soup.get_text()

2. No clue what you’re looking for? Prettify your output first

Before I do extract anything, I have a look at the web content–soup helps you get through the code salad with some function called „prettify“ to make it readable:


from bs4 import BeautifulSoup

soup = BeautifulSoup(textstring, "html.parser")

print soup.prettify()

where „name“ is the filename.

3. Extracting URLs from <a> tags

Sometimes you find a link like this and want to extract its URL:

Here’s the code:


from bs4 import BeautifulSoup

soup = BeautifulSoup(textstring, "html.parser")

a=soup.findAll("a")

url=a[1]["href"].lower().strip()

Kommentar verfassen

Trage deine Daten unten ein oder klicke ein Icon um dich einzuloggen:

WordPress.com-Logo

Du kommentierst mit Deinem WordPress.com-Konto. Abmelden /  Ändern )

Google Foto

Du kommentierst mit Deinem Google-Konto. Abmelden /  Ändern )

Twitter-Bild

Du kommentierst mit Deinem Twitter-Konto. Abmelden /  Ändern )

Facebook-Foto

Du kommentierst mit Deinem Facebook-Konto. Abmelden /  Ändern )

Verbinde mit %s

%d Bloggern gefällt das: