The Web holds a truly awe-inspiring amount of information, which we're usually happy enough to access through a Web browser. There are times, however, when your programs need to access it, and you don't want to worry about the details of the HTML mark-up.

There are HTML (or SGML, or XML) parsing libraries for most programming languages, but for this example we use a Python library called BeautifulSoup, which takes care of almost all of the work for you. BeautifulSoup is an extremely helpful tool to have at your disposal: it not only gives you functions to search and modify the parse tree, but also handles the broken and malformed HTML you're likely to encounter on an average Web page.

You can download the library from its Web page. It is also packaged in some popular software repositories, such as those used by the Debian and Ubuntu distributions, where it can be installed with apt-get.

We'll write a Web scraper that prints all the displayed text contained within <p> tags. This is a very simple implementation that is easy to trip up, but it should be enough to demonstrate how the library works.

First up, we need to retrieve the source of the page that we want to scrape. The following code will take an address given on the command line and put the contents into the variable html:

import urllib2,sys

address = sys.argv[1]

html = urllib2.urlopen(address).read()
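
The snippet above assumes an address is always supplied; run without one, sys.argv[1] raises an IndexError. Here is a minimal sketch of a guard, written for Python 3 (where urllib2.urlopen lives in urllib.request). The helper names get_address and fetch_html are illustrative and not part of the original script, and prepending http:// when no scheme is given is an assumption made for convenience:

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def get_address(argv):
    """Return the address passed as the first command-line argument.

    Exits with a usage message instead of an IndexError when it is missing.
    """
    if len(argv) < 2:
        prog = argv[0] if argv else "scraper.py"  # fallback name is hypothetical
        raise SystemExit("usage: %s <address>" % prog)
    address = argv[1]
    # Assumption: default to http:// when no scheme is given, since urlopen
    # rejects a bare host name such as "example.com".
    if "://" not in address:
        address = "http://" + address
    return address


def fetch_html(address):
    """Download the page and return its raw bytes."""
    with urlopen(address) as response:
        return response.read()
```

With these in place, html = fetch_html(get_address(sys.argv)) would stand in for the two assignments above (after an import sys).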

Then we need to build a parse tree using BeautifulSoup:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)

At this point the mark-up has already been cleaned up and converted to Unicode by the BeautifulSoup library; you can print soup.prettify() to get a clean dump of the source code.

Instead, we want to print all of the text without the tags, so we need to find out which parts of the parse tree are text. In BeautifulSoup there are two kinds of nodes in the parse tree: plain text is represented by the NavigableString class, whereas Tags hold mark-up. Tags are recursive structures: each can hold many children, and each child is either another Tag or a NavigableString.
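
To make the shape of such a tree concrete, here is a toy model in plain Python 3. It does not use BeautifulSoup at all: the ToyTag class and the describe function are hypothetical stand-ins, with ordinary strings playing the role of NavigableString.

```python
class ToyTag:
    """Stand-in for a Tag: a name plus a list of children."""

    def __init__(self, name, children):
        self.name = name
        self.children = children  # each child is a ToyTag or a plain str

    def __iter__(self):
        # Iterating over a tag yields its direct children.
        return iter(self.children)


def describe(node, depth=0):
    """Return one indented line per node, showing how nodes nest."""
    indent = "  " * depth
    if isinstance(node, str):
        return [indent + "text: " + repr(node)]
    lines = [indent + "tag: <" + node.name + ">"]
    for child in node:
        lines.extend(describe(child, depth + 1))
    return lines


# Models the fragment <p>Hello <b>big</b> world</p>
paragraph = ToyTag("p", ["Hello ", ToyTag("b", ["big"]), " world"])
print("\n".join(describe(paragraph)))
```

The <p> node has three children: a string, a nested <b> tag holding its own string, and another string.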

We want to write a recursive function that takes part of the tree: if a node is a NavigableString, it prints it out; otherwise, it runs the function again on that subtree. This is easy because iterating over a tag yields its children.

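from BeautifulSoup import NavigableString

def printText(tags):
    # tags may be a list of Tags or a single Tag; iterating either yields nodes
    for tag in tags:
        if tag.__class__ == NavigableString:
            # Plain text: print it; the trailing comma suppresses the newline
            print tag,
        else:
            # A Tag: recurse into its children
            printText(tag)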

Then we just need to run that function on all the <p> tags. We can use BeautifulSoup's built-in parse tree searching functions to retrieve all of them:

printText(soup.findAll("p"))

That's it. You've got a fully functioning, if basic, HTML scraper. For more help with searching the parse tree, look up the BeautifulSoup documentation.

The full code for this example is as follows:

Listing A

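from BeautifulSoup import BeautifulSoup,NavigableString
import urllib2,sys,re

# Address of the page to scrape, given on the command line
address = sys.argv[1]

# Fetch the page source
html = urllib2.urlopen(address).read()

# Build the parse tree
soup = BeautifulSoup(html)

def printText(tags):
    for tag in tags:
        if tag.__class__ == NavigableString:
            print tag,
        else:
            printText(tag)
    # End the line once all of this node's children have been handled
    print ""

printText(soup.findAll("p"))

# Alternative to printText: join the text nodes matched by findAll's text argument
print "".join(soup.findAll("p", text=re.compile(".")))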
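
For comparison, the same task can be sketched using only the Python 3 standard library, which may be handy where BeautifulSoup is not installed. This swaps BeautifulSoup for html.parser.HTMLParser, which is event-based: it does not build a parse tree to search or modify. The ParagraphText class and paragraph_text function are hypothetical names, and treating each new <p> as closing the previous one is a simplifying assumption.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text that appears inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []  # one list of text fragments per <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Assumption: a new <p> implicitly ends any unclosed one.
            self.in_paragraph = True
            self.paragraphs.append([])

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> also arrives here.
        if self.in_paragraph:
            self.paragraphs[-1].append(data)


def paragraph_text(html):
    """Return the text of each <p> element as a list of strings."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return ["".join(parts) for parts in parser.paragraphs]


sample = "<html><body><h1>Title</h1><p>Hello <b>big</b> world</p><p>Bye</p></body></html>"
print(paragraph_text(sample))
```

The heading text is skipped because it sits outside any <p>, while the text inside the nested <b> tag is kept.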

Comments

1

David - 30/11/07

Wow..Until last night, I hadn't written a line of Python. I spent an hour or two reading the BeautifulSoup docs, and proceeded to scrape (with permission) over a thousand pages within a few more hours. The scraping itself took only 12 minutes. BeautifulSoup- very easy to code in.

2

Dumbo - 13/04/08

I'm not as smart as the guy above.

Having created the python file and opened it with IDLE i've tried to run it. I don't understand how or when to provide the address on the command line as you've instructed. I'm not prompted for an address on running the program. I get the following message if I just run it straight. Any ideas?

Traceback (most recent call last):
  File "C:\Python25\basicscraper.py", line 4, in <module>
    address = sys.argv[1]
IndexError: list index out of range

3

jake - 15/04/08

you'd do it

basicscraper.py example.com
