In my last article I provided a gentle quick start to working with Python. If you're not familiar with the language, you might want to go back and give that a read. Now that you've got the basics firmly under your belt, it's time to start putting them to use by writing something a little more interesting.
A few weeks ago I received an e-mail from a Web host letting me know that my space was about to expire, and that I had a month to back up all my files before they were deleted. Since I was only storing a few old photos on this particular host, it was no big loss, but I would still like to keep those pics. Rather than saving the files from the Web page one by one, or going through the Web host's administration page, I wanted to write something to handle it all for me. So we're going to walk through the development of a command line program that will parse a Web page and print the addresses of all images used on that page. By the end of this article we'll have run through opening and reading HTML data through HTTP, defining functions, adapting to various user inputs and (briefly) using regular expressions to parse text.
Defining Functions
First we need to run through one more basic language feature of Python: the Function. Functions let you set aside a block of code and give it a name, so that instead of typing the whole block of code each time you want to use it, you can just refer to it by name. Defining functions in Python is simple:
def hello(name):
    print "hello " + name
The word directly after the def keyword is the name of the function, and the words inside the parentheses are the names of the parameters -- the input to the function. Calling functions is just as easy:
>>> hello("world")
hello world
>>> hello("everyone")
hello everyone
Using functions is generally a good idea in all kinds of programming, since they reduce maintenance issues introduced by copying and pasting code, and allow you to group code together by what it does, making your program easier to read and maintain.
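The hello function above prints its result directly. The address-checking function later in this article hands a value back to the caller with return instead, so here is a minimal sketch of that pattern. The function name greeting is made up for illustration, and print is called with parentheses and a single argument so the snippet behaves the same on Python 2 and Python 3.

```python
def greeting(name):
    # Build the string and hand it back to the caller instead of printing it.
    return "hello " + name

# The returned value can be stored, combined with other values, or printed.
message = greeting("world")
print(message)
```

Returning the value rather than printing it lets the same function feed other code, which is what the rest of the program relies on.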
Managing user input
Whenever your program relies upon input from the user to work, you're going to run into the problem of incorrect input. Most of the time it's enough to just fail gracefully, printing an error and closing the program, but sometimes you can go one better: correct the input and continue. In this program, the user must give the program a Web address as an argument, so we need to check that the input is an address that we can work with -- this program is only for Web sites, and so can only accept addresses using the HTTP protocol. We'll write a function that will check this, and add the HTTP protocol specification if none is given. The full function is below. Don't worry if you don't understand it right away; we'll go through it in more detail:
import sys  # needed for sys.exit below

def parseAddress(input):
    if input[:7] != "http://":
        if input.find("://") != -1:
            print "Error: Cannot retrieve URL, protocol must be HTTP"
            sys.exit(1)
        else:
            input = "http://" + input
    return input
First we define the parseAddress function, which takes one parameter, called input. Next we need to determine if we've got a correct address, so we check if the start of the string (remember that input[:7] returns a slice of the string input containing its first seven characters) is "http://" -- if it is, no problem, we've got what we need. Otherwise, we could simply fail, but we can do better: if a different protocol is specified we print an error and exit, and if there's no protocol specified at all, we'll just assume that the user gave an HTTP address without adding the protocol specification at the beginning. We can check for the presence of a specification by using the string method find. find works on a string and a substring, and returns the index of the first match or -1 if the substring does not occur, like so:
>>> "hello world".find("hello")
0
>>> "hello world".find("wor")
6
>>> "hello world".find("word")
-1
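As a small illustration of how slicing and find combine in the address check, the sketch below uses made-up sample strings. One detail worth noticing is that slicing never raises an error when the string is shorter than the slice, which is why input[:7] is safe even for very short input. The variable names are only for illustration, and print is written with parentheses so the snippet runs on both Python 2 and Python 3.

```python
address = "http://www.builderau.com.au"

# A slice with only an end index returns the characters before that index.
prefix = address[:7]
print(prefix)

# Slicing a string shorter than the requested length just returns the
# whole string rather than raising an error.
short_prefix = "abc"[:7]
print(short_prefix)

# find returns the index where "://" starts, or -1 when it is absent.
with_protocol = "ftp://builderau.com.au".find("://")
without_protocol = "www.builderau.com.au".find("://")
print(with_protocol)
print(without_protocol)
```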
Let's test this function (Note: if you're trying this in the interpreter, remember to import sys for the exit function):
>>> parseAddress("http://www.builderau.com.au")
'http://www.builderau.com.au'
>>> parseAddress("www.builderau.com.au")
'http://www.builderau.com.au'
>>> parseAddress("ftp://builderau.com.au")
Error: Cannot retrieve URL, protocol must be HTTP
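Because parseAddress calls sys.exit, trying the last case in the interpreter also ends the session. If you would rather experiment without that side effect, one possible variation is to raise an exception and let the caller decide what to do. The sketch below is not the version used in the rest of this article: the name normalise_address is hypothetical, it uses the string method startswith in place of the slice comparison, and it assumes Python 2.6 or later (or Python 3) for the "except ... as" syntax.

```python
def normalise_address(address):
    # Hypothetical variant of parseAddress that raises instead of exiting.
    if address.startswith("http://"):
        return address
    if address.find("://") != -1:
        # Some other protocol (ftp://, https://, ...) was given explicitly.
        raise ValueError("Cannot retrieve URL, protocol must be HTTP")
    # No protocol at all: assume the user meant HTTP.
    return "http://" + address

for candidate in ["http://www.builderau.com.au", "www.builderau.com.au",
                  "ftp://builderau.com.au"]:
    try:
        print(normalise_address(candidate))
    except ValueError as error:
        print("Error: " + str(error))
```

The checks themselves are the same as in parseAddress; only the way the failure is reported differs.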
Opening and reading HTTP addresses
Python has a wide range of modules in its standard library that make otherwise complicated tasks very simple; in this instance we're going to use the urllib2 module to take the work out of opening Web pages. Opening and reading Web sites using urllib2 is as simple as opening text files:
import urllib2
website = urllib2.urlopen(address)
website_html = website.read()
Just like with files, things can go wrong when you try to open addresses on the Internet: maybe the server is down, your Internet connection is broken, or the file you're looking for just doesn't exist. Whatever the reason, you need to be able to handle these little problems, and in Python the right way to do that is through exceptions. urlopen can throw a number of different exceptions, but the major two you need to know are HTTPError, which is raised when the Web server you connect to sends an error code, and URLError, which is raised when another network or protocol error occurs. You can catch these exceptions like any other:
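try:
    website = urllib2.urlopen(address)
except urllib2.HTTPError, e:
    # The server answered, but with an error status such as 404
    print "Cannot retrieve URL: HTTP Error Code", e.code
except urllib2.URLError, e:
    # Network-level failure; assumes e.reason is an (errno, message) pair
    print "Cannot retrieve URL: " + e.reason[1]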
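One detail the handler above depends on is the order of the except clauses: HTTPError is a subclass of URLError, so it has to be caught first or the more general handler would swallow it. The sketch below shows this without touching the network by raising the exceptions by hand. It makes a few assumptions: the helper names describe_failure, raise_http_404 and raise_url_error are hypothetical, the "except ... as" syntax needs Python 2.6 or later, and the import falls back from urllib2 to urllib.error, which is where the same classes live in Python 3.

```python
try:
    from urllib2 import HTTPError, URLError        # Python 2
except ImportError:
    from urllib.error import HTTPError, URLError   # Python 3

def describe_failure(action):
    # Run action() and turn the two urlopen exceptions into messages.
    try:
        action()
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return "Cannot retrieve URL: HTTP Error Code %d" % e.code
    except URLError as e:
        return "Cannot retrieve URL: %s" % (e.reason,)
    return "OK"

def raise_http_404():
    # Stand-in for urlopen hitting a missing page.
    raise HTTPError("http://www.google.com/doesnotexist", 404,
                    "Not Found", None, None)

def raise_url_error():
    # Stand-in for urlopen failing before any HTTP response arrives.
    raise URLError("connection refused")

print(describe_failure(raise_http_404))
print(describe_failure(raise_url_error))
print(issubclass(HTTPError, URLError))
```

Swapping the two except clauses would make the 404 case fall into the URLError branch, which is an easy mistake to make when adding handlers later.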
So, for example, when you try to retrieve a URL that does not exist you'll see an error message like:
% python2.4 images.py www.google.com/doesnotexist
Cannot retrieve URL: HTTP Error Code 404
Now maybe that's enough information for an error code like 404; most of us have seen those before. But how many could tell you off the top of their head that error code 407 means that proxy authentication is required, or that 503 means that the service is unavailable, typically because the server is overloaded and cannot process the request? Clearly we need a more human-friendly way to tell the user about errors, and again, Python provides -- this time with a helpful dictionary defined in the module BaseHTTPServer. A dictionary is another basic Python type; in other languages it is sometimes called a hash or a map. We'll go into more detail about dictionaries another time, but for now you can think of one as a list where, rather than retrieving items by index, you can retrieve them by any kind of identifier you like. In this case, the dictionary BaseHTTPRequestHandler.responses provides a mapping between error code and explanation -- if you're interested in the full list see section six of the HTTP 1.1 specification RFC. So the following code:
import BaseHTTPServer
print BaseHTTPServer.BaseHTTPRequestHandler.responses[404]
Produces the following output:
('Not Found', 'Nothing matches the given URI')
We can use this dictionary to print more sensible error messages.
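One possible way to do that is sketched below. The helper name explain_status is hypothetical and its message format is only a suggestion. The import falls back to http.server, which is where BaseHTTPRequestHandler lives in Python 3, and the lookup uses get so that a code missing from the dictionary produces a generic message instead of a KeyError.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler   # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler      # Python 3

def explain_status(code):
    # Each entry in responses is a (short message, long explanation) pair.
    entry = BaseHTTPRequestHandler.responses.get(code)
    if entry is None:
        # Unknown code: fall back to the bare number.
        return "HTTP Error Code %d" % code
    short, long_text = entry
    return "HTTP Error Code %d (%s): %s" % (code, short, long_text)

print(explain_status(404))
print(explain_status(999))
```

A function like this could be called from the HTTPError handler with e.code in place of printing the number alone.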
1
shakil - 05/05/07
hi
Actually, I need to know how I can download an FTP file from NCBI using the Python module ftputil.
Please help me.
Thanks
regards,
Shakil
2
Stefan Schwarzer - 03/06/07
Hi Shakil,
Posting questions on ftputil on its mailing list at http://codespeak.net/mailman/listinfo/ftputil is more promising than asking here. :-) It was just by chance that I found your question.
The first argument of the FTPHost constructor has to be the name of the FTP server you want to connect to. In your case, that's just ftp.ncbi.nih.gov .
The error message you see ("getaddrinfo failed") is passed on from ftplib resp. the socket module and states that no IP for the hostname "ftp.ncbi.nih.gov/repository/OMIM/morbidmap" could be found. That makes sense because that is no hostname, "ftp.ncbi.nih.gov" is one, though. :)
3
mak - 04/10/07
Excellent article. Good for people who already have a programming/ scripting background.
4
Van - 07/01/08
Great article, I'll be patiently waiting for more :)
Just out of curiosity, what function would you use to download the files with the printed list?