Wednesday, October 21, 2009

Using Perl's LWP::UserAgent and HTML::Form Modules to Extract Data From a Web Page

I wanted a script that would display a random quotation each time I logged into my server from the command line. I suppose I could have used Linux's fortune program, but that's been done before, and besides, I wanted something different.

Here's how I used two modules from libwww-perl (LWP::UserAgent and HTML::Form) to extract random quotations from www.quotationspage.com. The libwww-perl collection is a set of Perl modules which provides a simple and consistent application programming interface to the World-Wide Web. The main focus of the library is to provide classes and functions that allow you to write WWW clients.

With amazingly little code, a Perl script can retrieve the HTML returned when a web page is requested, just like a browser. This is accomplished through Perl's LWP::UserAgent, a module that dispatches web requests.

In my case, I also needed to fill out an HTML form on a web page, click a Submit button, and extract specific data from the returned page. Again, this is very easy with another Perl module, HTML::Form.

In the following example, I hope to show how easy it can be to extract specific data from a web page using just these two Perl modules.

The two Perl modules used in this example are not installed by default. By far the easiest way to install them is by using your distro's package manager to install the libwww-perl package. That way you will get not only the two modules needed for this particular example, but the entire library of Perl www related modules. If you end up doing a lot of web related Perl programming, you will find uses for a lot more of the modules in the libwww-perl collection, so you might as well just install them all now.
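Once the package is installed, a quick check like the one below can confirm that both modules actually load. This is only a sketch: check_modules is a helper name I made up for illustration, and it uses nothing beyond core Perl. On newer distributions HTML::Form may be packaged separately from libwww-perl, so it is worth checking both names.

```perl
use strict;
use warnings;

# Report whether each named module can be loaded.
# check_modules is a hypothetical helper, not part of libwww-perl.
# Returns a hash of module name => 1 (loadable) or 0 (missing).
sub check_modules {
    my @modules = @_;
    my %status;
    for my $module (@modules) {
        # Turn Some::Module into Some/Module.pm so require accepts it
        (my $file = "$module.pm") =~ s{::}{/}g;
        $status{$module} = eval { require $file; 1 } ? 1 : 0;
    }
    return %status;
}

my %status = check_modules("LWP::UserAgent", "HTML::Form");
for my $module (sort keys %status) {
    print "$module: ", ($status{$module} ? "installed" : "MISSING"), "\n";
}
```

If either line says MISSING, install the package before running the script in this post.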

I am going to break the script into four units and explain the function of each unit. This should allow anyone reading this post to get a pretty good understanding of how the script works.

#!/usr/bin/perl 
use strict; 
use warnings; 
 
use LWP::UserAgent; 
use HTML::Form; 
 
my ($quote_count,$return_count,$user_agent,$response,@forms, 
    @checkbox_values,@form_out,$form,@inputs,$inp,$type, 
    $value,$check_value,$name,$page,$display_count,$quote, 
    $quote_string,$author); 
 
# Store the web form's checkbox values in an array  
@checkbox_values = ("mgm","motivate","classic","coles", 
                    "lindsly","poorc","altq","20thcent", 
                    "bywomen","devils","contrib"); 
 
# If the number of quotes to display is passed as a  
# command line parameter, store it in $quote_count,  
# otherwise set $quote_count to 1 
if (@ARGV) {$quote_count = $ARGV[0]}  
else {$quote_count = 1}  
 
# www.quotationspage.com's form has a minimum number of 4 
# and a maximum of 15 quotes.  If the number of requested  
# quotes is less than 4 or greater than 15, set the number 
# of returned quotes within those limits 
$return_count = $quote_count;  
if ($quote_count < 4) {$return_count = 4}  
if ($quote_count > 15) {$return_count = 15} 
 
# Create the UserAgent object 
$user_agent = LWP::UserAgent->new;
The opening section loads the Perl modules the program uses (LWP::UserAgent and HTML::Form), declares all the variables, sets the number of quotes to display (taken from the command line if one is given), and creates an instance of the UserAgent object.
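The relationship between the number of quotes you ask for and the number the form is told to return can be sketched as a small clamping function. The name clamp_count is my own invention for this illustration; the limits of 4 and 15 are the ones used in the script.

```perl
use strict;
use warnings;

# Clamp the requested number of quotes to the form's limits.
# clamp_count is a hypothetical helper; 4 and 15 come from the script.
sub clamp_count {
    my ($requested, $min, $max) = @_;
    return $min if $requested < $min;
    return $max if $requested > $max;
    return $requested;
}

for my $quote_count (1, 4, 10, 20) {
    my $return_count = clamp_count($quote_count, 4, 15);
    print "asked for $quote_count, form requests $return_count\n";
}
```

The script still prints only the number of quotes you asked for, even when the form has to return more than that.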


# Retrieve the page and parse every form on it into a  
# list of HTML::Form objects 
$response = $user_agent->get("http://www.quotationspage.com/random.php3");  
die "Request failed: ".$response->status_line."\n" unless $response->is_success; 
@forms = HTML::Form->parse($response);  
 
# Clear the array that the modified form will be "pushed"  
# into, skip the first form (it's not the one we want), 
# and store the second form's input objects in @inputs 
undef(@form_out);  
$form = shift(@forms);  
$form = shift(@forms);  
@inputs = $form->inputs; 
This section retrieves the form from the web page and stores the form data in an array that can be modified.
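To see what HTML::Form hands back, it can help to parse a small form from a string and list its inputs. The sketch below does that without touching the network. The form markup and its field names ("number", "collection") are made up for illustration and are not copied from www.quotationspage.com.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical form; the field names are invented for this example
my $html = <<'HTML';
<form action="/random.php3" method="post">
  <select name="number">
    <option value="4">4</option>
    <option value="5">5</option>
  </select>
  <input type="checkbox" name="collection" value="mgm">
  <input type="checkbox" name="collection" value="classic">
  <input type="submit" value="Get quotes">
</form>
HTML

# The second argument is the base URI used to resolve the form's action
my @forms = HTML::Form->parse($html, "http://www.quotationspage.com/");
my $form  = $forms[0];

# Collect [type, name, value] for each input; name or value may be undef
my @summary = map { [ scalar $_->type, scalar $_->name, scalar $_->value ] }
              $form->inputs;

for my $row (@summary) {
    my ($type, $name, $value) = map { defined $_ ? $_ : "(none)" } @$row;
    print "$type\t$name\t$value\n";
}
```

The select menu shows up as a single input of type "option", which is why the script tests for that type when it sets the quote count.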


# Parse the array containing the form data, entering the  
# number of quotes to request, checking all the checkboxes, 
# and filling out the outgoing form to be returned to the 
# web page's php program that processes the form 
for (my $i=0 ; $i<=$#inputs ; $i++)  
{ 
  $inp = $inputs[$i]; 
  $type = $inp->type; 
  if ($type eq "option") # Set the quote count  
  {$inp->value($return_count)} 
  if ($type eq "checkbox") # Check the checkboxes 
  { 
    $check_value = shift(@checkbox_values); 
    $inp->value($check_value); 
  }  
  $value = $inp->value; 
  $name = $inp->name; 
  if ($type ne "submit") {push(@form_out,$name,$value)}  
} 
 
# Send the completed form to the php script that processes 
# the web form, and store the response that comes back  
# (headers and HTML) in a string called $page 
$response = $user_agent->post('http://www.quotationspage.com/random.php3',\@form_out); 
$page = $response->as_string;
Now the form data is modified and "submitted" to the web page.
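As an alternative to building the name/value list by hand, HTML::Form can fill in the form and build the request itself. The sketch below is not what my script does. It shows the same idea using the module's value, check, and click methods on a made-up form; the markup and field names are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical form; the field names are invented for this example
my $html = <<'HTML';
<form action="/random.php3" method="post">
  <select name="number">
    <option value="4">4</option>
    <option value="5">5</option>
  </select>
  <input type="checkbox" name="collection" value="mgm">
  <input type="checkbox" name="collection" value="classic">
  <input type="submit" value="Get quotes">
</form>
HTML

# In list context parse returns every form; keep the first one
my ($form) = HTML::Form->parse($html, "http://www.quotationspage.com/");

# Choose an entry in the select menu by its value
$form->value("number", 5);

# Turn on every checkbox without needing to know its value
for my $input ($form->inputs) {
    $input->check if $input->type eq "checkbox";
}

# click builds an HTTP::Request object; nothing is sent yet
my $request = $form->click;
print $request->method, " ", $request->uri, "\n";
print $request->content, "\n";
```

To actually send it, the request object can be passed to an LWP::UserAgent object's request method.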


# Parse the HTML stored in $page, extracting the quotations 
# formatting the text, and displaying the requested number 
# of quotes 
 
$display_count = 0; 
 
# Look for a quote in the HTML 
while ($page =~ m/<dt class=\"quote\">(.*?)<\/dd>/gs) 
{ 
  $quote = $1;  
  if ($quote =~ m/\.html">(.*?)<\/a>/) # Extract the quote 
  {  
    $quote_string = $1; 
 
    # Replace any HTML break statements with Newlines 
    $quote_string =~ s/<br>/\n/gs;  
  }  
  if ($quote =~ m/<b>(.*?)<\/b>/) # Extract the author 
  {  
    $author = $1; 
    # If the author is embedded in a link, remove the link HTML 
    if ($author =~ m/\/">(.*?)$/)  
    {  
      $author = $1; 
      $author =~ s/<\/a>//;  
    } 
    $display_count++; 
    if ($display_count <= $quote_count)  
    {print $quote_string." - ".$author."\n\n"}  
  } 
}
The final section extracts the quotes from the HTML of the page that was returned when the "Submit" button was pressed, and displays them.
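The extraction step can be tried on its own against a small sample string. In the sketch below, the HTML fragment is made up to match the shape the regular expressions expect, and extract_quotes is a name I invented. One deliberate change from the script: it strips any tags around the author's name with a generic substitution instead of looking for the end of the link.

```perl
use strict;
use warnings;

# Hypothetical fragment shaped like the markup the regexes expect
my $page = <<'HTML';
<dt class="quote"><a href="/quote/1.html">First line<br>second line</a></dt>
<dd class="author"><b><a href="/quotes/Example_Author/">Example Author</a></b></dd>
<dt class="quote"><a href="/quote/2.html">Another quote</a></dt>
<dd class="author"><b>Unlinked Author</b></dd>
HTML

# Return a list of [quote text, author] pairs found in the HTML
sub extract_quotes {
    my ($html) = @_;
    my @quotes;
    while ($html =~ m/<dt class="quote">(.*?)<\/dd>/gs) {
        my $block = $1;
        my ($text)   = $block =~ m/\.html">(.*?)<\/a>/s;
        my ($author) = $block =~ m/<b>(.*?)<\/b>/s;
        next unless defined $text && defined $author;
        $text   =~ s/<br>/\n/g;      # HTML line breaks become newlines
        $author =~ s/<[^>]+>//g;     # strip any link tags around the name
        push @quotes, [ $text, $author ];
    }
    return @quotes;
}

print "$_->[0] - $_->[1]\n\n" for extract_quotes($page);
```

Regular expressions like these are tied to the exact markup of the page, so they will need adjusting if the site changes its HTML.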


The entire Perl script is available here, if you would like to simply download the script.

Special thanks to Vivian's Tech Blog for the code that permitted me to display the code in the shaded boxes!

2 comments:

  1. Great post! Just what I was looking for. Thanks for simplifying the read and parse features of LWP::UserAgent and HTML::Form. Found one minor error in your code as displayed on this page (OK in your link). Add ->new call to the statement below.

    # Create the UserAgent object
    $user_agent = LWP::UserAgent->new;

  2. You are quite welcome! I am glad that people find my posts useful.

    You were correct regarding the error in my sample code; I corrected the error, so the code in the post is now correct.

    Thank you for pointing out the error!


If you happened across my blog and found some of the information here useful, or you have a question, comment, suggestion, or perhaps (gasp!) a correction, please take a minute and leave a comment.
