In the previous post, I showed how WWW::Mechanize and HTML::TokeParser can grab content from a website, and parse that content to produce the information required for an RSS feed. So how do I make a feed with all that?
Well, what I don't do is reinvent the wheel. I found a perfectly reasonable Perl module called XML::RSS that produces RSS feeds. But before we look at that module, let's take a look at the information produced from the previous two posts. Here's the code from before:
    while (my $div_tag = $stream->get_tag("div")) {
        if ($div_tag->[1]{class} && $div_tag->[1]{class} eq "MBEntry") {
            my $id = $div_tag->[1]{id};
            $subject{$id} = get_subject($stream);
            $pubDate{$id} = get_pubdate($stream);
            $text{$id} = get_entry($stream);
        }
    }
When this loop finishes, here is what the %subject and %pubDate hashes contain (%subject first):
0 HASH(0x1217ef120)
'entry_175686' => 'Bruins Win on Montreal Ice'
'entry_176300' => 'Bruins Looking Good'
'entry_176899' => 'S.L. Price on Paul Pierce'
'entry_177491' => 'The Hip Diana Taurasi'
'entry_178076' => 'Calling the Shots'
DB<5> x \%pubDate
0 HASH(0x1217f02e8)
'entry_175686' => 'Sun, 23 Nov 2008 09:47:00 EDT'
'entry_176300' => 'Sat, 29 Nov 2008 21:37:00 EDT'
'entry_176899' => 'Fri, 05 Dec 2008 21:53:00 EDT'
'entry_177491' => 'Thu, 11 Dec 2008 22:12:00 EDT'
'entry_178076' => 'Thu, 18 Dec 2008 23:34:00 EDT'
I've omitted the %text hash from this listing. Take careful note that each key is unique to one BLOG post, and that the same key is used in every hash. From the listing, you can see that "Bruins Looking Good" corresponds to BLOG post entry_176300, and that it was published Saturday, November 29.
(Perl monks will recognize that this output is from the Perl Debugger. I remember when I first stepped through Perl code in this debugger; my first thought: "Now I can't be stopped!")
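Because the hashes share their keys, a single ID is enough to pull a whole post together. Here is a minimal sketch of that lookup, using two entries copied from the listing above. The lookup_post helper is hypothetical, written only for this illustration, and is not part of feedsn.pl.

```perl
use strict;
use warnings;

# Sample data copied from the debugger listing.
my %subject = (
    entry_176300 => 'Bruins Looking Good',
    entry_176899 => 'S.L. Price on Paul Pierce',
);
my %pubDate = (
    entry_176300 => 'Sat, 29 Nov 2008 21:37:00 EDT',
    entry_176899 => 'Fri, 05 Dec 2008 21:53:00 EDT',
);

# Hypothetical helper: given one ID, return that post's subject and date.
sub lookup_post {
    my ($id) = @_;
    return ($subject{$id}, $pubDate{$id});
}

# The same key works in every hash, so iterating over one hash's keys
# is enough to walk all of the posts.
for my $id (sort keys %subject) {
    my ($title, $date) = lookup_post($id);
    print "$id | $title | $date\n";
}
```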
The code to pass this information to XML::RSS is straightforward. Here's an excerpt:
    my $rss = XML::RSS->new(version => '2.0');
    $rss->channel(title => 'Rick on Sports',
                  link  => 'http://www.sportingnews.com/blog/rickumali',
                  # More code omitted
                 );
    foreach my $id (sort {$b cmp $a} keys %subject) {
        my $display_id = $id;
        $display_id =~ s/entry_//;
        my $truncated_text = truncate_text($text{$id},160);
        $rss->add_item(title       => $subject{$id},
                       link        => "http://www.sportingnews.com/blog/rickumali/$display_id",
                       pubDate     => $pubDate{$id},
                       permaLink   => "http://www.sportingnews.com/blog/rickumali/$display_id",
                       description => $truncated_text,
                      );
    }
The sort block {$b cmp $a} on the foreach line is the Perl idiom for producing a descending sort. The s/entry_// substitution is me removing the "entry_" prefix from the BLOG ID. The truncate_text call produces truncated text, so the full text of the BLOG isn't in the feed (thus driving subscribers to the site; all three of them). The feed can be seen at Feedburner.
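The three steps just described (descending sort, prefix removal, truncation) can be tried on their own, without XML::RSS. The sketch below is a rough illustration. The truncate_text shown here is a hypothetical stand-in, since this post does not show the real one from feedsn.pl, and the 12-character limit is there only to make the truncation visible on short strings.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the truncate_text() helper in feedsn.pl;
# the real implementation is not shown in this post.
sub truncate_text {
    my ($text, $max_len) = @_;
    return $text if length($text) <= $max_len;

    my $cut = substr($text, 0, $max_len);
    # Back up to the last whitespace so a word is not cut in half.
    $cut =~ s/\s+\S*$//;
    return $cut . '...';
}

my %subject = (
    entry_175686 => 'Bruins Win on Montreal Ice',
    entry_176300 => 'Bruins Looking Good',
    entry_178076 => 'Calling the Shots',
);

# {$b cmp $a} compares the keys as strings with the operands swapped,
# so the highest ID (the newest post) comes first.
for my $id (sort { $b cmp $a } keys %subject) {
    (my $display_id = $id) =~ s/^entry_//;    # "entry_178076" becomes "178076"
    print "$display_id: ", truncate_text($subject{$id}, 12), "\n";
}
```

Note that cmp is a string comparison. It orders these IDs correctly because they all have the same number of digits; IDs of mixed length would need a numeric sort on the stripped ID instead.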
I welcome you to download the code, and try to make sense of it yourself. If you decide to use it for some website you know, you may find yourself editing the code to fit your needs. Go for it. And someday soon on this blog I'll talk about the time I wreaked havoc at my hosting company with that truncate_text call.
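If you do adapt it, the one step the excerpt doesn't show is writing the feed out. Here is a minimal end-to-end sketch, assuming XML::RSS is installed from CPAN. It uses the module's as_string and save methods. The description strings and the feed.xml filename are placeholders of mine, not taken from feedsn.pl.

```perl
use strict;
use warnings;
use XML::RSS;    # CPAN module, not part of core Perl

my $rss = XML::RSS->new(version => '2.0');

$rss->channel(
    title       => 'Rick on Sports',
    link        => 'http://www.sportingnews.com/blog/rickumali',
    description => 'Placeholder channel description',    # assumed value
);

# One item, using an entry from the debugger listing earlier in the post.
$rss->add_item(
    title       => 'Bruins Looking Good',
    link        => 'http://www.sportingnews.com/blog/rickumali/176300',
    permaLink   => 'http://www.sportingnews.com/blog/rickumali/176300',
    pubDate     => 'Sat, 29 Nov 2008 21:37:00 EDT',
    description => 'Placeholder entry text',             # assumed value
);

# as_string returns the whole feed as XML text.
my $xml = $rss->as_string;
print $xml;

# To write the feed to a file instead (filename is a placeholder):
# $rss->save('feed.xml');
```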
The full source code is at feedsn.pl.