Paul Makepeace ;-)

November 1, 2006

Backstage: Get with the Program

Posted in: Software, Tech

There was a neat idea recently on the BBC Backstage list: produce a tag cloud of the words in subject lines. An implementation was posted but, as usual with almost everything posted on Backstage, no code.

It's frustrating that the BBC go to all the trouble of making their data available and then developers hoard their little snippets of code. I just don't understand it. Big kudos to MighTyV, who do open source their code and deservedly won the BBC Backstage competition.

So, without further ado, some code to implement tag clouds off mailing list subject lines:


(I generated the cloud from a mailbox earlier in the year, which has more topical subject lines.)

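#!/usr/bin/perl

use warnings;
use strict;

use Email::Folder;
use HTML::TagCloud;

# Mailbox path: first command-line argument, or a default under $HOME.
my $mbox = shift || "$ENV{HOME}/Mail/Lists/Backstage";

my $cloud = HTML::TagCloud->new;
my $folder = Email::Folder->new($mbox);
my %word_count;

# Walk the Subject header of every message in the mailbox.
foreach my $subject (map { $_->header("Subject") } $folder->messages) {
	# Keep tokens made only of word characters (this drops "Re:" and
	# other punctuated tokens), then remove a few common stop words.
	my @words = grep !/^(to|you|we|the|and|would|on)$/,
	            grep /^\w+$/, split ' ', $subject;
	$word_count{$_}++ for @words;
}

# Every tag links to the same URL; the count sets its weight in the cloud.
$cloud->add($_, 'http://realprogrammers.com/', $word_count{$_}) for keys %word_count;

# Emit HTML and CSS for at most the 50 most frequent words.
print $cloud->html_and_css(50);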

(I'm happy to comment this for anyone wishing to play with it.)
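For anyone who wants to poke at the counting step without a mailbox or the two CPAN modules, here is a rough sketch of just that loop run over a few sample strings. The subject lines and the count_words name are made up for illustration; the filters are copied from the script.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical sample data standing in for the mailbox's Subject headers.
my @subjects = (
    'Tag cloud of the list',
    'Re: Tag cloud of the list',
    'BBC feeds and the API',
);

# Count words across subject lines with the same filters as the script:
# whole-word tokens only, minus a small stop-word list.
sub count_words {
    my @lines = @_;
    my %word_count;
    for my $subject (@lines) {
        my @words = grep !/^(to|you|we|the|and|would|on)$/,
                    grep /^\w+$/, split ' ', $subject;
        $word_count{$_}++ for @words;
    }
    return \%word_count;
}

my $counts = count_words(@subjects);

# Print most frequent first, ties broken by string comparison.
for my $word (sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b } keys %$counts) {
    print "$word $counts->{$word}\n";
}
```

Note that "Re:" never reaches the count because the colon fails the \w+ test, while "of" does get counted since it isn't in the stop-word list.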

Posted by Paul Makepeace at November 1, 2006 12:49
Comments

Hi Paul - I didn't open source the code for the tag cloud because there is no code :-)

Here's what I did:

Went to http://www.tagcrowd.com, copied/pasted the text from the mailing list archive page (http://www.mail-archive.com/backstage@lists.bbc.co.uk), added the words "re, mr, hi, hello" to a blacklist so they're excluded, and pressed "Visualise!". Copied/pasted the resulting snippet into a blog post, and hey presto. Not a single line of code (not mine anyway...).

Posted by: Mario at November 1, 2006 13:21

At present the code is case-sensitive; to fix this, change "$word_count{$_}++ for @words;" to "$word_count{lc($_)}++ for @words;".
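One wrinkle with lowercasing only at the counting step: the stop-word pattern in the script is itself case-sensitive, so a capitalised "The" slips past it and then gets counted as "the". A possible variation, sketched here with a hypothetical count_words_ci helper and made-up sample subjects, lowercases each word before the stop-word check:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Same stop words as the script, kept in a hash for lookup.
my %stop = map { $_ => 1 } qw(to you we the and would on);

# Hypothetical helper: keep whole-word tokens, lowercase them,
# then drop stop words and count what is left.
sub count_words_ci {
    my %word_count;
    for my $subject (@_) {
        my @words = grep { !$stop{$_} }
                    map  { lc $_ }
                    grep { /^\w+$/ } split ' ', $subject;
        $word_count{$_}++ for @words;
    }
    return \%word_count;
}

# Made-up subjects that differ only in case.
my $counts = count_words_ci('The BBC feeds', 'the bbc API');
print "$_ $counts->{$_}\n" for sort keys %$counts;
```

With this ordering "BBC" and "bbc" land in the same bucket and both spellings of "the" are filtered out.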

Posted by: Earle Martin at November 1, 2006 14:15

Thanks Earle.

Mario: Code is steps in a process as much for humans as computers, so thank you for open sourcing your steps in the process for the rest of us!

It just occurred to me that the MighTyV project was the catalyst for the development of HTML::TagCloud. Funny ol' small world, eh?

Posted by: Paul Makepeace at November 1, 2006 14:37