Paul Makepeace ;-)

October 29, 2006

Converting Movable Type URLs

Posted in: Movable Type, Software, Tech

Back in Aug of 2004, I switched my blog to use descriptive URLs rather than the old-style, Movable Type 2 default of /archives/002981.html.

The downside to this is that after the site rebuild I had effectively created a whole 'nother blog: one crosslinked in the new style, and one in the old style. And because the rest of my site has random links into the old style from before the change, the old one continues to be crawled. To this day I still get search referrals pointing to the old site.

Google for 'rent scam' shows two results, both my site, both essentially same content

This isn't so bad except for two things. Comments go to the new site, not the old one. So referrals where someone's looking for rent scam help would, on the old site, miss out on all the useful (and entertaining in some cases) comments. The other bad thing is that the old site isn't running ads. OK, you might think that's a good thing :-)

So tonight I decided to fix this. Here's how I did it.


Overview: Redirect!

The strategy I took was to produce a set of redirects from the old site to the new site. The best way to do a redirect is at the webserver level as it teaches crawlers and browsers where the new pages are. There are in fact two types of redirects, temporary (aka "302") and permanent (aka "301"). Apart from being correct, the latter is also the preferred-by-search-engines method--tales abound of sites losing PageRank for using 302 temporaries.
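For reference, in Apache's mod_alias the two flavours look something like this (paths purely illustrative):

# 302, temporary: "the page is over there for now"
Redirect temp /old-page.html http://example.com/new-page.html
# 301, permanent: "the page has moved for good"
Redirect permanent /old-page.html http://example.com/new-page.html
# shorthand for the permanent form
RedirectPermanent /old-page.html http://example.com/new-page.html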

Using the Apache webserver, I needed to produce a whole pile of RedirectPermanent directives to be included in paulm.com's Apache configuration.
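In other words, the generated file wants to be full of lines of this shape (the new-style target here is made up for illustration):

RedirectPermanent /inchoate/archives/002919.html http://paulm.com/inchoate/2004/08/some_entry_title.html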

MT code to the rescue

For this I need to get a list of all my blog's entries and create URLs for their new version. I can create the old one easily as the URL is the blog entry's internal ID, zero-padded out to six digits. To create the new URL I had to dig around a little in MT's guts and call MT code directly.

Here's the answer,

export PERL5LIB=/home/mt/cgi-bin/lib
perl -MMT -MMT::Blog -le '
$mt = MT->new(Config=>"/home/mt/cgi-bin/mt-config.cgi");
$b = MT::Blog->load(13);
@e = MT::Entry->load({blog_id=>13});
for $e (@e) {
printf "RedirectPermanent /inchoate/archives/%06d.html %s\n", $e->id, $e->archive_url;
}' > /etc/apache2/paulm.com-redirect.conf

OK, what's going on here? First we need to create an instance of the top-level MT object. (Here we see an awesome example of how to abuse Perl's object model--subsequent object instantiations, e.g. MT::Blog, make no reference at all to that MT instance; it all just magically works. Ah, pixie dust.)

Next up we load up my blog object. I know its id is 13 by looking in the mt.cgi URL: blog_id=13. Alternatively you can look in the mt_blog database table. I don't actually need this line of code as it turns out I can pull the blog entries out directly. I decided to leave this line in there for some extra documentation and reference.
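Or, sticking with Perl, something along these lines ought to list every blog's id and name--a sketch, assuming MT::Blog->load() with no terms returns all blogs and that the id and name accessors behave as you'd expect (PERL5LIB set as above):

perl -MMT -MMT::Blog -le '
MT->new(Config=>"/home/mt/cgi-bin/mt-config.cgi");
printf "%d: %s\n", $_->id, $_->name for MT::Blog->load;
'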

So the load() method works with several MT types, the ones subclassed from MT::Object, and has a fairly rich interface. This example shows loading all the entries of blog_id 13. Reassuringly, when I printed scalar(@e) I got 336, which is the same number as reported on my blog dashboard page.
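That sanity check is just a variation on the same theme (again with PERL5LIB set):

perl -MMT -MMT::Entry -le '
MT->new(Config=>"/home/mt/cgi-bin/mt-config.cgi");
@e = MT::Entry->load({blog_id=>13});
print scalar @e;  # 336 here, matching the dashboard
'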

After divining that MT::Entry::archive_url was the right method for printing an entry's URL, the ball's in the net. The final piece was manually constructing the old-style URL using printf "%06d", which says "print this decimal zero-padded to six digits".
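If the format string is unfamiliar, here it is in isolation:

perl -le 'printf "%06d\n", 2919'   # prints 002919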

Now in my /etc/apache2/sites-available/paulm.com I simply added,

Include /etc/apache2/paulm.com-redirect.conf

and kicked the webserver,
apache2ctl configtest && apache2ctl graceful

(&& is an improvement on ; in that it'll only execute the next command if the first succeeded.)

Finally I of course needed to test that it worked: paulm.com/inchoate/archives/002950.html. Yep!
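Or, without a browser, curl can show the response directly--the Location header should point at the new-style URL:

curl -sI http://paulm.com/inchoate/archives/002950.html | egrep 'HTTP|Location'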

And in the logs,

perl -lane 'print if $F[8] == 301 and $F[6] =~ /\d{6}/' /var/log/apache2/paulm.com-access.log
x.x.x.x - - [29/Oct/2006:23:46:20 +0000] "GET /inchoate/archives/002919.html HTTP/1.1" 301 279 "-" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.8.0.7) Gecko/20060909 Firefox/1.5.0.7"

Do I need all those redirects?

Now, the exceedingly smart and observant reader will note that I haven't used the old-style URLs for two years. So two years' worth of entries were never written out to disk in that old format, and that means two years' worth of 301 redirects I don't need to have Apache consider. So I'm going to clear them out of paulm.com-redirect.conf. One option might be to re-run the script above with the date as a constraint in the MT::Entry->load() call. That would require me to figure out how to do that and, being lazy, I just can't be bothered (if you know, feel free to leave a comment). So I'm going to throw my sys-admin skills at the job instead.
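For the record, one untested possibility would be to skip the load() constraint altogether and filter on each entry's creation timestamp after loading. This sketch assumes created_on hands back MT's usual YYYYMMDDHHMMSS string and that August 2004 is the right cut-off:

perl -MMT -MMT::Entry -le '
MT->new(Config=>"/home/mt/cgi-bin/mt-config.cgi");
for $e (MT::Entry->load({blog_id=>13})) {
next unless $e->created_on lt "20040801000000";  # only entries from before the URL switch
printf "RedirectPermanent /inchoate/archives/%06d.html %s\n", $e->id, $e->archive_url;
}'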

But... I can feel a two-in-one trickshot coming on. To set it up, another observation:

I still have "two" blogs. I need to remove those old pages. It would be easy to just go ahead and wipe them out with rm inchoate/archives/00*.html but I'm curious to see if anything's left after I remove only the ones we know about.

Can you see the trickshot? I'm going to remove the unnecessary redirects and at the same time remove the old files, all in a one-liner. So, I re-used the results of the last bit of code (redirect.conf) pulling out the old-style URLs (/inchoate/archives/002918.html) and turning them into their place on the filesystem. Now I can test to see if they exist and if so, remove them and keep the redirect line, and if not, remove the redirect line.

perl -ani~ -e '-f ($f = "/home/www/paulm/paulm.com$F[1]") and unlink($f), print'  /etc/apache2/paulm.com-redirect.conf

This scary bit of code does an in-place modification (-i) of the redirect file, saving a copy with the ~ extension in case I make a mistake. I make use of the surprisingly little-used -a switch which splits the line into the @F array ('a' for 'awk', which by default has the fields in $1, $2, etc). So $F[1] is the old-style URL. If the file exists (the -f test), then unlink (remove) it and print the line. The effect of printing the line is to retain it in the file.

So as an aside we see two perl one-liner idioms:
perl -ni~ -e 'print if $some_condition' to keep only the lines matching some condition, and
perl -pi~ -e 'do_something_with_each_line(); # typically an s///' to transform every line.

Back to the job: I want to observe what happens, so I take before and after shots with ls $paulm/inchoate/archives/00*.html; wc -l /etc/apache2/paulm.com-redirect.conf. I.e. take the before shot, run the perl line above, then re-run the ls; wc -l for the "after" shot. For the latter I get, reassuringly,

ls: /home/www/paulm/paulm.com/inchoate/archives/00*.html: No such file or directory
47 /etc/apache2/paulm.com-redirect.conf

Since this has changed my apache configs, another webserver kick is needed.

And categories, monthly archives, RSS, and...?

It's all very well changing the entries, but what about the old-style archives/2004_08.html? Laziness prevailing here again, I checked the last week's logs for references,

ls $paulm/inchoate/archives/200*_*.html # this revealed four from 2004, hence:
grep archives/2004_0 $logs

Nothing. Delete 'em: rm $paulm/inchoate/archives/200*_*.html

The same wasn't true for categories unfortunately,

grep cat_ $logs

yielded one result, appearing repeatedly. I traced this to a single blog entry, which I manually edited and rebuilt. After that I deleted the cat_*.html pages.
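The cleanup itself was just (assuming the category pages live alongside the rest of the archives):

rm $paulm/inchoate/archives/cat_*.html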

The old RSS feeds could go, rm $paulm/inchoate/archives/[0-9]*.xml $paulm/inchoate/[0-9]*.xml

Correcting old links

We're not done yet! I want to make sure the rest of my site isn't referring to the old pages. No harm would come of it if I left it as is but I'd rather finish the job, and it'll keep my logs clean too.

My site's built from little XML fragments and full HTML pages, so I'm going to look in all of them and fix any references to the old style,


find $paulm \( -name '*.html' -o -name '*.xml' \) | xargs grep -l 'inchoate/archives/00' | tee /tmp/old-style-pages

This returned a bunch of blog and non-blog (i.e. "home page") entries. The blog entries will ultimately need their database records fixed, but I can fix the static files easily enough right now. I can also fix the generated blog pages and they'll be fine until I do another rebuild.

There were tantalisingly few there (about four)--I was sorely tempted to hand-edit them. But no, let's see if I can whip out another one-liner in less time than it takes to look each one up:

</tmp/old-style-pages xargs perl -pi~ -MFile::Slurp=read_file -le '
BEGIN {
# map each old-style path ("word" 1 of the redirect line) to its new URL ("word" 2)
%h = map { (split)[1 => 2] } read_file("/etc/apache2/paulm.com-redirect.conf");
# strip the host so the replacements come out root-relative
s~http://paulm.com~~ for values %h;
# one big alternation of all the old-style paths
$re = join "|", map quotemeta, keys %h;
};
# swap any old-style reference, with or without the base URL, for its new path
s{(?:http://paulm.com)?($re)}{$h{$1}}ge'

The neat trick here is that I'm doing an in-place file change while using the BEGIN clause to slurp in the redirect file and make a hash mapping the old-style pages ("word" 1 of each line) to the new URL ("word" 2). Skipping the host-stripping line for a second, I create a big regex to match any reference to the old-style URLs.

Now, those mystery references to http://paulm.com: this is quite subtle. The references in the script to my base URL http://paulm.com combine to remove it from the URLs in the pages. By not replacing with the full URL, the script also works on filesystem references, e.g. /home/www/paulm/paulm.com/inchoate/archives/002918.html. Without that you'd end up with /home/www/paulm/paulm.comhttp://paulm.com/inchoate... (I didn't get this right the first time round; only after I ran the script on one page and noticed it in a file that did an INCLUDE of a blog entry).
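To make that concrete: a page containing a reference like

http://paulm.com/inchoate/archives/002918.html

ends up with something like (the new-style path here is invented for illustration)

/inchoate/2004/08/some_entry_title.html

while a filesystem reference such as /home/www/paulm/paulm.com/inchoate/archives/002918.html gets the same swap after the document root, since there's no base URL there for the optional prefix to eat.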

(By the way, if you're thinking, my god, who writes scripts like that off the top of his head? The answer is, if you're prepared to experiment, play, and believe it's possible: "you")

What next?

The test for success here is no 404s (missing pages), just 301s (permanent redirects). So over the next few days I'll be keeping an eye on my server logs using techniques shown above, and tail -f the error log in a window.
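The 301-spotting one-liner from earlier adapts trivially to catching any stray 404s:

perl -lane 'print if $F[8] == 404' /var/log/apache2/paulm.com-access.log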

Moral

Changing one's URL scheme is not something to be undertaken lightly! However, if you do choose to, having a blogging engine that enables relatively straightforward construction of scripts and one-liners to manipulate its data renders the job a matter of a half-hour or so of reading docs and experimentation.

And of course behold the power of perl and unix to perform sophisticated data transformations and get the job done!

Posted by Paul Makepeace at October 29, 2006 22:31 | TrackBack
Comments

Paul, the thoroughness of your geekhood is continually inspiring. :) And, fairly typically, your perl continues to be nigh unreadable, particularly that last chunk. It's good to know that some things never change!

Posted by: wilhelm at June 29, 2007 03:03