Paul Makepeace ;-)

November 9, 2004

Comment spam new direction

Posted in: Movable Type

Everyone knows about MT-Blacklist, but it has some problems. This entry is a quick note on an idea I had last year: re-use a technique that's proving successful at rejecting mail spam, namely focussing specifically on identifying spammers' URIs rather than looking at the whole comment/message body.

First, a disclaimer: I only use the MT-Blacklist 1.x series, on my own hacked MT2.66. I should also add that I'm barrelling this entry out before heading to the pub, so it's not as linked-up or researched as I'd like. So, onward:

MT-Blacklist works by matching a huge list of patterns (the "blacklist") against incoming comments. The first problem is that it's slow: simply firing up mt-blacklist.cgi on this 1.8GHz server takes over ten seconds. The other problem is managing the blacklist: everyone running MTBL has their own copy, and a lot of the code in MTBL is aimed at managing this list.

How do the mail spam guys solve this? They identify links in the spam body and compare those against a list, essentially asking "is this website linked to a spammer?" The neat thing is that DNS supports this very easily. Such a system exists now at SURBL, the Spam URI Realtime Blocklist.
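To make the DNS part concrete, here's a rough sketch of how a comment could be checked. It pulls the hostnames out of any links in the comment and looks each one up under a blocklist zone; if the name resolves, the host is listed. The zone name below is SURBL's combined list, used purely as a stand-in, and the function names and the injectable resolve parameter are my own inventions for illustration. Note that it looks up the full hostname as-is, which is a simplification: a real client would need to reduce it to the registered domain first.

```python
import re
import socket
from urllib.parse import urlsplit

# Assumption: stand-in zone. A comment-spam list would need its own zone.
BLOCKLIST_ZONE = "multi.surbl.org"

# Deliberately crude: anything that looks like an http(s) link.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_hosts(text):
    """Return the distinct hostnames linked from a comment body, sorted."""
    hosts = set()
    for url in URL_RE.findall(text):
        try:
            host = urlsplit(url).hostname  # lowercased, port and userinfo dropped
        except ValueError:
            continue  # malformed URL, ignore it
        if host:
            hosts.add(host.rstrip("."))
    return sorted(hosts)


def is_listed(host, zone=BLOCKLIST_ZONE, resolve=socket.gethostbyname):
    """True if <host>.<zone> resolves, i.e. the blocklist has a record for it."""
    try:
        resolve(f"{host}.{zone}")
        return True
    except socket.gaierror:
        return False  # no such name: not listed (or the lookup failed)


def spammy_hosts(comment, resolve=socket.gethostbyname):
    """Return the linked hosts in a comment that the blocklist knows about."""
    return [h for h in extract_hosts(comment) if is_listed(h, resolve=resolve)]
```

The point is that the blog itself holds no list at all: each check is one DNS lookup, and the list lives (and gets maintained) in one place.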

So the idea is simply this: produce a blocklist of URIs that are appearing in comment spam, along with the supporting software.

The additional bits of software required are some method of reporting a URI and then entering it into the DNS. The SURBL guys solved this by using the SpamCop feed (link?). Something similar needs to be produced for comment spam since the overlap from mail spamming to comment spamming is apparently not as large as you might expect (reference?). That's to say, using the existing SURBL data for identifying comment spam isn't that effective.
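As a sketch of the "report it, then get it into the DNS" step: take a pile of reports and emit zone-file lines for the domains that enough different people have reported. Everything here is an assumption on my part, including the (reporter, domain) report format, the two-reporter threshold, and the function name. Answering with 127.0.0.2 is just the usual DNS blocklist convention, not something specific to any existing feed.

```python
from collections import Counter


def zone_records(reports, min_reports=2, answer="127.0.0.2"):
    """Turn (reporter_id, domain) reports into zone-file A records.

    A domain is listed only once at least `min_reports` distinct
    reporters have submitted it (hypothetical policy).
    """
    # Normalise domains and collapse duplicate reports from the same reporter.
    distinct = {(reporter, domain.lower().rstrip(".")) for reporter, domain in reports}
    counts = Counter(domain for _, domain in distinct)
    # Names are relative, so they land under whatever $ORIGIN the zone file sets.
    return [f"{d} IN A {answer}" for d in sorted(counts) if counts[d] >= min_reports]
```

Requiring more than one reporter is only there to stop a single person listing whatever they like; a real system would want something more careful.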

One interesting idea, which I'm not sure anyone is using yet, is to watch the nameserver query statistics to "pre-report" comment spam. My experience running MT for a few sites is that comment spam comes in vast, crushing waves that create load spikes. A wave like that should show up at the nameserver as a burst of lookups for the same URIs, so the system could auto-list those URIs for moderation (rather than outright blocking). I suspect this would be quite effective.
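Roughly, the "pre-report" idea could look like this: count lookups per domain over a sliding window and flag a domain for moderation once it crosses a threshold. The class name, the 60-second window and the threshold of 100 are all made up for illustration; I haven't measured what sensible numbers would be.

```python
from collections import defaultdict, deque


class QuerySpikeDetector:
    """Flags domains that are looked up unusually often in a short window."""

    def __init__(self, window=60.0, threshold=100):
        self.window = window        # seconds of history to keep per domain
        self.threshold = threshold  # lookups within the window that count as a spike
        self._seen = defaultdict(deque)

    def record(self, domain, now):
        """Record one lookup at time `now` (in seconds).

        Returns True if the domain should be queued for moderation.
        """
        times = self._seen[domain]
        times.append(now)
        # Forget lookups that have fallen out of the window.
        while times and now - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold
```

Feeding it would just be a matter of calling record() for each query the nameserver logs. A flagged domain goes to a human for moderation rather than straight onto the list, since a legitimately popular link would spike too.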

I actually have the apparatus here, having developed such a system. When I get a moment I might play with it.

Coincidentally, today Brad Choate tackled a similar problem by porting a WordPress plugin that identifies open proxies used for spamming: MT-DSBL.

Posted by Paul Makepeace at November 9, 2004 19:36