Keeping forum and blog comments clean

Archive - Originally posted on "The Horse's Mouth" - 2012-03-19 08:09:16 - Graham Ellis

We're all getting far too used to having to type in a word that's shown in an image, answer a multiple choice question, or do a sum and type in the answer when we want to post to or comment on an article on a web site. And sometimes those images are quite hard to make out - indeed they seem designed to be the reverse of accessible!

Question ... "Why are the words at the bottom of the page so hard to decipher? Why are they needed at all? We are not on some Nationally sensitive site."

Everyone who runs a site that welcomes public comment needs some sort of protection and strategy against contributions by people who are known as "forum spammers" - people who will contribute to a site, but off topic, with material that at best dilutes the site and at worst causes real offence ... and they'll do it to advertise their own products. This web site you're reading at the moment has a peak traffic of over 250 visitors per hour in the middle of a weekday, and if an advertiser can sneak his product (or, often, scam) onto a reputable site it will give it "street cred" and also help - by association - in search engine results. Search engines work along the lines of "if this is approved of by lots of reputable sites, then we should approve it more".

Does this affect even small new sites like our Melksham SCOB [Campus] site, where the question was asked? Yes - I don't think I'm giving anything away here - the very first comments were along the lines of "What a fascinating site. Have you seen this product [link]". The obvious follow-up question is "Why not simply delete these contributions?" ... but the answer is that they come too thick and fast; we have to have a mechanism that's prevention rather than cure.

There are two strategies to overcome forum spam. The first is to require all users to sign up, agree to terms and conditions, make some checks to be pretty sure that they're genuine, and then let them loose. This is what we use on a site that I look after as part of my campaign for an improved rail service for Melksham - see [here] for the registration page. It's excellent for a site where the operator anticipates regular contributions from the same people, where a continuity of submissions is useful, and where newcomers won't be too put off by the hurdles and initial wait to write their first contribution. The second is to check every post / contribution as it's made - yes, that involves repeated security checks that may be a bit irritating for the contributor - but it does get over that major hurdle of losing a high proportion of potential contributors because of sign up delays before they can even write anything.

That's given you an overview of why we need to protect against forum spam. The figures are huge; if you look at the Project Honeypot site you'll find figures in the millions, and if you look at the Stop Forum Spam site, you'll find that the whole front page is a list of spammers reported within the last minute!

Answering, now, the first part of the question. The words have to be hard to decipher to make it difficult for automated programs to read them - and character recognition is a very well developed science these days. If you can read it easily, then it's probable that a program can too. And once you get programs generating spam, based on a pattern and sending it out to large lists of possible target sites, you're in a very interesting "game" indeed.