Information request forms, cleaning up spam

Archive - Originally posted on "The Horse's Mouth" - 2005-04-05 06:32:03 - Graham Ellis

We've been discovered! Or rather ... our brochure request form has been discovered, just like the comment submission form to this diary has been discovered, by "spam engines".

These "spam engines" locate web forms, then complete them with information about online gaming, pharmaceutical products, and other goods and services that we're not interested in. They're characterised by a very high proportion of links, especially in text areas. I believe that they're hoping to find forms that will let them post information onto bulletin boards and other web sites.

How to deal with this nuisance? I've amended our information request form response script to compare the length of the text entered "raw" with the length of the text once the opening "href" link tags are stripped out. If the text shrinks by roughly a third or more (in the code, a raw-to-stripped ratio above 1.4), it's probably spam. It's hard to be sure, so I'm now in a testing phase that simply marks the emails sent by the brochure request system rather than discarding them.

Code (in Perl) to accumulate the full and stripped lengths, run on each field of the form:

$full_length += length($value);
$value =~ s/<a\s+href[^>]+>/ /ig;
$stripped_length += length($value);
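
As a rough illustration of how those three lines behave across a whole form, here is a minimal sketch that runs them over every field. The %form hash, its field names and its values are made up for the example; the real script would take its fields from the submitted form data.

```perl
use strict;
use warnings;

# Hypothetical form data: the field names and values are invented for illustration.
my %form = (
    name    => "A. Visitor",
    comment => 'Visit <a href="http://example.com/">here</a> now',
);

# Start both totals at 1 rather than 0 so that a blank form
# cannot lead to a division by zero when the ratio is taken later.
my ($full_length, $stripped_length) = (1, 1);

for my $field (sort keys %form) {
    my $value = $form{$field};
    $full_length += length($value);        # length as submitted
    $value =~ s/<a\s+href[^>]+>/ /ig;      # each opening link tag becomes one space
    $stripped_length += length($value);    # length with link tags removed
}

print "full=$full_length stripped=$stripped_length\n";
```

Note that the pattern only matches opening tags in which href is the first attribute, and it leaves the closing tag and the link text in place, so the stripped length is an approximation rather than a true "text only" count.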

Code that evaluates whether or not the posting is spam:

$spamfactor = $full_length / $stripped_length;
if ($spamfactor > 1.4) {
    $extraword = "SPAM";
} else {
    $extraword = "OK";
}

Note that I have also initialised the $full_length and $stripped_length variables to 1, not 0, so that the division cannot fail with a divide-by-zero error if anyone (or any automaton) submits a blank form.
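
Putting the two fragments together, the whole check might be wrapped up along the following lines. This is only a sketch: the subroutine name classify_submission and its $threshold parameter are my own inventions for the example, not part of the live script, and the 1.4 default is simply the figure used above.

```perl
use strict;
use warnings;

# classify_submission is a hypothetical helper, named here for illustration only.
# It takes the field values of one form submission and returns "SPAM" or "OK".
sub classify_submission {
    my ($values, $threshold) = @_;
    $threshold = 1.4 unless defined $threshold;   # same cut-off as the snippet above

    # Both totals start at 1 so that a blank form gives a ratio of 1, not an error.
    my ($full_length, $stripped_length) = (1, 1);

    for my $original (@$values) {
        my $value = defined $original ? $original : '';
        $full_length += length($value);
        $value =~ s/<a\s+href[^>]+>/ /ig;         # drop opening link tags
        $stripped_length += length($value);
    }

    my $spamfactor = $full_length / $stripped_length;
    return $spamfactor > $threshold ? "SPAM" : "OK";
}

# Example calls with made-up submissions.
my $blank  = classify_submission([]);
my $plain  = classify_submission(["Please send me a brochure"]);
my $linked = classify_submission([
    '<a href="http://example.com/a">x</a> <a href="http://example.com/b">y</a>',
]);

print "blank=$blank plain=$plain linked=$linked\n";
```

Returning a word rather than rejecting the submission matches the testing phase described above, where the result is only used to mark the outgoing email.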