Save the Forum - A regular clean sweep
Archive - Originally posted on "The Horse's Mouth" - 2007-05-17 23:37:36 - Graham Ellis
With many visitors and a great deal of exposure, our Save The Train web site attracts unwelcome content providers: people who come on to our forum or blog and post articles and comments that are way off topic. Why do they do it? Primarily to promote their pharmaceutical products, loans and betting schemes to the search engines, so that they rank on the back of our good name and popularity. Unfortunately, such posts also dilute our content, lower our ranking and at times shock and offend some of our readers. How to "solve" the problem [on the forum]?
We COULD go for a manually authorised signup procedure, and (at the levels we're looking at) the three moderators of the forum could cope with this. But it adds an extra hurdle for newcomers, who are likely to be put off by having to wait, perhaps a few hours, before they can make their first post.
We COULD use a captcha scheme where the new arrival has to retype a series of letters. That's great against the "autobots", but more and more of these signups are made by paid workers in low-wage parts of the world (kids there doing it for minimal pocket money), and a captcha doesn't stop a human.
We COULD add a filter that refuses messages, as they're posted, if they match a pattern that we want to reject. But the posters would know straight away that their payload had not been placed, and would be prompted to look for alternatives.
So what's the solution? There's no "100% solution" that I know of, but I have implemented a "clean sweep" system that goes around the boards from time to time, deleting posts which conform to certain criteria. It's run automatically under "crontab", so there's no need for any interaction on my part or on the part of our administrators. It's been tuned to err on the side of safety; in other words, any genuine newcomer is highly unlikely to have his / her first post killed. And it means that our board-spammers leave thinking that they have successfully delivered their payload.
If anyone would like to use the algorithm on their own board ... here's my SQL that finds the rogue posts. It would, mind you, need individual tuning.
select id_msg, smf_messages.id_member, posts, totalTimeLoggedIn, membername
  from smf_messages
  left join smf_members on smf_messages.id_member = smf_members.id_member
  where posts < 2
    and (body like "%[url%[url%" or body like "%href%href%")
    and body not like "%train%"
    and body not like "%wilts%"
    and body not like "%station%"
    and body not like "%swindon%"
    and id_msg > 5000
  order by id_msg
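For anyone who wants to try the criteria before pointing them at a live board, here is a minimal sketch in Python. It is not the script that runs on our forum. It assumes an in-memory SQLite database with a hypothetical cut-down schema holding only the columns the query touches, and the names sweep and demo are made up for illustration. It also relies on SQLite's LIKE ignoring case for ASCII letters by default, so check how your own database behaves.

```python
import sqlite3

# Hypothetical cut-down schema: only the columns the query touches.
SCHEMA = """
create table smf_members (
    id_member integer primary key,
    membername text,
    posts integer,
    totalTimeLoggedIn integer
);
create table smf_messages (
    id_msg integer primary key,
    id_member integer,
    body text
);
"""

# Same criteria as the query above, written with single-quoted literals.
FIND_ROGUE = """
select id_msg
  from smf_messages
  left join smf_members on smf_messages.id_member = smf_members.id_member
  where posts < 2
    and (body like '%[url%[url%' or body like '%href%href%')
    and body not like '%train%'
    and body not like '%wilts%'
    and body not like '%station%'
    and body not like '%swindon%'
    and id_msg > 5000
  order by id_msg
"""


def sweep(conn):
    """Delete posts matching the criteria and return the ids removed."""
    ids = [row[0] for row in conn.execute(FIND_ROGUE)]
    conn.executemany("delete from smf_messages where id_msg = ?",
                     [(i,) for i in ids])
    conn.commit()
    return ids


def demo():
    """Build a tiny made-up board, sweep it, report what went and what stayed."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("insert into smf_members values (?, ?, ?, ?)", [
        (1, "regular", 40, 9000),
        (2, "newcomer", 1, 300),
        (3, "spammer", 1, 20),
    ])
    conn.executemany("insert into smf_messages values (?, ?, ?)", [
        # Established member: kept, however many links the post has.
        (5001, 1, "Two links [url=a]a[/url] [url=b]b[/url] from a regular"),
        # Genuine first post with no links: kept.
        (5002, 2, "Hello, which station is best for Swindon?"),
        # New member, two links, no on-topic words: swept.
        (5003, 3, "Cheap loans [url=x]x[/url] and pills [url=y]y[/url]"),
        # Mentions trains, so the safety words protect it: kept.
        (5004, 3, "Trains! [url=x]x[/url] [url=y]y[/url]"),
        # Below the id_msg cut-off: kept.
        (4000, 3, "Old [url=x]x[/url] [url=y]y[/url]"),
    ])
    removed = sweep(conn)
    remaining = [row[0] for row in conn.execute(
        "select id_msg from smf_messages order by id_msg")]
    return removed, remaining


if __name__ == "__main__":
    print(demo())
```

The post numbered 5004 shows both the safety margin and its cost: a spammer who happens to mention trains gets through, which is the price of not deleting genuine newcomers. A function along the lines of sweep could then be called from a scheduled job, once the word list and thresholds have been tuned for the board in question.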
Disadvantages?
* A few spam messages make it through and still need to be deleted by hand
* Users will occasionally see recent spam posts before they are deleted
* The "latest post" for each board isn't recalculated; a good clue to those of us "in the know" that we have trapped a spam post, but perhaps a "bug" to users
* Rare chance of deleting a genuine post.