I'm really starting to get annoyed with all the spiders crawling my site just so they can put referrers in my web logs.
It wouldn't be such a big deal, since I can pick out which referrers are real and which aren't pretty quickly. The problem is that they're badly written, so they spider my links badly and send URL requests that generate about 10-15 error message emails a day!
They all use an Internet Explorer User-Agent, so I can't stop them that way. I've tried banning by IP, but they change it so often it's almost not worth it.
I'm half considering putting some sort of CAPTCHA on my site just to kill these bots… but that would stop every good bot too (e.g. Google).
Any thoughts anyone?
Btw – sorry about the lack of posts; I'm currently in the process of rewriting CT using a new OO framework idea I've been toying with, but more on that later…
Comments
Maybe this will help?
http://www.digitalmediaminute.com/article/1167/blocking-referer-spam
Thanks for that!
I’m not on Apache, so .htaccess isn’t an option. But I’ve written some code that looks for a bad referrer and pushes it out to
http://www.compoundtheory.com/banned.html
It should do the trick quite nicely I think.
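The kind of check described above could be sketched roughly like this (a hedged sketch, not the actual site code; the keyword list and function name are hypothetical, and the redirect target is the banned page linked above):

```python
# Hypothetical sketch of a referrer-blacklist check; keywords are examples only.
BANNED_REFERRER_WORDS = ["casino", "poker", "pharmacy"]

def redirect_for_bad_referrer(referrer):
    """Return the banned-page URL if the referrer looks like spam, else None."""
    ref = (referrer or "").lower()
    if any(word in ref for word in BANNED_REFERRER_WORDS):
        return "http://www.compoundtheory.com/banned.html"
    return None
```

A substring blacklist like this is crude, but referrer spam tends to reuse the same keywords, so it catches most of it with very little code.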
What about this: get a list of good bots' IPs and do a simple
if ip in good bots list: ignore
if not: captcha
Use the CAPTCHA, but allow the good spiders.
Basically:
if (captcha is good) or (spider in [good spiders])
allow the crawl
else
bye-bye
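That gate might look something like this (a minimal sketch; the good-spider IP list is a placeholder, and verifying a real crawler properly would take more than an IP match):

```python
# Placeholder set of known-good crawler IPs (example value only).
GOOD_SPIDERS = {"66.249.64.1"}

def allow_crawl(ip, captcha_passed):
    """Allow the request if the CAPTCHA was solved or the IP is a known good bot."""
    return captcha_passed or ip in GOOD_SPIDERS
```

The trade-off is keeping the good-spider list current: search engines publish their crawler ranges, but a stale list would start blocking legitimate bots.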