Lies, Damn Lies, and Referrers

February 6th, 2008

Looking through my log stats, I saw a sudden spike in people coming to my page by searching for “computers internet blog”. While I have pages that match those words, I can’t see how my pagerank for that search would send any hits my way, much less make it my leading search.

About half of the hits on my pages that don’t get automatically filtered out as robots have no referrers. I’d be proud if a dozen people in the world had an actual bookmark to this page, so I consider 99%+ of those to be bots I can’t identify. These visitors also tend to view the same number of pages as they have hits, which means they aren’t downloading the .css file or the picture of the moon up there, much less the little icons and other files that go into making a webpage. It is possible they are all reading my page with lynx or some similar text-only browser, but my browser stats don’t support that.
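The pages-versus-hits comparison can be sketched in a few lines of Python. This is only an illustration: it assumes the log is in Apache's common or combined format, and the function names and the list of asset extensions are made up for the example.

```python
import re
from collections import defaultdict

# Start of an Apache common/combined log line (assumed format):
# client address, identity, user, [timestamp], then the quoted request line.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|POST|HEAD) (\S+)[^"]*"')

# Hypothetical list of extensions that count as page furniture, not pages.
ASSET_EXTENSIONS = (".css", ".js", ".png", ".gif", ".jpg", ".jpeg", ".ico")


def pages_and_hits(lines):
    """Return {client: (pages, hits)} for an iterable of log lines."""
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        client, path = match.groups()
        path = path.split("?", 1)[0].lower()  # ignore any query string
        counts[client][1] += 1  # every request is a hit
        if not path.endswith(ASSET_EXTENSIONS):
            counts[client][0] += 1  # only non-asset requests are pages
    return {client: tuple(pair) for client, pair in counts.items()}


def text_only_clients(lines):
    """Clients whose every hit was a page, i.e. who never fetched an asset."""
    totals = pages_and_hits(lines)
    return sorted(c for c, (pages, hits) in totals.items() if pages == hits)
```

A client that shows up in `text_only_clients` is either a text-only browser or, more likely here, something that never renders the page at all.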

No, I expect half of the traffic that slips through the robot filter is robots. I like to know this, but it doesn’t bother me very much. It’s the referrers that I pay attention to. I get a list of all the pages with links that people clicked on to get to my page. Sometimes I can’t find the links, or they are hidden behind passwords, but it seems likely they are real people clicking on real links pointing to pages of mine.

The other thing of interest is the search phrases reported by Google and other search engines. I assume those are also real people searching for real things and ending up on one of my pages.

In comes the phrase “computers internet blog”. There is no way six people a day are coming to my page with that search, and I am right: a little Googling indicates the hits are essentially from a comment-posting bot. While researching, I came across a remark that a real Google referrer carries much more stuff, which is missing from these Google referrers. In my logs the offending referrer looks like this:

"http://www.google.com/search?q=computers+internet+blog"

An actual Google referrer looks like:

http://www.google.com/search?hl=en&q=computers+internet+blogs&btnG=Google+Search

And can sometimes have much more stuff. I grepped through my logs looking for similar patterns.

grep -E 'www\.google\.com/search\?q=[^&"]+"' access.log

There were more than just the “computers internet blog” searches. I had searches for “nylons”, “golf cart used parts”, “shipper”, etc. In other words, they were throwing off the statistics I trusted as human, and with more than one phrase. I considered my options and decided to give them the ax.
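To see which phrases the bots were using and how often, the same pattern can be turned into a tally. The following is a rough sketch rather than the script I actually ran; it assumes the referrer appears in double quotes in each log line, as in Apache's combined format, and the function name is invented for the example.

```python
import re
from collections import Counter
from urllib.parse import unquote_plus

# Same idea as the grep: a Google search referrer whose only parameter is q.
# The closing quote is the one that ends the referrer field in the log line.
BARE_GOOGLE_RE = re.compile(r'www\.google\.com/search\?q=([^&"]+)"')


def bare_google_queries(lines):
    """Count the search phrases found in bare q-only Google referrers."""
    counts = Counter()
    for line in lines:
        match = BARE_GOOGLE_RE.search(line)
        if match:
            # Turn "computers+internet+blog" back into readable words.
            counts[unquote_plus(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # "access.log" is the same file name used in the grep above.
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for phrase, hits in bare_google_queries(log).most_common():
            print(hits, phrase)
```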

Google works fine with the URL the spammer is using, but it won’t generate that URL itself. A sophisticated user who writes their own Google URLs might produce it, but for the moment I’m willing to consider anyone using that URL format a robot, mostly because I didn’t see any legitimate use of that construct in my logs (i.e. search terms for which I actually have pagerank).

I added the following to the directory section of my config file, though it would work just as well in the .htaccess file.

SetEnvIfNoCase Referer "www\.google\.com/search\?q=[^&]+$" spammer

# Bad bot, no cookie!
Order Allow,Deny
Allow from all
Deny from env=spammer

The exact placement will depend on your config file. Be very careful with this: if you mess up the regular expression, you may end up blocking people coming from Google, which doesn’t sound like a winning strategy.
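One way to lower that risk is to try the pattern against a few known referrers before touching the server. Here is a small sketch of that idea in Python. It assumes a pattern equivalent to the SetEnvIfNoCase rule, with the dots and the question mark escaped and the match anchored at the end of the header. Python's `re` module is not the regex engine Apache uses, so treat this as a sanity check only; the sample referrers and function names are made up for the example.

```python
import re

# Assumed equivalent of the referrer rule: a q-only Google search URL.
# re.IGNORECASE stands in for the "NoCase" part of SetEnvIfNoCase.
SPAM_REFERER_RE = re.compile(r"www\.google\.com/search\?q=[^&]+$", re.IGNORECASE)


def is_spam_referer(referer):
    """True if the Referer header value would be tagged as a spammer."""
    return SPAM_REFERER_RE.search(referer) is not None


# Referrer -> whether it should be blocked. Illustrative samples only.
SAMPLES = {
    "http://www.google.com/search?q=computers+internet+blog": True,
    "http://www.google.com/search?hl=en&q=computers+internet+blogs&btnG=Google+Search": False,
    "http://www.google.com/search?q=moon+pictures&hl=en": False,
    "http://example.com/links.html": False,
}


def mismatches(samples=SAMPLES):
    """Return the referrers the pattern classifies differently than expected."""
    return [ref for ref, expected in samples.items()
            if is_spam_referer(ref) != expected]
```

An empty list from `mismatches()` means the pattern behaves as intended on the samples. It is worth adding a few real referrers from your own logs to `SAMPLES` before trusting it.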

If you change the config file you’ll want to reload it:

/etc/init.d/apache2 reload

And don’t forget to test to make sure Google still works and the bad referrers are blocked.
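The live test can be done by sending requests with a forged Referer header and looking at the status code that comes back. This is a minimal sketch using only the Python standard library; the function names are invented for the example, and you would substitute your own site's address for the placeholder URL.

```python
import urllib.error
import urllib.request


def build_request(url, referer=None):
    """Build a GET request, optionally carrying a Referer header."""
    headers = {"Referer": referer} if referer else {}
    return urllib.request.Request(url, headers=headers)


def status_for(url, referer=None, timeout=10):
    """Return the HTTP status code the server gives for this referrer."""
    try:
        with urllib.request.urlopen(build_request(url, referer), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the code is what we are after.
        return err.code
```

With the rule in place, `status_for("http://example.com/", "http://www.google.com/search?q=computers+internet+blog")` should come back as 403, while the same call with the full Google referrer shown earlier, or with no referrer at all, should still come back as 200.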
