Microsoft flooding sites with fake traffic

Arc Riley arcriley at gmail.com
Thu Feb 21 09:18:29 EST 2008


>   What's your robots.txt look like?  Does it forbid this kind of behavior?


The IPs in question never accessed robots.txt.  Only MSNBot and family did,
from different IP blocks.

You will see the bad behavior if you log referers.  The bots in question
claim to be arriving on your site by various search terms via
search.live.com and claim to be a normal web browser (MSIE 6 or 7).
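
For anyone who wants to check their own logs, here is a rough sketch of the
kind of filter described above. It assumes Apache's "combined" log format
(which records the Referer and User-Agent headers); the function name and the
regular expression are illustrative, not something taken from my setup.

```python
import re
from collections import defaultdict

# Apache "combined" format (assumed):
# ip ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def suspicious_ips(lines):
    """Map IP -> hit count for clients that claim a search.live.com referer
    and an MSIE user agent but never requested /robots.txt."""
    fetched_robots = set()
    claimed_live = defaultdict(int)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        ip = m.group("ip")
        if m.group("path").split("?")[0] == "/robots.txt":
            fetched_robots.add(ip)
        if "search.live.com" in m.group("referer") and "MSIE" in m.group("agent"):
            claimed_live[ip] += 1
    return {ip: n for ip, n in claimed_live.items() if ip not in fetched_robots}
```

Genuine visitors arriving from a Live search will show up in this list too,
since real browsers do not fetch robots.txt either. The output is only a set
of candidates to compare against who owns the address blocks.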

If this were an honesty check of the kind Google does, verifying that the
pages sent to crawlers are the same as those sent to normal web browsers,
they wouldn't claim to be arriving via search.live.com.

Many of the hits were also to pages specifically forbidden to User-agent: *
by rules such as Disallow: /*?
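
To test whether a logged path falls under a wildcard rule like that one, a
small matcher along these lines should do. Note that "*" in a path is an
extension honored by the major crawlers rather than part of the original
robots.txt convention, and as far as I know Python's urllib.robotparser does
not expand it, hence the hand-rolled translation. The helper names here are
made up for illustration.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path rule using '*' (and an optional
    trailing '$') into a compiled regular expression."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))

def is_disallowed(url_path, disallow_rules):
    """True if url_path matches any non-empty Disallow rule."""
    return any(rule_to_regex(r).match(url_path) for r in disallow_rules if r)

rules = ["/*?"]  # the rule quoted above: any URL with a query string
print(is_disallowed("/index.php?page=2", rules))
print(is_disallowed("/about.html", rules))
```

This only covers Disallow lines. A full check would also have to handle
Allow lines and per-agent sections.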

My logs show that 95.7% of our traffic via search engines is from Google.
MSN, once you take all the hits from Microsoft's networks out of the
equation, accounts for only 0.3%.


>  What's the rate like, in requests/time and bytes/time?  Are they
> flooding your site, or slowly crawling it over time?


It changes per day, but the day before yesterday 28.3% of the page hits were
from those subnets.  The pages they seemed to be targeting were some of the
most CPU-intensive.  Apache was using roughly 60% of the CPU, which dropped
to 2% when Microsoft was firewalled.  The server in question is an Athlon XP
2200+ with a gig of RAM.
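
A percentage like that can be worked out from the client IPs in the access
log with something like the sketch below. The networks listed are
placeholders from the reserved documentation ranges, not the actual subnets
involved; substitute whatever blocks show up in your own logs.

```python
import ipaddress

# Placeholder networks (RFC 5737 documentation ranges), NOT the real subnets.
SUSPECT_NETS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def subnet_share(client_ips, networks=SUSPECT_NETS):
    """Return the percentage of hits whose client IP falls in any network."""
    total = 0
    matched = 0
    for raw in client_ips:
        try:
            addr = ipaddress.ip_address(raw)
        except ValueError:
            continue  # ignore hostnames or malformed entries
        total += 1
        if any(addr in net for net in networks):
            matched += 1
    return 100.0 * matched / total if total else 0.0

hits = ["192.0.2.1", "203.0.113.5", "198.51.100.7", "203.0.113.9"]
print(f"{subnet_share(hits):.1f}% of hits came from the listed subnets")
```

This counts every logged request equally. Weighting by bytes sent or by
response time would give a better picture of the load, but needs those
fields in the log.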


> ... please join me in blocking them ...
>
>  You block Google from indexing your site, too, then, right?


This is not a cry against crawlers.  This is a cry against deception and
bots behaving badly.



On Thu, Feb 21, 2008 at 9:02 AM, Coleman Kane <cokane at cokane.org> wrote:

>
> It's also not necessarily out of the realm of reality that their indexing
> algorithm is trying to find single keyword results. Maybe they perform
> the union/intersection of multiple search terms on their end.


That could be true if the reported search terms had anything to do with the
content on the sites.  I could not find a single instance on our site of any
of the search terms from the earlier searches, many of which were
pornographic or sexual in nature.

The bot that generates thumbnails of the sites and grabs images is
msnbot-media.