Microsoft flooding sites with fake traffic
Ben Scott
dragonhawk at gmail.com
Wed Feb 20 21:34:53 EST 2008
On Wed, Feb 20, 2008 at 4:46 PM, Arc Riley <arcriley at gmail.com> wrote:
> Do yourselves a favor and search your logs for connections from 131.107.*
> 65.52.* 65.53.* 65.54.* and 65.55.*
On the GNHLUG web server in /var/log/httpd/ ...
liberty$ find . -name 'access_log*' | xargs egrep '^(131\.107|65\.5[2-5])' | wc -l
14293
liberty$ find . -name 'access_log*' | xargs cat | wc -l
185492
We keep logs going back a month, rotated weekly.
We don't log referrers or user agents on the GNHLUG server. Maybe we should.
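For the record, Apache's stock "combined" log format is the usual way to capture those two headers; something along these lines in httpd.conf would do it (path is our layout, adjust to taste):

```apache
# The standard "combined" format: common log format plus Referer and User-Agent
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog /var/log/httpd/access_log combined
```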
> All of it, well 97.2%, from the above two subnets, belonging to Microsoft.
Interesting. And a relatively small number of unique hosts (152),
given that there are five /16's in question (327680 addresses, give or take).
liberty$ find . -name 'access_log*' | xargs egrep -h '^(131\.107|65\.5[2-5])' | awk '{ print $1 }' | sort -u > /tmp/hostlist
liberty$ wc -l /tmp/hostlist
152 /tmp/hostlist
The IP addresses reverse-resolve to a few different things. The
matching regexp patterns would be:
NXDOMAIN
tide[0-9]+\.microsoft\.com
b[ly]1sch[0-9]+\.phx\.gbl
livebot-65-55-[0-9]+-[0-9]+\.search\.live\.com
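If anyone wants to bucket their own reverse lookups against those patterns, a quick shell sketch like this would do it (the case globs approximate the regexps above; the sample hostnames in the calls are invented):

```shell
# Bucket a resolved hostname by the reverse-DNS patterns seen in our logs.
# Globs approximate the regexps; an empty argument stands in for NXDOMAIN.
classify() {
  case "$1" in
    tide[0-9]*.microsoft.com)        echo microsoft ;;
    b[ly]1sch[0-9]*.phx.gbl)         echo gbl ;;
    livebot-65-55-*.search.live.com) echo livebot ;;
    '')                              echo NXDOMAIN ;;
    *)                               echo other ;;
  esac
}
classify tide42.microsoft.com                 # prints: microsoft
classify livebot-65-55-12-34.search.live.com  # prints: livebot
```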
I also took a look at unique URLs:
liberty$ find . -name 'access_log*' | xargs egrep -h '^(131\.107|65\.5[2-5])' | awk -F\" '{ print $2 }' | awk '{ print $2 }' | sort -u > /tmp/urls
liberty$ wc -l /tmp/urls
7237 /tmp/urls
The URLs themselves... hmmm, hard to know for sure with our site,
but it looks to me like something is walking the entire site,
following every link, including TWiki's search, history, and edit
links. Think "wget -r".
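A quick way to test that theory would be counting how many of the bot requests hit the TWiki action URLs. A sketch (the edit/search/rdiff script names are an assumption about TWiki's URL layout, and the sample log lines in the demo are invented):

```shell
# Count requests from the MS ranges whose URL hits a TWiki action script.
# Reads access-log lines on stdin; script names are assumptions.
count_twiki_hits() {
  egrep '^(131\.107|65\.5[2-5])' \
    | awk -F'"' '{ print $2 }' | awk '{ print $2 }' \
    | egrep -c '/(edit|search|rdiff)'
}
# Real use:  cat /var/log/httpd/access_log* | count_twiki_hits
# Demo on invented lines:
printf '%s\n' \
  '65.55.1.2 - - [20/Feb/2008:16:00:01 -0500] "GET /twiki/bin/edit/Main HTTP/1.1" 200 512' \
  '10.0.0.1 - - [20/Feb/2008:16:00:02 -0500] "GET /index.html HTTP/1.1" 200 1024' \
  | count_twiki_hits   # prints: 1
```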
> I can't tell if MS is trying to skew the statistics in favor of MSIE/Live/etc
> or if it's conducting a denial of service attack against free software
> project sites ...
Those two both seem rather unlikely. In particular, remember
Hanlon's razor. My guess is some kind of crawler robot. Possibly a
malfunctioning and/or poorly-designed one. (That was my guess before
I started digging into logs, by the way. I also guessed maybe a
botnet, but the small number of requesting hosts makes that less
likely.)
Our numbers show that as roughly 8% of our traffic, by object
request count. In the distant past, when we did some traffic
analysis, the bulk of the traffic hitting the GNHLUG site was crawler
robots. So that doesn't seem out of line.
What's your robots.txt look like? Does it forbid this kind of behavior?
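For comparison, a robots.txt along these lines would tell well-behaved crawlers to skip the TWiki action URLs (the /twiki/bin/ paths are an assumption about your site layout; msnbot is the name Microsoft's crawler announces, and it's supposed to honor Crawl-delay):

```
User-agent: *
Disallow: /twiki/bin/edit/
Disallow: /twiki/bin/search/
Disallow: /twiki/bin/rdiff/

User-agent: msnbot
Crawl-delay: 10
```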
What's the rate like, in requests/time and bytes/time? Are they
flooding your site, or slowly crawling it over time?
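Here's a sketch for answering the rate question from an Apache-style access log. It assumes the stock "[20/Feb/2008:21:34:53 -0500]" timestamp, whose first 14 characters are the day and hour; the sample lines in the demo are invented:

```shell
# Requests per hour from the matched lines; assumes stock Apache timestamps,
# where characters 1-14 after the '[' are "day/Mon/year:hour".
per_hour() {
  egrep '^(131\.107|65\.5[2-5])' \
    | awk -F'[' '{ print substr($2, 1, 14) }' \
    | sort | uniq -c | sort -rn
}
# Real use:  cat /var/log/httpd/access_log* | per_hour | head
# Demo on invented lines:
printf '%s\n' \
  '65.55.1.2 - - [20/Feb/2008:21:00:01 -0500] "GET / HTTP/1.1" 200 1' \
  '65.55.1.3 - - [20/Feb/2008:21:05:09 -0500] "GET / HTTP/1.1" 200 1' \
  | per_hour   # prints:  2 20/Feb/2008:21  (with leading spaces from uniq -c)
```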
> ... please join me in blocking them ...
You block Google from indexing your site, too, then, right?
-- Ben