Here's what happened. Within a couple of days of the new FTP server going on the air, the daily reports started showing an intense amount of activity from IP address 66.249.64.23. The activity took the form of someone (or, in this case, someTHING) logging into the server as 'anonymous,' waiting one or two seconds, and then simply logging right back out again.
Normally, I wouldn't have looked twice at such a thing. However, this particular login was repeating itself every few SECONDS, 24 hours a day, seven days a week.
After a week of this, I started looking in further detail, starting with a reverse DNS lookup on the IP. Imagine my surprise when I found this...
Canonical name: crawl-66-249-64-23.googlebot.com
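For anyone who wants to repeat that check, here is a rough Python sketch of a reverse lookup plus a forward confirmation. The function names are made up for illustration, and the two lookup helpers need a working resolver, so treat this as a starting point rather than a finished tool.

```python
import socket


def is_googlebot_hostname(hostname: str) -> bool:
    """True if the name sits under googlebot.com (checks the name only)."""
    name = hostname.rstrip(".").lower()
    return name == "googlebot.com" or name.endswith(".googlebot.com")


def reverse_lookup(ip: str) -> str:
    """Return the PTR name for an address; raises socket.herror if none exists."""
    hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
    return hostname


def forward_confirms(hostname: str, ip: str) -> bool:
    """True if the name resolves back to the same address.

    A PTR record is controlled by whoever owns the address block, so the
    forward lookup is what guards against a spoofed reverse entry.
    """
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return ip in addresses
```

A typical use would be to call `reverse_lookup()` on the offending address, pass the result to `is_googlebot_hostname()`, and only trust the answer if `forward_confirms()` also returns True.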
My initial assumption, giving Google the benefit of the doubt, was that they had suffered a breakdown of one of their crawl-bots, and that it was 'bouncing' the server login instead of actually crawling the directories. With that in mind, I used FreeBSD's built-in IPFilter to block that particular address (which, after the first week, had racked up a little over 1,400 login attempts).
I then wrote a note to Google, advising them that their 'bot appeared broken to the point of being abusive, and that I had blocked the single IP address as a result. I said that I considered the activity to be network abuse, and politely requested that they prevent their 'bot from returning until it was better behaved.
I assumed (foolishly, in retrospect) that this would be the end of it. Not so! A day later, I got back a boilerplate reply from Google, thanking me for my concern over the activities of their crawl-bots, and advising me that I could block them with a 'robots.txt' file on my site.
That's all well and good, except for one thing: FTP sites are not web sites. FTP sites don't use 'robots.txt.' And, to make matters worse, the abusive behavior resumed -- from a different IP address in the same range!
By now, as you might imagine, I was getting just a bit torqued off. A little research with publicly available 'whois' records gave me the troublesome IP range as shown below.
Trying 66.249.64.23 at ARIN
Trying 66.249.64 at ARIN
OrgName: Google Inc.
Address: 1600 Amphitheatre Parkway
City: Mountain View
NetRange: 66.249.64.0 - 66.249.95.255
NetType: Direct Allocation
OrgTechName: Google Inc.
The gloves came off at this point. I sent Google a note that was somewhat less than polite, demanding that they cease their abuse of my server and informing them that I was now blocking their entire /19 subnet.
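To see what a /19 actually covers, the range can be worked out with Python's standard `ipaddress` module. This is only a sketch: the starting address is inferred from the crawl-66-249-64-23 hostname above, and the second test address is a made-up example, not one taken from the logs.

```python
import ipaddress

# Assumption: host address inferred from the PTR name
# crawl-66-249-64-23.googlebot.com shown earlier.
# strict=False accepts a host address and returns its enclosing /19.
block = ipaddress.ip_network("66.249.64.23/19", strict=False)

first_address = block[0]    # lowest address in the block
last_address = block[-1]    # highest address in the block


def in_block(address: str) -> bool:
    """True if the address falls inside the blocked /19."""
    return ipaddress.ip_address(address) in block


print(block, first_address, last_address, block.num_addresses)
# Hypothetical "different IP address in the same range":
print(in_block("66.249.70.5"))
```

A /19 leaves 13 host bits, so the block holds 8,192 addresses, which is why blocking one address at a time was never going to work.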
Guess what I got back? A near-exact copy of the same stupid boilerplate, acknowledging my "concerns" and advising me to use a "robots.txt" file.
The abuse continued, though it failed to reach any of our servers. I was upset enough that I not only blocked the /19 in the FTP server's IPF rules file, I also blocked it at our border router (which kept it out of our web server) AND in the IPF rules on both of our DNS servers.
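When the same block has to go onto several machines, it helps to generate the rule text rather than retype it. The sketch below renders a rule in the same shape as the one that appears in the IPF statistics further down; the helper name is hypothetical and the `fxp1` default is simply the interface named in those statistics, so adjust it for each host.

```python
import ipaddress


def ipf_block_rule(address: str, interface: str = "fxp1") -> str:
    """Render an inbound IPF block rule for the network containing `address`."""
    # strict=False normalizes a host address such as 66.249.64.23/19
    # to its network address before the rule is written out.
    network = ipaddress.ip_network(address, strict=False)
    return f"block in quick on {interface} from {network} to any"


print(ipf_block_rule("66.249.64.23/19"))
# prints: block in quick on fxp1 from 66.249.64.0/19 to any
```

Passing the address through `ipaddress` first means a typo in the prefix raises an error in the script instead of ending up as a bad rule in the filter.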
I continued to monitor the situation to see how their crawl-bot would respond to hitting an electronic brick wall. I assumed (again, foolishly) that it would try a few times, and then simply give up and move on to open networks.
I could not have been more wrong. Much to my surprise, the 'bot continued to batter itself against our newly installed shielding, if anything even more frequently. I let this go on for about a month, at which point the IPF statistics showed the following (the timestamp is in PDT).
gutenberg.bluefeathertech.com ipf denied packets:
+++ /tmp/security.x37Xh47R Wed Oct 11 03:01:12 2006
+15161 @2 block in quick on fxp1 from 66.249.64.0/19 to any
So, in a four-week period stretching back from Oct. 11th, 2006, there were 15,161 attempts by one or more Google crawl-bots to break through the filter. This breaks down to roughly 3,790 attempts per week (15,161 / 4).
Breaking that down further gives us about 541 attempts per DAY (3,790 / 7). That's a lot of attempts, no matter how you slice it.
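The same arithmetic, carried one step further, can be checked with a few lines of Python. The only inputs are the 15,161 denied packets and the four-week window quoted above; treating the traffic as evenly spread is an assumption, since the counter gives a total and says nothing about bursts.

```python
DENIED_PACKETS = 15_161   # counter from the IPF report
WEEKS = 4                 # length of the observation window

per_week = DENIED_PACKETS / WEEKS
per_day = per_week / 7
per_hour = per_day / 24

# Average spacing between denied packets over the whole window.
seconds_in_window = WEEKS * 7 * 24 * 60 * 60
seconds_between = seconds_in_window / DENIED_PACKETS

print(f"per week: {per_week:.0f}")
print(f"per day:  {per_day:.0f}")
print(f"per hour: {per_hour:.1f}")
print(f"average gap: {seconds_between:.0f} seconds")
```

On average that works out to one blocked packet roughly every 160 seconds, around the clock.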
Once that month had gone by, I wrote to Google again to see if they had changed their tune. This time, I accused them of being on the borderline of a DoS attack. The initial response was the same dumb boilerplate I'd gotten before, about 'robots.txt,' so I threw it back at them along with an explanation that FTP servers did not use such files, and that they could either stop attacking our network or risk a whole slug of bad publicity.
This, apparently, was enough to convince them to knock it off, at least where our FTP server is concerned. I got back a personally written reply, apologizing for the hassle and saying that they had excluded our FTP server, both by IP address and by DNS name, from their crawler's 'To Do' list.
Sure enough, the onslaught on the FTP side stopped that same day, and it has not returned.
HOWEVER -- The assault continues to this day, completely unabated, on our web server and border router. Here are the stats, fresh from this morning (note the timestamp), from our DNS servers.
ns0.bluefeathertech.com ipf denied packets:
+++ /tmp/security.h7Wac85G Sun Oct 22 03:01:01 2006
+230 @2 block in quick on fxp1 from 66.249.64.0/19 to any
ns1.bluefeathertech.com ipf denied packets:
+++ /tmp/security.ZRZDqCQI Sun Oct 22 03:01:01 2006
+218 @2 block in quick on fxp1 from 66.249.64.0/19 to any
These reports run every 24 hours, and I never told Google to stay away from our web server, so I think it is safe to assume that their crawler is going to continue to try at least 200-some-odd times per day to hit us. I also continue to see hundreds of attempts in our border router's 'Deny' log.
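Since these reports arrive daily from more than one machine, a small script can pull out the counters and add them up. The sketch below is keyed to the report layout shown above. The regular expression and function name are illustrative, the number after the `@` is assumed to be the rule's position in the rule list, and the source address in the sample is assumed to be 66.249.64.0/19.

```python
import re

# Matches counter lines such as "+230 @2 block in quick on fxp1 from ... to any".
# The "+++ /tmp/..." header line is skipped because a digit must follow the "+".
COUNTER_LINE = re.compile(r"^\+(\d+)\s+@(\d+)\s+(.*)$")


def denied_counts(report: str):
    """Return (count, rule_number, rule_text) for each counter line in a report."""
    rows = []
    for line in report.splitlines():
        match = COUNTER_LINE.match(line.strip())
        if match:
            rows.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return rows


SAMPLE = """\
ns0.bluefeathertech.com ipf denied packets:
+++ /tmp/security.h7Wac85G Sun Oct 22 03:01:01 2006
+230 @2 block in quick on fxp1 from 66.249.64.0/19 to any
ns1.bluefeathertech.com ipf denied packets:
+++ /tmp/security.ZRZDqCQI Sun Oct 22 03:01:01 2006
+218 @2 block in quick on fxp1 from 66.249.64.0/19 to any
"""

total = sum(count for count, _rule_number, _rule_text in denied_counts(SAMPLE))
print(total)  # 230 + 218 = 448
```

Tracking that total from day to day is an easy way to notice whether the crawler ever backs off.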
Suffice it to say that Google's "don't be evil" claim should be taken with a bag of rock salt. I would urge anyone who runs a web or FTP server to closely monitor it for abusive amounts of activity coming from Google's crawler subnet, and simply block it out if you don't want them around. However, I am going to lift our block for the web side, for at least a week or two, so that this page will be crawled and indexed for all to find.
REMEMBER -- Your servers, your network, your bandwidth, your rules! Do NOT, under ANY conditions, feel obligated to subject your equipment to Google's abusive crawling, no matter what kind of spin they may put on it!
(Last Update: 22-Oct-06)