Web Miners vs Web Masters – An Uneasy Truce

November 11, 2009

The life of a webmaster is hard, and web crawlers make it harder


There’s the daily drama of keeping both web site users and web site developers happy. Now mix in the unpredictable side effects of having automated agents hitting the site, and you can see why webmasters might think many web crawlers are evil.

But web crawlers serve a very real, important role in the life of a successful site, and it’s all about traffic. Without search engines like Google and Yahoo/Bing, most sites would be invisible to most users.

Implicit Contracts

An unwritten agreement exists between webmasters and web crawlers, and it reads something like this: you don’t overload my site, and you bring traffic my way. In return, I’ll give you free access to lots of valuable content that I host.

And that’s worked reasonably well for the past 15 years. Yes, there are crawlers that ignore the Robots Exclusion Standard. And there are crawlers that overload a site by hammering it with lots of simultaneous requests for hours on end. And sometimes a crawler goes a little crazy and spends hours trying to fetch non-existent pages, using bogus URLs it incorrectly derived from content on the site’s pages. For the most part, though, web crawlers try to do the Right Thing, and webmasters can always block rogue crawlers by IP address.

Web Mining != Search Index

But now you’ve got web miners – automated agents that collect data which often doesn’t wind up in a search index. And that means no traffic from searches. And thus the implicit contract has been broken.

It hasn’t happened yet, but I can see a day when many sites set up their robots.txt to allow the major search engines access, and then block everybody else.
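
For reference, the robots.txt for that world is only a few lines: one record per search engine you want to let in, followed by a catch-all record that shuts out everyone else. A rough sketch, using the user-agent tokens the major engines have published (check each engine’s documentation for the current names):

    User-agent: Googlebot
    Disallow:

    User-agent: Slurp
    Disallow:

    User-agent: msnbot
    Disallow:

    User-agent: *
    Disallow: /

An empty Disallow line means “nothing is off limits” for that crawler; the final wildcard record blocks everybody who didn’t match one of the named records.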

What does this mean for the web ecosystem? Three things, one for each participant:

  1. Web miners need to crawl extra-super-politely.
  2. Customers need to work with key sites to pick good crawl times.
  3. Web sites need to offer for-fee APIs for data mining.

The first point is the easiest one to solve – never hit a site with more than one simultaneous request, never fetch more than a handful of pages a minute, and respect all robots.txt restrictions.
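
As a rough sketch of what “extra-super-polite” looks like in practice, here’s a minimal fetch loop in Python: one request at a time, a robots.txt check before every URL, and a long pause between fetches. The bot name and the 15-second delay are placeholders, not recommendations.

    import time
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "ExampleMinerBot/1.0"   # placeholder crawler name
    CRAWL_DELAY = 15                     # seconds between fetches: a few pages a minute

    def polite_crawl(site_root, urls):
        # Load the site's robots.txt and honor it for every request.
        robots = urllib.robotparser.RobotFileParser()
        robots.set_url(site_root.rstrip("/") + "/robots.txt")
        robots.read()

        pages = []
        for url in urls:                             # strictly one request at a time
            if not robots.can_fetch(USER_AGENT, url):
                continue                             # skip anything that's disallowed
            request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(request) as response:
                pages.append(response.read())
            time.sleep(CRAWL_DELAY)                  # pause before the next fetch
        return pages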

The second is a bit harder, as it currently requires person-to-person contact with the web site in question. It’s possible to approximate these “good crawl times” by varying the request rate based on how quickly the site responds, so there are work-arounds. But eventually I expect to see an extension to robots.txt that lets the site owner give web crawlers additional clues about good and bad times for crawling.
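
Until something like that exists, the work-around amounts to self-throttling: watch how long the server takes to respond, and back off when it slows down. A minimal sketch of that idea, where the threshold and the backoff cap are made-up numbers:

    import time
    import urllib.request

    def crawl_with_backoff(urls, base_delay=15.0, slow_threshold=2.0):
        # Stretch the delay whenever the server responds slowly (a rough proxy
        # for "this is a bad time to crawl"), and relax it as responses recover.
        delay = base_delay
        for url in urls:
            start = time.time()
            with urllib.request.urlopen(url) as response:
                response.read()
            elapsed = time.time() - start
            if elapsed > slow_threshold:
                delay = min(delay * 2, 300.0)        # back off, capped at 5 minutes
            else:
                delay = max(delay / 2, base_delay)   # ease back toward the base rate
            time.sleep(delay)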

The last point, about providing APIs, is the most long-term but also the most powerful. There are many web APIs out there, some of which provide access to valuable web data, but few offer a pay-to-play model. Most are rate-limited, so you need to cut special deals if you exceed some relatively low daily threshold. Many have serious terms-of-use restrictions that limit a caller’s ability to actually mine the response data; often the only option is to republish it, with links/attribution back to the originating site.

What would be great is if everybody had a model like Amazon’s AWIS (Alexa Web Information Service), where X requests cost N dollars. You can decide how much or how little to spend. There aren’t many restrictions on rate, volume, or usage. And as a huge added bonus, the data comes back structured, so you don’t have to waste time hand-crafting fragile, error-prone HTML scraping code.
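
The difference matters more than it sounds. Here’s a contrived illustration (both the API response and the HTML snippet are invented): the same fact read from a structured response versus scraped out of markup. The first breaks only if the documented field goes away; the second breaks whenever someone touches the page layout.

    import json
    import re

    # Hypothetical structured API response: stable, documented fields.
    api_response = '{"site": "example.com", "rank": 1234, "links_in": 567}'
    rank = json.loads(api_response)["rank"]

    # The same fact scraped out of HTML: fragile and markup-dependent.
    html = '<td class="rank">1,234</td>'
    match = re.search(r'class="rank">([\d,]+)<', html)
    rank_scraped = int(match.group(1).replace(",", "")) if match else None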

And a side note to companies thinking about the API issue: if you don’t provide one, and you block web miners, then you’ll get crawled anyway, in stealth mode, by less scrupulous firms. Then everybody loses, since you’ll still be giving away free access while taking a performance hit, and the companies that need the data will pay more to these “stealth crawlers” and get worse results.
