Public web crawler projects

December 2, 2009

Several people have pointed me to other public/non-profit projects doing large-scale web crawls, so I thought I’d summarize the ones I now know about below. If you know of others, please add a comment and I’ll update the list.

  • Wayback Machine – Time-series snapshots of important web pages, from 1996 to the present, with roughly 150 billion pages crawled as of 2009. The data is searchable, but not available in raw format AFAIK. It’s a project of the Internet Archive, and uses Heritrix for crawling.
  • CDL Web Archiving Service – The California Digital Library provides the Web Archiving Service to enable librarians and scholars to create archives of captured web sites and publications. Similar to the Wayback Machine, they use Heritrix and other software from the Internet Archive, and the results are searchable but not available in raw format.
  • CommonCrawl – Their goal is to build, maintain, and make widely available a comprehensive crawl of the Internet. They use Nutch (useragent is ccBot). I’ve seen Ahad Rana post to the Nutch list. So far I haven’t seen any actual search or raw data results from this project, but they do have a cool public “crawl stats” page.
  • UK Web Archive – A “Wayback Machine” for UK web sites, provided by the British Library. Searchable, but no raw data that I can see. They in turn sponsor the Web Curator Tool, an open-source workflow management application for selective web archiving (a driver for Heritrix).
  • Isara Search – A project sponsored by Isara Charity Organization to build the world’s first non-profit search engine. Based in Thailand, using Nutch. No search/data available yet.
  • ClueWeb09 – The ClueWeb09 dataset was created by the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The dataset consists of 1 billion web pages, in ten languages, collected in January and February 2009. The data is available to researchers who sign a legal agreement and pay $750 for the hard disks needed to store the data.
  • WebBase – The Stanford WebBase project has been collecting topic-focused snapshots of web sites. All the resulting archives are available to the public via fast download streams. The useragent is WebVac (was Pita); see the sketch after this list. There’s also a web GUI for fetching specific crawl sets.
  • Laboratory for Web Algorithmics – Uses UbiCrawler to create large-scale link graph datasets that can be freely downloaded.
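
Since several of these crawlers are identified mainly by their useragent token, here’s a minimal sketch of how a site owner might check which of them a robots.txt file would let fetch a given page. It uses Python’s standard-library robots.txt parser (urllib.robotparser in Python 3); the useragent strings come from the list above, and example.com is just a placeholder site.

    # Minimal sketch: ask a site's robots.txt which of the crawlers above may fetch a page.
    # The useragent tokens come from the list above; example.com is a placeholder.
    from urllib.robotparser import RobotFileParser

    CRAWLER_USERAGENTS = ["ccBot", "WebVac", "Heritrix"]

    def allowed(useragent, url, robots_url="http://example.com/robots.txt"):
        """Fetch robots.txt and report whether `useragent` may crawl `url`."""
        parser = RobotFileParser()
        parser.set_url(robots_url)
        parser.read()  # downloads and parses the robots.txt file
        return parser.can_fetch(useragent, url)

    for ua in CRAWLER_USERAGENTS:
        print(ua, allowed(ua, "http://example.com/some/page.html"))

The same check works from the crawler’s side, and is roughly what Nutch and Heritrix do before fetching a URL.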

One Response to “Public web crawler projects”

  1. There are also the Web As Corpus datasets, which are available upon request. They are crawls of the .uk domain, as well as .de, .jp, and a few others IIRC. Some of them have been dependency parsed.