Focused web crawling

June 18, 2010

Recently, some customers have been asking for a more concrete description of how we handle “focused web crawling” at Bixo Labs. After answering the same questions a few times, it seemed like a good idea to post the details on our web site – thus the new page titled Focused Crawling. The basic concepts are straightforward, and very similar to what we did at Krugle to efficiently find web pages that more…

Public web crawler projects

December 2, 2009

Several people have pointed me to other public/non-profit projects doing large-scale public web crawls, so I thought I’d summarize the ones I now know about below. And if you know of others, please add your comments and I’ll update the list.

Wayback Machine – A time-series snapshot of important web pages, from 1996 to present. 150B pages crawled in total as of 2009. The data is searchable, but not available more…