Flink-based Web Crawler Talk at Flink Forward 2018

February 19, 2018

On April 10th, at 11am, I’ll be presenting a talk at this year’s Flink Forward conference in San Francisco. What’s it about? My talk tries to answer the question “Is it possible to build an efficient, focused web crawler using Apache Flink?” It’s actually a bit deeper than that – the challenge I set was whether this could be done using ONLY Flink, without adding any additional infrastructure. Which took more…

Focused web crawling

June 18, 2010

Recently some customers have been asking for a more concrete description of how we handle “focused web crawling” at Bixo Labs. After answering the same questions a few times, it seemed like a good idea to post details to our web site – thus the new page titled Focused Crawling. The basic concepts are straightforward, and very similar to what we did at Krugle to efficiently find web pages that more…

Public web crawler projects

December 2, 2009

Several people have pointed me to other public/non-profit projects doing large-scale public web crawls, so I thought I’d summarize the ones I now know about below. And if you know of others, please add your comments and I’ll update the list. Wayback Machine – A time-series snapshot of important web pages, from 1996 to present. 150B pages crawled in total as of 2009. The data is searchable, but not available more…