Flink-based Web Crawler Talk at Flink Forward 2018

February 19, 2018

On April 10th, at 11am, I’ll be presenting a talk at this year’s Flink Forward conference in San Francisco.

What’s it about? My talk tries to answer the question “Is it possible to build an efficient, focused web crawler using Apache Flink?” It’s actually a bit deeper than that – the challenge I set was whether this could be done using ONLY Flink, without any additional infrastructure.

That took us down some interesting rabbit holes, including fun with iterations, async functions, custom state management, and beating the crap out of the Common Crawl datasets hosted by AWS on S3.

I hope to see some of you there, and I promise to make the talk short, entertaining and informative. As opposed to this diagram of the crawler’s workflow…

flink-crawler DAG
