Hadoop User Group Meetup Talk
Last night I did a presentation at the April Hadoop Bay Area User Group meetup, hosted by Yahoo. 250+ people in attendance, so the interest in Hadoop continues to grow.
My talk was on the status of the Public Terabyte Dataset (PTD) project, along with advice on running jobs in Amazon’s Elastic MapReduce (EMR) cloud. As part of the PTD architecture, we wound up using Amazon’s SimpleDB to store the crawl DB, so one section of my talk covered what we learned about using SimpleDB to save persistent data (crawl state) efficiently and inexpensively while still using EMR for bursty processing. I’d previously blogged about our SimpleDB tap & scheme for Cascading, and our use of it for PTD has helped shake out some bugs.
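The crawl-state pattern above can be sketched in a few lines. This is only an illustration of the idea, not our actual code: a plain dict stands in for SimpleDB, and the `status` attribute and function names are hypothetical. The point is that crawl state persists cheaply between runs, so an EMR cluster can be spun up for a batch of work and shut down afterward without losing progress.

```python
# Sketch of the crawl DB pattern: persistent crawl state in a cheap
# key-value store (SimpleDB in the talk), bursty processing on EMR.
# A dict stands in for SimpleDB; the "status" attribute is hypothetical.

def select_unfetched(crawl_db, limit):
    """Return up to `limit` URLs whose status is 'unfetched'
    (SimpleDB would do this with a SELECT over an attribute)."""
    return [url for url, attrs in crawl_db.items()
            if attrs["status"] == "unfetched"][:limit]

def mark_fetched(crawl_db, urls):
    """Write updated crawl state back after a processing run,
    so the cluster can be shut down without losing progress."""
    for url in urls:
        crawl_db[url]["status"] = "fetched"

crawl_db = {
    "http://example.com/a": {"status": "unfetched"},
    "http://example.com/b": {"status": "fetched"},
    "http://example.com/c": {"status": "unfetched"},
}

batch = select_unfetched(crawl_db, limit=10)
mark_fetched(crawl_db, batch)
```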
We also decided to use Apache Avro for our output format. This meant creating a Cascading scheme, which would have been pretty painful but for the fortuitous, just-in-time release of Hadoop MapReduce support code in the Avro project (thanks to Doug & Scott for that). Vivek mentioned this new project in his recent blog post about our first release of PTD data, and we’re looking forward to others using it to read and write Avro files.
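For readers unfamiliar with Avro, records are described by a JSON schema, which is what a Cascading scheme would map tuple fields onto. The fragment below is a made-up example for a crawled-page record; the record name and fields are hypothetical, not the actual PTD schema.

```json
{
  "type": "record",
  "name": "PageRecord",
  "namespace": "example.ptd",
  "fields": [
    {"name": "url",      "type": "string"},
    {"name": "charset",  "type": "string"},
    {"name": "mimeType", "type": "string"},
    {"name": "content",  "type": "bytes"}
  ]
}
```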
The real-world use case I described in my talk was analyzing the quality of Tika’s charset detection, using HTML data from our initial crawl dataset. The results showed plenty of room for improvement.
The real point of this use case wasn’t to point out problems with Tika, but rather to demonstrate how easy it is to use the dataset for this type of analysis. That also makes it easy to compare alternative algorithms, and to improve Tika’s support with a dataset large enough to inspire confidence in the end results.
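The shape of that analysis is simple enough to sketch. Assume you have pairs of (declared charset, detected charset) — in the talk, the detected values came from running Tika’s charset detection over the crawled HTML — and you tally how often detection agrees with the page’s declared encoding, per charset. The sample data below is made up for illustration.

```python
from collections import defaultdict

def charset_accuracy(pairs):
    """Per-charset agreement rate between declared and detected
    encodings, given (declared, detected) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for declared, detected in pairs:
        totals[declared] += 1
        if detected.lower() == declared.lower():
            hits[declared] += 1
    return {cs: hits[cs] / totals[cs] for cs in totals}

# Made-up sample; real input would come from a MapReduce job
# over the crawl data.
sample = [
    ("utf-8", "UTF-8"),
    ("utf-8", "UTF-8"),
    ("iso-8859-1", "windows-1252"),  # a classic near-miss
    ("shift_jis", "Shift_JIS"),
]

print(charset_accuracy(sample))
```

With the full dataset, the same tally run per charset is what surfaces which encodings the detector handles poorly.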
As an aside, Ted Dunning might be using this data & Mahout to train a better charset and/or language classifier, which would be a really nice addition to the Tika project. The same could obviously be done for language detection, which currently suffers from similar accuracy issues, as well as being a CPU cycle hog.