Hadoop User Group Meetup Talk

April 22, 2010

Last night I gave a presentation at the April Hadoop Bay Area User Group meetup, hosted by Yahoo. More than 250 people attended, so interest in Hadoop continues to grow.

Dekel has posted the slides of my talk, as well as a (very quiet) video.

My talk covered the status of the Public Terabyte Dataset (PTD) project and advice on running jobs in Amazon’s Elastic MapReduce (EMR) cloud. As part of the PTD architecture, we wound up using Amazon’s SimpleDB to store the crawl DB, so one section of my talk covered what we learned about using it to efficiently and inexpensively save persistent data (crawl state) while still using EMR for bursty processing. I’d previously blogged about our SimpleDB tap & scheme for Cascading, and our use of it for PTD has helped shake out some bugs.

We also decided to use Apache Avro for our output format. This meant creating a Cascading scheme, which would have been pretty painful but for the fortuitous, just-in-time release of Hadoop MapReduce support code in the Avro project (thanks to Doug & Scott for that). Vivek mentioned this new project in his recent blog post about our first release of PTD data, and we’re looking forward to others using it to read and write Avro files.

The real-world use case I described in my talk was analyzing the quality of the Tika charset detection, using HTML data from our initial crawl dataset. The results showed plenty of room for improvement :)

[Figure: Tika accuracy detecting character sets]

The real point of this use case wasn’t to point out problems with Tika, but rather to demonstrate how easy it is to use the dataset to perform this type of analysis. That also makes it easy to compare alternative algorithms and to improve the Tika support, with a dataset large enough to inspire confidence in the end results.
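
As a rough illustration of this kind of analysis, here is a minimal Python sketch that tallies, per charset, how often a detector agreed with a reference charset. It is not the code used for the talk (that ran as a Hadoop job against the crawl data). The record layout, the function names, and the idea of treating a declared charset as the reference are all assumptions made for the example.

```python
from collections import Counter


def normalize(charset):
    """Lower-case and trim a charset name so 'UTF-8' and ' utf-8' compare equal."""
    return charset.strip().lower()


def accuracy_by_charset(records):
    """Return {reference charset: fraction of pages where detection agreed}.

    `records` is an iterable of (reference, detected) pairs; `detected` may be
    None when the detector gave no answer. This layout is hypothetical.
    """
    totals = Counter()
    correct = Counter()
    for reference, detected in records:
        key = normalize(reference)
        totals[key] += 1
        if detected is not None and normalize(detected) == key:
            correct[key] += 1
    return {cs: correct[cs] / totals[cs] for cs in totals}


# Made-up sample records, purely to show the call shape (not real results).
sample = [
    ("UTF-8", "utf-8"),
    ("UTF-8", "ISO-8859-1"),
    ("GB2312", "GB2312"),
    ("windows-1251", None),
]
result = accuracy_by_charset(sample)
```

Swapping in a different detector only changes how the `detected` column is produced, which is what makes side-by-side comparisons of algorithms straightforward.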

As an aside, Ted Dunning might be using this data & Mahout to train a better charset and/or language classifier, which would be a really nice addition to the Tika project. The same thing could obviously be done for language detection, which currently also suffers from similar accuracy issues, as well as being a CPU cycle hog.

3 Responses to “Hadoop User Group Meetup Talk”

  1. I’m currently using n-grams for combined encoding/language detection. It works pretty well, but it isn’t fast. The faster method is to detect the charset first (using different methods, though there could be many problems detecting the iso-8859-*, cp*, and windows-* encodings), then normalize the text and detect the language using n-grams. As an optimization, we could use information about the charset to narrow the list of possible languages (except for utf-*).

  2. Hi Alex,

    Thanks for the input – do you have any references to your code?

    And yes, I agree that using the charset to help improve language detection makes a lot of sense. Anything encoded with GB2312 is very, very likely to be Chinese, not Korean or Japanese :)

    As a side point, Ted Dunning has a good paper about using LLR (log-likelihood ratio) both to select the most important n-grams and to calculate similarity scores. As he’s noted in the past, using Pearson’s Distance has problems with commonly occurring n-grams causing large skews in scoring.

    See https://issues.apache.org/jira/browse/TIKA-369 for the issue I filed to track this. It also has some of Ted’s papers as attachments.

    – Ken

  3. I currently have no open code for this task, but I plan to implement it in the near future (I’m using Clojure for my tasks, but we can rewrite it in Java when it is ready).
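
For readers unfamiliar with the log-likelihood ratio (LLR) mentioned in the second response, the following Python sketch computes the standard G² statistic for a 2x2 table of counts. It is only an illustration of the general formula, not code from Tika, Mahout, or the papers attached to TIKA-369, and the way the four counts are assigned to n-grams here is an assumption.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    One possible (assumed) reading for n-gram selection:
      k11 = times the n-gram occurs in text of the target language
      k12 = occurrences of all other n-grams in that language
      k21 = times the n-gram occurs in text of other languages
      k22 = occurrences of all other n-grams in other languages
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing (limit of k * log k is 0)
            g += k * math.log(k * total / (rows[i] * cols[j]))
    return 2.0 * g


# An n-gram spread evenly across both corpora carries no signal.
even = llr(10, 10, 10, 10)
# An n-gram that appears only in the target language scores high.
skewed = llr(10, 0, 0, 10)
```

A score near zero means the n-gram is about equally common inside and outside the target language, so a very frequent but uninformative n-gram does not dominate the ranking.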