Cascading Avro Tap performance

March 18, 2011

Back in January, Matt Pouttu-Clarke posted his results from using the Cascading Avro tap we’d created a while back. The most interesting result compared the performance of parsing CSV files against reading Avro files: a 13.5x speedup is a nice improvement over the very common practice of using text files for information exchange. Side note: we recently released the 1.0 version and pushed it to the Conjars repository.

Hadoop User Group Meetup Talk

April 22, 2010

Last night I gave a presentation at the April Hadoop Bay Area User Group meetup, hosted by Yahoo. More than 250 people attended, so interest in Hadoop continues to grow. Dekel has posted the slides of my talk, as well as a (very quiet) video. My talk covered the status of the Public Terabyte Dataset (PTD) project and offered advice on running jobs in Amazon’s Elastic MapReduce (EMR) cloud. As more…

First Sample of Public Terabyte Dataset

April 21, 2010

We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we’re using Cascading for this project, we have also released a Cascading Avro Scheme to read and write Avro files. To help you get a jump start on using this dataset, we more…