Cascading Avro Tap performance

March 18, 2011

Tags: avro, cascading

Back in January, Matt Pouttu-Clarke posted his results from using the Cascading Avro tap we’d created a while back. The most interesting result was the performance comparison between parsing CSV files and reading Avro files: reading Avro was 13.5x faster, a nice improvement over the very common practice of using text files for information exchange. Side note: we recently released the 1.0 version and pushed it to the Conjars repository.

Comments are off for this post
Filed under: Uncategorized by kkrugler

Hadoop User Group Meetup Talk

April 22, 2010

Tags: avro, cascading, elastic mapreduce, hadoop, public terabyte dataset, simpledb

Last night I gave a presentation at the April Hadoop Bay Area User Group meetup, hosted by Yahoo. With 250+ people in attendance, interest in Hadoop clearly continues to grow. Dekel has posted the slides of my talk, as well as a (very quiet) video. My talk covered the status of the Public Terabyte Dataset (PTD) project and offered advice on running jobs in Amazon’s Elastic MapReduce (EMR) cloud. As more…

3 comments so far
Filed under: Uncategorized by kkrugler

First Sample of Public Terabyte Dataset

April 21, 2010

Tags: avro, cascading, public terabyte dataset

We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we’re using Cascading for this project, we have also released a Cascading Avro Scheme to read and write Avro files. To help you get a jump start on using this dataset, we more…

7 comments so far
Filed under: Uncategorized by kkrugler
