A (very) short intro to Hadoop

December 19, 2011

Here are the slides from the short talk on Hadoop I gave at the BigDataCamp event held in Washington, DC.

A (very) short history of big data

December 19, 2011

I finally got around to posting slides from the lightning talk I gave at the BigDataCamp event held in Washington, DC this past November.

Bay Area Hadoop User Group talk

September 3, 2011

Last week I gave a talk at the August HUG meetup on my current favorite topic – using search (or rather, Solr as a NoSQL solution) to improve big data analytics.

It’s the same general theme I covered at the Basis Technology conference in June – Hadoop is often used to convert petabytes of data into pie charts, but without the ability to poke at the raw data, it’s often hard to understand and validate those results.

In the good old days of small data, you could pull out spreadsheets and dive into the raw data, but that’s no longer feasible when you’re processing multi-terabyte datasets.

Solr provides an efficient way to query data, serving as a poor man’s NoSQL key-value store. Using something like the Cascading Solr scheme we created, it’s trivial to generate a Solr index as part of the workflow. And setting up an on-demand Solr instance is also easy, so you once again have the ability to see (query/count/inspect) the data behind the curtain.
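To make the workflow idea concrete, here’s a pseudocode sketch of what wiring a Solr index sink onto the tail of an existing Cascading flow looks like. The class and constructor names below are assumptions for illustration, not the actual cascading.solr API:

```
// Pseudocode sketch (names assumed, not the real API):
// reuse the tail pipe of the existing analytics workflow,
// but point its sink Tap at a Solr scheme instead of plain files.
Tap source = new Hfs(new SequenceFile(analysisFields), inputPath);
Tap solrSink = new SomeSolrTap(solrConfDir, indexOutputPath);
Pipe analysis = ...;  // existing analysis pipe assembly

new FlowConnector().connect(source, solrSink, analysis).complete();
// The finished flow leaves behind a Solr index that an on-demand
// Solr instance can load, so you can query/count/inspect the raw
// records behind each chart.
```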

Scale Unlimited/Cascading case study posted

September 2, 2011

We’re heavy users of the Cascading open source project, which lets us quickly build Hadoop-based workflows to solve custom data processing problems.

Concurrent recently posted a Scale Unlimited Case Study that describes how we use Cascading, and the benefits to us (and thus to our customers). They also listed the various Cascading-related open source projects we sponsor, including the Solr scheme that makes it trivial to generate Solr search indexes from a scalable workflow.

I even had to create one of those classic, vacuous architectural diagrams…

Talk on using search with big data analytics

July 8, 2011

A few weeks back I was at the Basis Technology Government Users Conference in Washington, DC. It was an interesting experience, meeting people from agencies responsible for processing lots of important data. One thing I noticed: in the Bay Area, your name tag at an event tries to convey that you’re working on super-cool stuff. Here in DC, it’s more cool to be classified. For example, you’ll see name tags that say “USG” – a generic term for “US Government”, and a common code term for “That’s Classified”.

My talk was about how search (at scale) is becoming a critical component of big data analytics. Without the ability to poke at the raw data, it’s very hard to validate and understand the high level results of processing lots and lots of bits down to a few graphs and tables.

Basis has published the slides here, for your reading pleasure.

Cascading Avro Tap performance

March 18, 2011

Back in January, Matt Pouttu-Clarke posted his results from using the Cascading Avro tap we’d created a while back.

The most interesting result was comparing performance between parsing CSV files and reading Avro files:

[Chart: Avro vs CSV parsing time – time to parse files (shorter is better)]

A 13.5x speedup is a nice improvement over the very common practice of using text files for information exchange.
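To show where that text-parsing cost comes from, here’s a toy, stdlib-only sketch contrasting the two code paths. This is not the benchmark above and not Avro itself – just an illustration that CSV requires scanning for delimiters and converting digit strings, while a binary format reads values directly:

```java
import java.io.*;
import java.util.Arrays;

public class ParseCost {
    // Text route: scan for delimiters, then convert each digit string.
    static int[] parseCsvRow(String row) {
        String[] fields = row.split(",");
        int[] values = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Integer.parseInt(fields[i].trim());
        }
        return values;
    }

    // Binary route: fixed-width values are written and read back
    // directly, with no scanning or string-to-number conversion.
    static int[] binaryRoundTrip(int[] values) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int v : values) {
            out.writeInt(v);
        }
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        int[] result = new int[values.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = in.readInt();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        int[] fromText = parseCsvRow("42, 1734, 987654");
        int[] fromBinary = binaryRoundTrip(fromText);
        System.out.println(Arrays.equals(fromText, fromBinary)); // prints "true"
    }
}
```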

Side note: we recently released the 1.0 version, and pushed it to the Conjars repository.

Presenting at Strata Conference Tutorial on Hadoop

January 27, 2011

This coming Tuesday, Feb 1st I’ll be helping at the “How to Develop Big Data Applications for Hadoop” tutorial.

My specific sections will cover the “why” of using Amazon Web Services for Hadoop (hint – scaling, simplicity, savings) and the “how” – mostly discussing the nuts and bolts of running Hadoop jobs using Elastic MapReduce. I’ll also be roaming the room during the hands-on section, helping out the attendees.

I’m looking forward to the tutorial, and also the Strata Conference itself. Lots of interesting topics, and people (like Pete Warden) that I’ve always wanted to meet.

Focused web crawling

June 18, 2010

Recently some customers have been asking for a more concrete description of how we handle “focused web crawling” at Bixo Labs.

After answering the same questions a few times, it seemed like a good idea to post details to our web site – thus the new page titled Focused Crawling.

The basic concepts are straightforward, and very similar to what we did at Krugle to efficiently find web pages that were likely to be of interest to software developers. In Bixo Labs we’ve generalized the concept a bit, and implemented it using Bixo and a Cascading workflow. This gives us a lot more flexibility when it comes to customizing the behavior, as well as making it easier for us to work with customer-provided code for extension points such as scoring pages.
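The post doesn’t show Bixo’s actual extension-point interfaces, but as a toy sketch of the kind of page-scoring hook involved – the class name and keyword-relevance logic below are mine, purely for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy focused-crawl scoring hook: score a page's text by how relevant
// it looks to a target topic, and only follow its outlinks when the
// score clears a threshold. Illustrative only, not Bixo's real API.
public class KeywordPageScorer {
    private final Set<String> topicTerms;

    public KeywordPageScorer(String... terms) {
        this.topicTerms = new HashSet<String>(Arrays.asList(terms));
    }

    // Fraction of words in the text that match the topic vocabulary.
    public double score(String pageText) {
        String[] words = pageText.toLowerCase().split("\\W+");
        if (words.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (String word : words) {
            if (topicTerms.contains(word)) {
                hits++;
            }
        }
        return (double) hits / words.length;
    }

    // A focused crawler queues this page's outlinks only if the page
    // itself looks on-topic.
    public boolean shouldFollowLinks(String pageText, double threshold) {
        return score(pageText) >= threshold;
    }

    public static void main(String[] args) {
        KeywordPageScorer scorer = new KeywordPageScorer("hadoop", "mapreduce");
        System.out.println(scorer.shouldFollowLinks("Intro to Hadoop and MapReduce jobs", 0.25)); // prints "true"
    }
}
```

In the real system this kind of scoring function would be one of the customer-provided extension points plugged into the Bixo/Cascading workflow.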

Hadoop User Group Meetup Talk

April 22, 2010

Last night I did a presentation at the April Hadoop Bay Area User Group meetup, hosted by Yahoo. 250+ people in attendance, so the interest in Hadoop continues to grow.

Dekel has posted the slides of my talk, as well as a (very quiet) video.

My talk was on the status of the Public Terabyte Dataset (PTD) project, and advice on running jobs in Amazon’s Elastic MapReduce (EMR) cloud. As part of the PTD architecture, we wound up using Amazon’s SimpleDB for storing the crawl DB, thus one section of my talk was on what we learned about using that to efficiently and inexpensively save persistent data (crawl state) while still using EMR for bursty processing. I’d previously blogged about our SimpleDB tap & scheme for Cascading, and our use of it for PTD has helped shake out some bugs.

As well, we decided to use Apache Avro for our output format. This meant creating a Cascading scheme, which would have been pretty painful but for the fortuitous, just-in-time release of Hadoop mapreduce support code in the Avro project (thanks to Doug & Scott for that). Vivek mentioned this new project in his recent blog post about our first release of PTD data, and we’re looking forward to others using this to read/write Avro files.

The real-world use case I described in my talk was analyzing the quality of the Tika charset detection, using HTML data from our initial crawl dataset. The results showed plenty of room for improvement :)

[Chart: Tika accuracy detecting character sets]

The real point of this use case wasn’t to point out problems with Tika, but rather to demonstrate how easy it is to use the dataset to perform this type of analysis. Which means it’s also easy to compare alternative algorithms, and improve the Tika support with a large enough dataset to inspire confidence in the end results.

As an aside, Ted Dunning might be using this data & Mahout to train a better charset and/or language classifier, which would be a really nice addition to the Tika project. The same thing could obviously be done for language detection, which currently also suffers from similar accuracy issues, as well as being a CPU cycle hog.

First Sample of Public Terabyte Dataset

April 21, 2010

We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format, instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we’re using Cascading for this project, we have also released a Cascading Avro Scheme to read and write Avro files.

To get you a jump start on leveraging this dataset, we have posted a small sample of it in S3, in the bixolabs-ptd-demo bucket, along with the Avro JSON schema needed to access the files. For those unfamiliar with working with Avro files, here’s a sample snippet that illustrates one way of reading them:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;

Schema schema = Schema.parse(jsonSchemaFile);
DataFileReader<GenericData.Record> reader = new DataFileReader<GenericData.Record>(avroFile, new GenericDatumReader<GenericData.Record>(schema));
while (reader.hasNext()) {
    GenericData.Record record = reader.next();
    // Access individual fields by name, e.g. record.get("fieldName")
}
reader.close();

Please take a look, and let us know if there’s any raw content you’d want that’s missing. We’ve intentionally avoided post-processing the results – this is source data for exactly that type of activity.