A (very) short history of big data

December 19, 2011

I finally got around to posting slides from the lightning talk I gave at the BigDataCamp event held in Washington, DC this past November.

Bay Area Hadoop User Group talk

September 3, 2011

Last week I gave a talk at the August HUG meetup on my current favorite topic – using search (or rather, Solr as a NoSQL solution) to improve big data analytics.

It’s the same general theme I covered at the Basis Technology conference in June – Hadoop is often used to convert petabytes of data into pie charts, but without the ability to poke at the raw data, it’s often hard to understand and validate those results.

In the good old days of small data, you could pull out spreadsheets and dive into the raw data, but that’s no longer feasible when you’re processing multi-terabyte datasets.

Solr provides a way to query data efficiently, using it as a poor man’s NoSQL key-value store. Using something like the Cascading Solr scheme we created, it’s trivial to generate a Solr index as part of the workflow. And setting up an on-demand Solr instance is also easy, so you once again have the ability to see (query/count/inspect) the data behind the curtain.

Scale Unlimited/Cascading case study posted

September 2, 2011

We’re heavy users of the Cascading open source project, which lets us quickly build Hadoop-based workflows to solve custom data processing problems.

Concurrent recently posted a Scale Unlimited Case Study that describes how we use Cascading, and the benefits to us (and thus to our customers). They also listed the various Cascading-related open source projects we sponsor, including the Solr scheme that makes it trivial to generate Solr search indexes from a scalable workflow.

I even had to create one of those classic, vacuous architectural diagrams…

Talk on using search with big data analytics

July 8, 2011

A few weeks back I was at the Basis Technology Government Users Conference in Washington, DC. It was an interesting experience, meeting people from agencies responsible for processing lots of important data. One thing I noticed is that in the Bay Area, your name tag at an event tries to convey that you’re working on super-cool stuff. Here in DC, it’s more cool to be classified – for example, name tags that say “USG”, a generic term for “US Government” and a common code term for “That’s Classified”.

My talk was about how search (at scale) is becoming a critical component of big data analytics. Without the ability to poke at the raw data, it’s very hard to validate and understand the high level results of processing lots and lots of bits down to a few graphs and tables.

Basis has published the slides here, for your reading pleasure.

Cascading Avro Tap performance

March 18, 2011

Back in January, Matt Pouttu-Clarke posted his results from using the Cascading Avro tap we’d created a while back.

The most interesting result was comparing performance between parsing CSV files and reading Avro files:

[Chart: Avro vs CSV time to parse files (shorter is better)]

13.5x faster is a nice improvement over the very common practice of using text files for information exchange.

Side note: we recently released the 1.0 version, and pushed it to the Conjars repository.

Presenting at Strata Conference Tutorial on Hadoop

January 27, 2011

Strata 2011
This coming Tuesday, Feb 1st I’ll be helping at the “How to Develop Big Data Applications for Hadoop” tutorial.

My specific sections will cover the “why” of using Amazon Web Services for Hadoop (hint – scaling, simplicity, savings) and the “how” – mostly discussing the nuts and bolts of running Hadoop jobs using Elastic MapReduce. I’ll also be roaming the room during the hands-on section, helping out the attendees.

I’m looking forward to the tutorial, and also the Strata Conference itself. Lots of interesting topics, and people (like Pete Warden) that I’ve always wanted to meet.

Focused web crawling

June 18, 2010

Recently some customers have been asking for a more concrete description of how we handle “focused web crawling” at Bixo Labs.

After answering the same questions a few times, it seemed like a good idea to post details to our web site – thus the new page titled Focused Crawling.

The basic concepts are straightforward, and very similar to what we did at Krugle to efficiently find web pages that were likely to be of interest to software developers. In Bixo Labs we’ve generalized the concept a bit, and implemented it using Bixo and a Cascading workflow. This gives us a lot more flexibility when it comes to customizing the behavior, as well as making it easier for us to work with customer-provided code for extension points such as scoring pages.

Hadoop User Group Meetup Talk

April 22, 2010

Last night I did a presentation at the April Hadoop Bay Area User Group meetup, hosted by Yahoo. 250+ people in attendance, so the interest in Hadoop continues to grow.

Dekel has posted the slides of my talk, as well as a (very quiet) video.

My talk was on the status of the Public Terabyte Dataset (PTD) project, and advice on running jobs in Amazon’s Elastic MapReduce (EMR) cloud. As part of the PTD architecture, we wound up using Amazon’s SimpleDB for storing the crawl DB, thus one section of my talk was on what we learned about using that to efficiently and inexpensively save persistent data (crawl state) while still using EMR for bursty processing. I’d previously blogged about our SimpleDB tap & scheme for Cascading, and our use of it for PTD has helped shake out some bugs.

As well, we decided to use Apache Avro for our output format. This meant creating a Cascading scheme, which would have been pretty painful but for the fortuitous, just-in-time release of Hadoop mapreduce support code in the Avro project (thanks to Doug & Scott for that). Vivek mentioned this new project in his recent blog post about our first release of PTD data, and we’re looking forward to others using this to read/write Avro files.

The real-world use case I described in my talk was analyzing the quality of the Tika charset detection, using HTML data from our initial crawl dataset. The results showed plenty of room for improvement :)

[Chart: Tika accuracy detecting character sets]

The real point of this use case wasn’t to point out problems with Tika, but rather to demonstrate how easy it is to use the dataset to perform this type of analysis. Which means it’s also easy to compare alternative algorithms, and improve the Tika support with a large enough dataset to inspire confidence in the end results.

As an aside, Ted Dunning might be using this data & Mahout to train a better charset and/or language classifier, which would be a really nice addition to the Tika project. The same thing could obviously be done for language detection, which currently also suffers from similar accuracy issues, as well as being a CPU cycle hog.

First Sample of Public Terabyte Dataset

April 21, 2010

We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format, instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we’re using Cascading for this project, we have also released a Cascading Avro Scheme to read and write Avro files.

To help you get started with this dataset, we have posted a small sample in S3 in the bixolabs-ptd-demo bucket, along with the Avro JSON schema needed to access the files. For those unfamiliar with Avro, here’s a snippet that illustrates one way of reading them:

Schema schema = Schema.parse(jsonSchemaFile);
DataFileReader<GenericData.Record> reader =
    new DataFileReader<GenericData.Record>(avroFile, new GenericDatumReader<GenericData.Record>(schema));
while (reader.hasNext()) {
    GenericData.Record record = reader.next();
    // You can access the fields in this record like this...
    // Object value = record.get("fieldName");
}
reader.close();

Please take a look, and let us know if there’s any missing raw content that you’d want. We’ve intentionally avoided doing post-processing of the results – this is source data for exactly that type of activity.

SimpleDB Tap for Cascading

March 16, 2010

Recently we’ve been running a number of large, multi-phase web mining applications in Amazon’s EC2 & Elastic MapReduce (EMR), and we needed a better way to maintain state than pushing sequence files back and forth between HDFS and S3.

One option was to set up an HBase cluster, but then we’d be paying 24×7 for servers that we’d only need for a few minutes each day. We could also set up MySQL with persistent storage on an Amazon EBS volume, but then we’d have to configure & launch MySQL on our cluster master for each mining “loop”, and for really big jobs it wouldn’t scale well to 100M+ items.

So we spent some time creating a Cascading tap & scheme that lets us use Amazon’s SimpleDB to maintain the state of web mining & crawling jobs. It’s still pretty rough, but usable. The code is publicly available at GitHub – check out http://github.com/bixolabs/cascading.simpledb.

There’s also a README to help you get started, which I’ve copied below since it contains useful information about the project.

As to the big question on performance – not sure yet how well it handles a SimpleDB with 100M+ entries, but we’re heading there fast on one project, so more details to follow. Enjoy…



cascading.simpledb is a Cascading Tap & Scheme for Amazon’s SimpleDB.

This means you can use SimpleDB as the source of tuples for a Cascading
flow, and as a sink for saving results. This is particularly useful when
you need a scalable, persistent store for Cascading jobs being run in
Amazon’s EC2/EMR cloud environment.

Information about SimpleDB is available from http://aws.amazon.com/simpledb/
and also http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/

Note that you will need to be signed up to use both AWS and SimpleDB, and
have valid AWS access key and secret key values before using this code. See
the links above for details.

In order to get acceptable performance, the cascading.simpledb scheme splits
each virtual “table” of data in SimpleDB across multiple shards. A shard
corresponds to what SimpleDB calls a domain. This allows most requests to
be run in parallel across multiple mappers, without having to worry about
duplicate records being returned for the same request.

Each record (Cascading tuple, or SimpleDB item) has an implicit field called
SimpleDBUtils-itemHash. This is a zero-padded hash of the record’s key or item
value. This is another SimpleDB concept – every record has a unique key, used
to read it directly.

Records (items) are split between shards using partitions of this hash value. This
implies that once a table has been created and populated with items, there is no
easy way to change the number of shards; you essentially have to build a new
table and copy all of the values.
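As a rough illustration of the partitioning described above, here's a sketch of mapping an item key to a zero-padded hash and then to a shard. The padding width, hash function, and class/method names are assumptions for illustration, not the actual cascading.simpledb implementation.

```java
public class ShardMapping {
    // Zero-pad the hash so lexicographic string comparison matches numeric
    // order, which is what SimpleDB's string-based queries require.
    public static String itemHash(String itemKey) {
        long hash = itemKey.hashCode() & 0xFFFFFFFFL; // treat the 32-bit hash as unsigned
        return String.format("%010d", hash);
    }

    // Partition the unsigned 32-bit hash space evenly across the shards (domains),
    // so each mapper can query one shard without seeing duplicate records.
    public static int shardForKey(String itemKey, int numShards) {
        long hash = Long.parseLong(itemHash(itemKey));
        long rangePerShard = (0xFFFFFFFFL / numShards) + 1;
        return (int) (hash / rangePerShard);
    }

    public static void main(String[] args) {
        System.out.println("shard = " + shardForKey("user-12345", 20));
    }
}
```

Because the shard is a pure function of the key hash and the shard count, changing the shard count changes where every record lives – which is exactly why re-sharding requires rebuilding the table.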

The implicit itemHash field could also be used to parallelize search requests within
a single shard, by further partitioning. This performance boost is not yet
implemented by cascading.simpledb, however.


  // Specify fields in SimpleDB item, and which field is to be used as the key.
  Fields itemFields = new Fields("name", "birthday", "occupation", "status");
  SimpleDBScheme sourceScheme = new SimpleDBScheme(itemFields, new Fields("name"));
  // Load people that haven't yet been contacted, up to 1000
  String query = "`status` = \"NOT_CONTACTED\"";
  int numShards = 20;
  String tableName = "people";
  Tap sourceTap = new SimpleDBTap(sourceScheme, accessKey, secretKey, tableName, numShards);
  Pipe processingPipe = new Each("generate email pipe", new SendEmailOperation());
  // Use the same scheme as the source - query & limit are ignored
  Tap sinkTap = new SimpleDBTap(sourceScheme, accessKey, secretKey, tableName, numShards);
  Flow flow = new FlowConnector().connect(sourceTap, sinkTap, processingPipe);


All limitations of the underlying SimpleDB system obviously apply. That means
things like the maximum number of shards (100), the maximum size of any one
item (1MB), the maximum size of any field value (1024) and so on. See details
at http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/SDBLimits.html

Given the 1K max field value length, SimpleDB is most useful for storing small
chunks of data, or references to bigger blobs that can be saved in S3.

In addition, all values are stored as strings. This means all fields must be
round-trippable as text.
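Since all values are stored and compared as strings, numeric fields need a sort-friendly text encoding. Here's a minimal sketch of the usual offset-and-zero-pad trick (the padding width, offset, and class name are illustrative, not part of the library):

```java
public class NumericEncoding {
    // Shift values by a fixed offset so negatives become non-negative,
    // then zero-pad so lexicographic order equals numeric order.
    // Only valid for values >= -OFFSET.
    private static final long OFFSET = 1000000000L;

    public static String encode(long value) {
        return String.format("%019d", value + OFFSET);
    }

    public static long decode(String encoded) {
        return Long.parseLong(encoded) - OFFSET;
    }
}
```

Without this, a plain string comparison would sort "9" after "10" and put negative numbers in the wrong order entirely.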

Finally, SimpleDB does not guarantee immediate consistency by default. It can guarantee consistency when reading back the results of a write (or when doing queries for the same), but this imposes a significant performance penalty, so the tap uses the default “eventually consistent” mode. This typically isn’t a problem due to the batch-oriented nature of most Cascading workflows, but could be an issue if you had multiple jobs writing to and reading from the same table.

Known Issues

Currently you need to correctly specify the number of shards for a table when
you define the tap. This is error prone, and only necessary when creating (or
re-creating) the table from scratch.

Tuple values that are null will not be updated in the table, which means you
can’t delete values, only add or update them.

Some operations are not multi-threaded, and thus take longer than they should.
For example, calculating the splits for a read will make a series of requests
to SimpleDB to get the item counts.

Numeric fields should automatically be stored as zero-padded strings to ensure
proper sort behavior, but currently this is only done for the implicit hash field.


You need Apache Ant 1.7 or higher, and a git client.

1. Download source from GitHub

% git clone git://github.com/bixolabs/cascading.simpledb.git
% cd cascading.simpledb

2. Set appropriate credentials for testing

% edit src/test/resources/aws-keys.txt

Enter valid AWS access key and secret key values for the two corresponding properties.

3. Build the jar

% ant clean jar

or to build and install the jar in your local Maven repo:

% ant clean install

4. Create Eclipse project files

% ant eclipse

Then, from Eclipse follow the standard procedure to import an existing Java project into your Workspace.