Strata Web Mining Tutorial

February 28, 2012

I just finished up my web mining tutorial at Strata Santa Clara 2012. It was a three-hour overview of large-scale web mining. The real challenge was including a hands-on lab in spite of poor wifi connectivity, corrupt files on USB sticks, no tables, and no local file server. In the end it worked, though not without a really late night on Monday and lots of help.

The slides are available at http://www.slideshare.net/kkrugler/strata-web-mining-tutorial

The cool thing we were able to do was set up a large Elastic MapReduce cluster (preloaded with most of the jars we needed) that students could submit jobs to directly via the AWS API. Because all of the dependent jars were already on the servers, the students’ job jars wound up being about 50K, so uploads worked even on the overloaded conference wifi.
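
For the curious, the submission path looks roughly like the sketch below, using the AWS SDK for Java. The credentials, job flow id, bucket, and class names are all placeholders, not what we actually used at the tutorial.

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class SubmitStudentJob {
    public static void main(String[] args) {
        // Per-student credentials for the shared account (placeholder values).
        AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // The thin (~50K) job jar, uploaded to S3; all heavyweight
        // dependencies are already installed on the cluster nodes.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://example-bucket/student-job.jar")
                .withMainClass("com.example.CrawlTool")
                .withArgs("-outputdir", "s3://example-bucket/student-output");

        StepConfig step = new StepConfig()
                .withName("student-crawl")
                .withHadoopJarStep(jarStep)
                .withActionOnFailure("CONTINUE");

        // "j-XXXXXXXXXXXXX" stands in for the id of the long-running cluster.
        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowId("j-XXXXXXXXXXXXX")
                .withSteps(step));
    }
}
```

Each call adds a step to the already-running job flow, so students never had to launch (or pay for) clusters of their own.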

The web crawls then ran in Amazon’s cloud, which could obviously handle the load, versus us trying to run everything at the event. So we’ve got a new tool we can use for future training, where having access to a real on-demand cluster is handy.

A (very) short intro to Hadoop

December 19, 2011

And here are the slides from the short talk on Hadoop I gave at the BigDataCamp event held in Washington, DC.

A (very) short history of big data

December 19, 2011

I finally got around to posting slides from the lightning talk I gave at the BigDataCamp event held in Washington, DC this past November.

Bay Area Hadoop User Group talk

September 3, 2011

Last week I gave a talk at the August HUG meetup on my current favorite topic – using search (or rather, Solr as a NoSQL solution) to improve big data analytics.

It’s the same general theme I covered at the Basis Technology conference in June – Hadoop is often used to convert petabytes of data into pie charts, but without the ability to poke at the raw data, it’s hard to understand and validate those results.

In the good old days of small data, you could pull out spreadsheets and dive into the raw data, but that’s no longer feasible when you’re processing multi-terabyte datasets.

Solr provides a way to query data efficiently – in effect, a poor man’s NoSQL key-value store. Using something like the Cascading Solr scheme we created, it’s trivial to generate a Solr index as part of the workflow. And setting up an on-demand Solr instance is also easy, so you once again have the ability to see (query/count/inspect) the data behind the curtain.
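
To make that concrete, here’s an abbreviated sketch of a flow that sinks tuples into a Solr index. Treat the SolrScheme constructor as an approximation of the scheme’s actual API, and the fields and paths as invented – it’s the shape of the workflow that matters.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.SequenceFile;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class IndexResultsFlow {
    public static void main(String[] args) {
        Fields indexFields = new Fields("id", "url", "title", "content");

        // Analysis results written by an earlier step in the workflow.
        Tap source = new Hfs(new SequenceFile(indexFields), "hdfs:///analytics/results");

        // Hypothetical SolrScheme usage: index fields plus a Solr conf directory.
        // See the cascading.solr project for the real constructor.
        Tap sink = new Hfs(new SolrScheme(indexFields, "/local/solr/conf"),
                "hdfs:///analytics/solr-index", SinkMode.REPLACE);

        // Pass-through pipe; in practice this is the tail of the analysis flow.
        Pipe indexPipe = new Pipe("index");

        Flow flow = new HadoopFlowConnector(new Properties())
                .connect(source, sink, indexPipe);
        flow.complete();
    }
}
```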

Scale Unlimited/Cascading case study posted

September 2, 2011

We’re heavy users of the Cascading open source project, which lets us quickly build Hadoop-based workflows to solve custom data processing problems.

Concurrent recently posted a Scale Unlimited Case Study that describes how we use Cascading, and the benefits to us (and thus to our customers). They also listed the various Cascading-related open source projects we sponsor, including the Solr scheme that makes it trivial to generate Solr search indexes from a scalable workflow.

I even had to create one of those classic, vacuous architectural diagrams…

Talk on using search with big data analytics

July 8, 2011

A few weeks back I was at the Basis Technology Government Users Conference in Washington, DC. It was an interesting experience, meeting people from agencies responsible for processing lots of important data. One thing I noticed is that in the Bay Area, your name tag at an event tries to convey that you’re working on super-cool stuff. Here in DC, it’s more cool to be classified. For example, name tags say “USG” – a generic term for “US Government”, and a common code term for “That’s Classified”.

My talk was about how search (at scale) is becoming a critical component of big data analytics. Without the ability to poke at the raw data, it’s very hard to validate and understand the high-level results of processing lots and lots of bits down to a few graphs and tables.

Basis has published the slides here, for your reading pleasure.

Cascading Avro Tap performance

March 18, 2011

Back in January, Matt Pouttu-Clarke posted his results from using the Cascading Avro tap we’d created a while back.

The most interesting result was comparing performance between parsing CSV files and reading Avro files:

[Chart: Avro vs CSV parsing – time to parse files (shorter is better)]

Reading Avro was 13.5x faster – a nice improvement over the very common practice of using text files for information exchange.
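
If you haven’t used Avro, the read side helps explain the gap: the schema is embedded in the file and records come back as typed fields, so there’s none of the per-line splitting and string-to-number conversion that CSV parsing pays for. A minimal sketch (the file and field names are placeholders):

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class ReadAvroFile {
    public static void main(String[] args) throws IOException {
        // The schema is embedded in the container file, so the reader needs
        // no field splitting or string-to-type conversion code.
        DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>();
        DataFileReader<GenericRecord> fileReader =
                new DataFileReader<GenericRecord>(new File("records.avro"), datumReader);

        while (fileReader.hasNext()) {
            GenericRecord record = fileReader.next();
            System.out.println(record.get("url")); // "url" is a placeholder field
        }
        fileReader.close();
    }
}
```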

Side note: we recently released the 1.0 version, and pushed it to the Conjars repository.

Presenting at Strata Conference Tutorial on Hadoop

January 27, 2011

This coming Tuesday, Feb 1st, I’ll be helping with the “How to Develop Big Data Applications for Hadoop” tutorial.

My specific sections will cover the “why” of using Amazon Web Services for Hadoop (hint – scaling, simplicity, savings) and the “how” – mostly discussing the nuts and bolts of running Hadoop jobs using Elastic MapReduce. I’ll also be roaming the room during the hands-on section, helping out the attendees.

I’m looking forward to the tutorial, and also the Strata Conference itself. Lots of interesting topics, and people (like Pete Warden) that I’ve always wanted to meet.

Focused web crawling

June 18, 2010

Recently some customers have been asking for a more concrete description of how we handle “focused web crawling” at Bixo Labs.

After answering the same questions a few times, it seemed like a good idea to post details to our web site – thus the new page titled Focused Crawling.

The basic concepts are straightforward, and very similar to what we did at Krugle to efficiently find web pages that were likely to be of interest to software developers. In Bixo Labs we’ve generalized the concept a bit, and implemented it using Bixo and a Cascading workflow. This gives us a lot more flexibility when it comes to customizing the behavior, as well as making it easier for us to work with customer-provided code for extension points such as scoring pages.
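
Bixo’s real extension points have their own signatures, but conceptually a customer-supplied scorer is a small piece of code like this hypothetical sketch, where the crawler fetches high-scoring links first and prunes the rest:

```java
/**
 * Hypothetical extension point for focused crawling: the crawler fetches
 * the highest-scoring links first and prunes everything below a cutoff.
 * Bixo's real interfaces differ; this just shows the shape of the idea.
 */
public interface LinkScorer {
    /** Estimated relevance of an outlink, in [0.0, 1.0]. */
    float scoreLink(String sourceUrl, String anchorText, String targetUrl);
}

class SoftwareDevScorer implements LinkScorer {
    @Override
    public float scoreLink(String sourceUrl, String anchorText, String targetUrl) {
        float score = 0.1f; // baseline for unknown links
        String text = anchorText.toLowerCase();

        // Cheap signals that a link leads to developer-oriented content.
        if (text.contains("api") || text.contains("download") || text.contains("source")) {
            score += 0.5f;
        }
        if (targetUrl.contains("/javadoc/") || targetUrl.endsWith(".java")) {
            score += 0.4f;
        }
        return Math.min(score, 1.0f);
    }
}
```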

Hadoop User Group Meetup Talk

April 22, 2010

Last night I did a presentation at the April Bay Area Hadoop User Group meetup, hosted by Yahoo. With 250+ people in attendance, the interest in Hadoop clearly continues to grow.

Dekel has posted the slides of my talk, as well as a (very quiet) video.

My talk was on the status of the Public Terabyte Dataset (PTD) project, and advice on running jobs in Amazon’s Elastic MapReduce (EMR) cloud. As part of the PTD architecture, we wound up using Amazon’s SimpleDB for storing the crawl DB, so one section of my talk covered what we learned about using SimpleDB to save persistent data (crawl state) efficiently and inexpensively, while still using EMR for bursty processing. I’d previously blogged about our SimpleDB tap & scheme for Cascading, and our use of it for PTD has helped shake out some bugs.
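
As a taste of what that looks like, here’s a minimal sketch of writing one URL’s crawl state via the AWS SDK for Java. The domain name and attributes are illustrative, not the actual PTD layout, and the real workflow goes through our Cascading tap rather than raw SDK calls:

```java
import java.util.Arrays;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;

public class SaveCrawlState {
    public static void main(String[] args) {
        AmazonSimpleDB sdb = new AmazonSimpleDBClient(
                new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        // One item per URL; the domain name and attributes are illustrative.
        // The "true" flag replaces any existing value for the attribute.
        sdb.putAttributes(new PutAttributesRequest()
                .withDomainName("crawldb")
                .withItemName("http://example.com/page.html")
                .withAttributes(Arrays.asList(
                        new ReplaceableAttribute("status", "FETCHED", true),
                        new ReplaceableAttribute("fetchTime", "2010-04-22T03:15:00Z", true))));
    }
}
```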

We also decided to use Apache Avro for our output format. This meant creating a Cascading scheme, which would have been pretty painful but for the fortuitous, just-in-time release of Hadoop mapreduce support code in the Avro project (thanks to Doug & Scott for that). Vivek mentioned this new project in his recent blog post about our first release of PTD data, and we’re looking forward to others using it to read and write Avro files.
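
The support code in question is the org.apache.avro.mapred package, which supplies the input/output formats and serialization glue that a scheme like ours can build on. A minimal sketch of the output-side setup, with an invented record schema:

```java
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroJob;
import org.apache.hadoop.mapred.JobConf;

public class AvroOutputSetup {
    public static void main(String[] args) {
        JobConf conf = new JobConf(AvroOutputSetup.class);

        // Invented record schema for one crawled page.
        Schema schema = Schema.parse(
                "{\"type\":\"record\",\"name\":\"Page\",\"fields\":["
                + "{\"name\":\"url\",\"type\":\"string\"},"
                + "{\"name\":\"html\",\"type\":\"string\"}]}");

        // AvroJob wires up the Avro input/output formats and serialization;
        // a Cascading scheme can delegate to this same machinery.
        AvroJob.setOutputSchema(conf, schema);
    }
}
```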

The real-world use case I described in my talk was analyzing the quality of the Tika charset detection, using HTML data from our initial crawl dataset. The results showed plenty of room for improvement 🙂

[Chart: Tika accuracy detecting character sets]

The real point of this use case wasn’t to point out problems with Tika, but rather to demonstrate how easy it is to use the dataset for this type of analysis. That also makes it easy to compare alternative algorithms, and to improve the Tika support with a dataset large enough to inspire confidence in the end results.
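
The analysis itself boils down to running Tika’s detector over each page’s raw bytes and comparing the answer to what the server claimed. A minimal per-page sketch – the file name and declared charset are stand-ins, since the real job does this over the whole dataset inside a Hadoop workflow:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public class CheckCharset {
    public static void main(String[] args) throws IOException {
        // Placeholder input; the real analysis streams pages out of the
        // crawl dataset inside a Hadoop workflow.
        byte[] html = Files.readAllBytes(Paths.get("page.html"));

        CharsetDetector detector = new CharsetDetector();
        detector.setText(html);
        CharsetMatch match = detector.detect();

        // Compare against the charset recorded from the HTTP headers or
        // meta tags ("UTF-8" here is a stand-in for that recorded value).
        System.out.printf("detected=%s confidence=%d declared=%s%n",
                match.getName(), match.getConfidence(), "UTF-8");
    }
}
```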

As an aside, Ted Dunning might be using this data & Mahout to train a better charset classifier, which would be a really nice addition to the Tika project. The same thing could obviously be done for language detection, which currently suffers from similar accuracy issues, as well as being a CPU cycle hog.