Strata Web Mining Tutorial

February 28, 2012

I just finished up my web mining tutorial at Strata Santa Clara 2012. It was a three-hour overview of large-scale web mining. The real challenge was including a hands-on lab in spite of poor wifi connectivity, corrupt files on USB sticks, no tables, and no local file server. In the end it worked, though not without a really late night on Monday and lots of help.

The slides are available at

The cool thing we were able to do was set up a large Elastic MapReduce cluster (preloaded with most of the jars we needed) that students could submit jobs to directly via the AWS API. Because all of the dependent jars were already on the servers, the students’ job jars wound up being about 50K, so uploads worked even on the overloaded conference wifi network.

The web crawls then ran in Amazon’s cloud, which could obviously handle the load, rather than us trying to run them at the event. So we’ve got a new tool for future training sessions where access to a real on-demand cluster is handy.
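
For anyone curious what "submit a job to a running cluster via the AWS API" might look like, here is a rough sketch in Python. It is not the code we used in the tutorial. It assumes the boto3 library (which postdates this event), that the small job jar has already been uploaded to S3, and that the cluster ID, bucket, class name, and arguments shown are all hypothetical placeholders.

```python
"""Sketch: add a small job jar as a step on an already-running EMR cluster."""


def build_step(name, jar_s3_uri, main_class, args):
    """Return a step definition in the shape boto3's add_job_flow_steps expects."""
    return {
        "Name": name,
        # CONTINUE keeps the shared cluster running if one student's job fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": jar_s3_uri,        # the thin job jar; dependencies live on the cluster
            "MainClass": main_class,
            "Args": list(args),
        },
    }


def submit_step(cluster_id, step, region="us-east-1"):
    """Send the step to the cluster and return the new step ID.

    Needs boto3 and valid AWS credentials, so it is not called by default.
    """
    import boto3  # imported here so build_step works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    step = build_step(
        name="student-crawl",
        jar_s3_uri="s3://example-bucket/jobs/student-crawl.jar",
        main_class="com.example.CrawlTool",
        args=["-domain", "example.com"],
    )
    print(step)
    # submit_step("j-XXXXXXXXXXXXX", step)  # uncomment with a real cluster ID
```

The point of the design is that only the thin jar travels over the conference network; everything heavy is already sitting on the cluster.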

One Response to “Strata Web Mining Tutorial”

  1. Great job Ken. I attended this event – it was a very effective hands-on session using files on USB sticks. We were able to submit jobs via AWS API.

    I recommend this instructor highly. All attendees were very impressed with this session.

    All future technology sessions should be done in this way, to provide real hands-on experience, instead of just slides.