Strata Web Mining Tutorial
I just finished up my web mining tutorial at Strata Santa Clara 2012. It was a three-hour overview of large-scale web mining. The real challenge was including a hands-on lab in spite of poor wifi connectivity, corrupt files on USB sticks, no tables, and no local file server. In the end it worked, though not without a really late night on Monday and lots of help.
The slides are available at http://www.slideshare.net/kkrugler/strata-web-mining-tutorial
The cool thing we were able to do was set up a large Elastic MapReduce cluster (preloaded with most of the jars we needed) that students could submit jobs to directly via the AWS API. Because all of the dependent jars were already on the servers, the students’ job jars wound up being about 50K each, so uploads worked even over the overloaded conference wifi.
The web crawls then ran in Amazon’s cloud, which could obviously handle the load, rather than on whatever hardware and bandwidth we had at the event. So we’ve got a new tool for future training sessions, where having access to a real on-demand cluster is handy.
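For anyone curious what submitting a small job jar to an already-running cluster might look like, here is a minimal sketch in Python using boto3. This is not the code we used in the lab; it is an illustration under assumptions. The bucket, jar name, main class, arguments, and cluster ID are all hypothetical, and it assumes the small job jar has already been uploaded to S3 and that the cluster has the dependent jars on its classpath.

```python
def build_crawl_step(jar_s3_uri, main_class, args, name="student-crawl"):
    """Build a custom-JAR step definition in the shape the EMR API expects.

    All values passed in are hypothetical placeholders for illustration.
    """
    if not jar_s3_uri.startswith("s3://"):
        raise ValueError("job jar must be an s3:// URI")
    return {
        "Name": name,
        # Keep the shared cluster running even if one student's job fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": jar_s3_uri,
            "MainClass": main_class,
            "Args": list(args),
        },
    }


def submit_step(cluster_id, step, region="us-east-1"):
    """Add the step to a running cluster and return the new step ID.

    boto3 is imported here so the step-building code above can be used
    (and tested) without AWS credentials or the library installed.
    """
    import boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


if __name__ == "__main__":
    # Hypothetical jar location, class name, and arguments.
    step = build_crawl_step(
        "s3://example-bucket/student-job.jar",
        "com.example.CrawlTool",
        ["-domain", "example.com"],
    )
    print(step)
    # Requires AWS credentials and a real cluster ID:
    # print(submit_step("j-XXXXXXXXXXXXX", step))
```

Because only the tiny job jar and this step definition travel over the local network, the heavy lifting (dependencies, crawl traffic, processing) stays on the cluster side.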