Web Miners vs Web Masters – An Uneasy Truce

November 11, 2009

The life of a webmaster is hard, and web crawlers make it harder (photo: http://www.flickr.com/photos/absolutely_loverly/ / CC BY 2.0)

There’s the daily drama of keeping both web site users and web site developers happy. Now mix in the unpredictable side effects of automated agents hitting the site, and you can see why webmasters might think many web crawlers are evil. But web crawlers serve a very real, important role more…

Elastic Web Mining Talk

November 2, 2009

Here’s the presentation I gave at the ACM data mining unconference on elastic web mining – how to create scalable, reliable and cost-effective web mining solutions using an open source stack (Hadoop, Cascading, Bixo) running in Amazon’s Elastic Compute Cloud (EC2). But I don’t see my notes showing up, so here’s the PDF version with full notes, which make the resulting slides a lot more meaningful. more…

Announcing the Public Terabyte Dataset project

November 1, 2009

We’re very excited to announce the Public Terabyte Dataset project. This is a high-quality crawl of top web sites, using AWS’s Elastic MapReduce, Concurrent’s Cascading workflow API, and Bixo Lab’s elastic web mining platform. Hosting for the resulting dataset will be provided by Amazon in S3, and the dataset will be freely available to all EC2 users. In addition, the code used to create and process the dataset will be available for download more…

Presenting at 2009 Silicon Valley Data Mining Camp

October 30, 2009

This coming Sunday is the big Bay Area data mining “unconference”, and with more than 200 people already signed up, it’s going to be a lot of fun. I’ll be presenting at some point during the day – since it’s an unconference, you don’t really know who’s going to be talking about what/when. My topic is “Elastic web mining using open source (Hadoop/Cascading/Bixo) in Amazon’s EC2 cloud”. If you scan more…

Bixolabs Less Stealthy

October 27, 2009

It’s time to raise the curtain a bit on our new web mining platform. We’re currently running test crawls for early partners, and tuning up the GUI for the Bixolabs admin console. In the meantime, we’ll be adding more details to this site about web mining in general, and how Bixolabs deals with some of the very unusual issues you run into while crawling the web (video poker link farms more…