Hadoop User Group Meetup Talk

April 22, 2010

Last night I gave a presentation at the April Hadoop Bay Area User Group meetup, hosted by Yahoo. More than 250 people attended, so interest in Hadoop continues to grow. Dekel has posted the slides of my talk, as well as a (very quiet) video. My talk covered the status of the Public Terabyte Dataset (PTD) project and offered advice on running jobs in Amazon’s Elastic MapReduce (EMR) cloud. As more…

First Sample of Public Terabyte Dataset

April 21, 2010

We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we’re using Cascading for this project, we have also released a Cascading Avro Scheme to read and write Avro files. To help you get started with this dataset, we more…

Public web crawler projects

December 2, 2009

Several people have pointed me to other public/non-profit projects doing large-scale public web crawls, so I thought I’d summarize the ones I now know about below. And if you know of others, please add your comments and I’ll update the list. Wayback Machine – A time-series snapshot of important web pages, from 1996 to present. 150B pages crawled in total as of 2009. The data is searchable, but not available more…

Proposals for Big Data web mining talk

November 16, 2009

I’m going to be giving a talk at the Bay Area ACM data mining SIG in December, and I need to finalize my topic soon – like today 🙂 I was going to expand on my Elastic Web Mining talk (“Web mining for SEO keywords”) from the ACM data mining unconference a few weeks back. But the fact that I’ll have 10s to 100s of millions of web page data more…

Announcing the Public Terabyte Dataset project

November 1, 2009

We’re very excited to announce the Public Terabyte Dataset project. This is a high-quality crawl of top web sites, using AWS’s Elastic MapReduce, Concurrent’s Cascading workflow API, and Bixo Labs’ elastic web mining platform. Hosting for the resulting dataset will be provided by Amazon in S3, and the data will be freely available to all EC2 users. In addition, the code used to create and process the dataset will be available for download more…