January 27, 2011
This coming Tuesday, Feb 1st, I’ll be helping at the “How to Develop Big Data Applications for Hadoop” tutorial. My specific sections will cover the “why” of using Amazon Web Services for Hadoop (hint – scaling, simplicity, savings) and the “how” – mostly the nuts and bolts of running Hadoop jobs using Elastic MapReduce. I’ll also be roaming the room during the hands-on section, helping out the attendees. I’m more…
June 18, 2010
Recently some customers have been asking for a more concrete description of how we handle “focused web crawling” at Bixo Labs. After answering the same questions a few times, it seemed like a good idea to post details to our web site – thus the new page titled Focused Crawling. The basic concepts are straightforward, and very similar to what we did at Krugle to efficiently find web pages that more…
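The core idea behind focused crawling – prioritizing the fetch frontier by an estimate of how relevant each page is to the topic – can be sketched in a few lines of Python. This is a generic illustration, not Bixo’s actual (Java/Cascading-based) implementation; all function names and the keyword-overlap scoring here are hypothetical:

```python
import heapq

def relevance(text, keywords):
    """Score a page by the fraction of target keywords it mentions."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)

def focused_crawl(seed_urls, fetch, extract_links, keywords, max_pages=10):
    """Fetch highest-priority pages first.

    fetch(url) returns the page text; extract_links(url) returns its out-links.
    heapq is a min-heap, so priorities are stored negated.
    """
    frontier = [(-1.0, url) for url in seed_urls]  # seeds get top priority
    heapq.heapify(frontier)
    seen, fetched = set(seed_urls), []
    while frontier and len(fetched) < max_pages:
        _, url = heapq.heappop(frontier)
        text = fetch(url)
        fetched.append(url)
        score = relevance(text, keywords)
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                # Simple heuristic: a link inherits its parent page's score,
                # so links from on-topic pages are crawled before off-topic ones.
                heapq.heappush(frontier, (-score, link))
    return fetched
```

Real focused crawlers refine the priority function (link anchor text, URL patterns, classifier scores), but the frontier-with-priorities structure is the common core.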
April 22, 2010
Last night I did a presentation at the April Hadoop Bay Area User Group meetup, hosted by Yahoo. There were 250+ people in attendance, so interest in Hadoop continues to grow. Dekel has posted the slides of my talk, as well as a (very quiet) video. My talk was on the status of the Public Terabyte Dataset (PTD) project, and advice on running jobs in Amazon’s Elastic MapReduce (EMR) cloud. As more…
April 21, 2010
We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format, instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we’re using Cascading for this project, we have also released a Cascading Avro Scheme to read and write Avro files. To help you get a jump start on leveraging this dataset, we more…
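One reason Avro is cross-language is that its schemas are plain JSON. As a sketch of what a crawled-page record could look like (this is an illustrative schema, not the actual PTD schema – all field names here are made up):

```json
{
  "type": "record",
  "name": "CrawledPage",
  "namespace": "example.ptd",
  "fields": [
    {"name": "url",         "type": "string"},
    {"name": "fetchedAt",   "type": "long"},
    {"name": "contentType", "type": "string"},
    {"name": "content",     "type": "bytes"}
  ]
}
```

Because the schema travels with each Avro data file, any Avro-aware reader – in Java, Python, or elsewhere – can decode the records without extra metadata, and Hadoop can split the file at record boundaries.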
March 16, 2010
Recently we’ve been running a number of large, multi-phase web mining applications in Amazon’s EC2 & Elastic MapReduce (EMR), and we needed a better way to maintain state than pushing sequence files back and forth between HDFS and S3. One option was to set up an HBase cluster, but then we’d be paying 24×7 for servers that we’d need for only a few minutes each day. We could also set more…
December 3, 2009
Back in November we helped put together a small gathering for web crawler developers at ApacheCon 2009. One of the key topics was how to share development efforts, versus each project independently implementing similar functionality. Out of this was born the crawler-commons project. As the main page says: The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. more…
December 2, 2009
Several people have pointed me to other public/non-profit projects doing large-scale public web crawls, so I thought I’d summarize the ones I now know about below. And if you know of others, please add your comments and I’ll update the list. Wayback Machine – A time-series snapshot of important web pages, from 1996 to present. 150B pages crawled in total as of 2009. The data is searchable, but not available more…
November 16, 2009
I’m going to be giving a talk at the Bay Area ACM data mining SIG in December, and I need to finalize my topic soon – like today 🙂 I was going to expand on my Elastic Web Mining talk (“Web mining for SEO keywords”) from the ACM data mining unconference a few weeks back. But the fact that I’ll have 10s to 100s of millions of web page data more…
November 11, 2009
The life of a webmaster is hard, and web crawlers make it harder. (Photo: http://www.flickr.com/photos/absolutely_loverly/ / CC BY 2.0) There’s the daily drama of keeping both web site users and web site developers happy. Now mix in the unpredictable side effects of having automated agents hitting the site, and you can see why webmasters might think many web crawlers are evil. But web crawlers serve a very real, important role more…
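One concrete courtesy that separates a well-behaved crawler from an “evil” one is honoring each site’s robots.txt before fetching anything. A minimal sketch using Python’s standard library (the robots.txt rules shown are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt a webmaster might publish.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler checks every URL before fetching it...
print(parser.can_fetch("mybot", "http://example.com/index.html"))  # True
print(parser.can_fetch("mybot", "http://example.com/private/x"))   # False

# ...and throttles itself to the requested delay between requests.
print(parser.crawl_delay("mybot"))  # 10
```

Respecting the disallow rules and the crawl delay (plus identifying yourself with an honest User-agent string) goes a long way toward keeping webmasters happy.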
November 4, 2009
Paul posted a nice summary of my elastic web mining talk over at his blog. He captured one of the key points I was trying to make when he said: It was impressive to see how much of the processing was generated by Bixo and Cascading and how only a small fraction of the code needed to be custom coded “by hand.” That’s a recurring theme when I show workflow more…