January 27, 2011
This coming Tuesday, Feb 1st, I’ll be helping at the “How to Develop Big Data Applications for Hadoop” tutorial. My specific sections will cover the “why” of using Amazon Web Services for Hadoop (hint – scaling, simplicity, savings) and the “how” – mostly the nuts and bolts of running Hadoop jobs using Elastic MapReduce. I’ll also be roaming the room during the hands-on section, helping out the attendees. I’m more…
June 18, 2010
Recently some customers have been asking for a more concrete description of how we handle “focused web crawling” at Bixo Labs. After answering the same questions a few times, it seemed like a good idea to post details to our web site – thus the new page titled Focused Crawling. The basic concepts are straightforward, and very similar to what we did at Krugle to efficiently find web pages that more…
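The core idea behind focused crawling – prioritizing the fetch frontier by an estimate of how relevant each page is to the topic – can be sketched in a few lines of Python. This is a generic illustration, not Bixo’s actual (Java/Cascading-based) implementation; all function names and the keyword-overlap scoring here are hypothetical:

```python
import heapq

def relevance(text, keywords):
    """Score a page by the fraction of target keywords it mentions."""
    words = set(text.lower().split())
    return sum(1 for k in keywords if k in words) / len(keywords)

def focused_crawl(seed_urls, fetch, extract_links, keywords, max_pages=10):
    """Fetch highest-priority pages first.

    fetch(url) returns the page text; extract_links(url) returns its out-links.
    heapq is a min-heap, so priorities are stored negated.
    """
    frontier = [(-1.0, url) for url in seed_urls]  # seeds get top priority
    heapq.heapify(frontier)
    seen, fetched = set(seed_urls), []
    while frontier and len(fetched) < max_pages:
        _, url = heapq.heappop(frontier)
        text = fetch(url)
        fetched.append(url)
        score = relevance(text, keywords)
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                # Simple heuristic: a link inherits its parent page's score,
                # so links from on-topic pages are crawled before off-topic ones.
                heapq.heappush(frontier, (-score, link))
    return fetched
```

Real focused crawlers refine the priority function (link anchor text, URL patterns, classifier scores), but the frontier-with-priorities structure is the common core.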
April 22, 2010
Last night I did a presentation at the April Hadoop Bay Area User Group meetup, hosted by Yahoo. There were 250+ people in attendance, so interest in Hadoop continues to grow. Dekel has posted the slides of my talk, as well as a (very quiet) video. My talk was on the status of the Public Terabyte Dataset (PTD) project, and advice on running jobs in Amazon’s Elastic MapReduce (EMR) cloud. As more…
April 21, 2010
We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format, instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we’re using Cascading for this project, we have also released a Cascading Avro Scheme to read and write Avro files. To help you get a jump start on leveraging this dataset, we more…
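One reason Avro is cross-language is that its schemas are plain JSON. As a sketch of what a crawled-page record could look like (this is an illustrative schema, not the actual PTD schema – all field names here are made up):

```json
{
  "type": "record",
  "name": "CrawledPage",
  "namespace": "example.ptd",
  "fields": [
    {"name": "url",         "type": "string"},
    {"name": "fetchedAt",   "type": "long"},
    {"name": "contentType", "type": "string"},
    {"name": "content",     "type": "bytes"}
  ]
}
```

Because the schema travels with each Avro data file, any Avro-aware reader – in Java, Python, or elsewhere – can decode the records without extra metadata, and Hadoop can split the file at record boundaries.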
March 16, 2010
Recently we’ve been running a number of large, multi-phase web mining applications in Amazon’s EC2 & Elastic MapReduce (EMR), and we needed a better way to maintain state than pushing sequence files back and forth between HDFS and S3. One option was to set up an HBase cluster, but then we’d be paying 24×7 for servers that we’d need for only a few minutes each day. We could also set more…
December 3, 2009
Back in November we helped put together a small gathering for web crawler developers at ApacheCon 2009. One of the key topics was how to share development efforts, versus each project independently implementing similar functionality. Out of this was born the crawler-commons project. As the main page says: The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. more…
December 2, 2009
Several people have pointed me to other public/non-profit projects doing large-scale public web crawls, so I thought I’d summarize the ones I now know about below. And if you know of others, please add your comments and I’ll update the list. Wayback Machine – A time-series snapshot of important web pages, from 1996 to present. 150B pages crawled in total as of 2009. The data is searchable, but not available more…
November 16, 2009
I’m going to be giving a talk at the Bay Area ACM data mining SIG in December, and I need to finalize my topic soon – like today 🙂 I was going to expand on my Elastic Web Mining talk (“Web mining for SEO keywords”) from the ACM data mining unconference a few weeks back. But the fact that I’ll have 10s to 100s of millions of web page data more…
November 11, 2009
The life of a webmaster is hard, and web crawlers make it harder. (Photo: http://www.flickr.com/photos/absolutely_loverly/ / CC BY 2.0) There’s the daily drama of keeping both web site users and web site developers happy. Now mix in the unpredictable side effects of having automated agents hitting the site, and you can see why webmasters might think many web crawlers are evil. But web crawlers serve a very real, important role more…
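One concrete courtesy that separates a well-behaved crawler from an “evil” one is honoring each site’s robots.txt before fetching anything. A minimal sketch using Python’s standard library (the robots.txt rules shown are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt a webmaster might publish.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler checks every URL before fetching it...
print(parser.can_fetch("mybot", "http://example.com/index.html"))  # True
print(parser.can_fetch("mybot", "http://example.com/private/x"))   # False

# ...and throttles itself to the requested delay between requests.
print(parser.crawl_delay("mybot"))  # 10
```

Respecting the disallow rules and the crawl delay (plus identifying yourself with an honest User-agent string) goes a long way toward keeping webmasters happy.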
November 4, 2009
Paul posted a nice summary of my elastic web mining talk over at his blog. He captured one of the key points I was trying to make when he said: It was impressive to see how much of the processing was generated by Bixo and Cascading and how only a small fraction of the code needed to be custom coded “by hand.” That’s a recurring theme when I show workflow more…