Large scale analytics using Hadoop and Solr

April 3, 2013

I finally got around to posting the slides from the talk I gave at last year’s Hadoop Summit. The presentation focused on how we used Hadoop & Solr to solve a big data analytics problem for one of our clients. They have a web site that helps advertisers target publishers/networks and improve ad results by analyzing millions of web pages every day. They were able to cut monthly more…

Cascading & GigaSpaces

September 11, 2012

We’ve just started a new project, which is to create a “planner” that lets you define & run complex workflows in GigaSpaces’ XAP environment, using the Cascading API. There are lots of interesting challenges, mostly around various impedance mismatches between the Cascading/Hadoop model of data storage and parallel map-reduce execution, versus the in-memory data grid and transactional support provided by GigaSpaces. Step one has been to create a Cascading Tap more…

Presentation at Hadoop Summit 2012

June 11, 2012

I’ll be speaking at the Hadoop Summit conference on Thursday (2:25pm), about how to replace Oracle (or MySQL, etc.) with Hadoop + Solr. The title is “Faster, cheaper, better – switching a web site from DB queries to Hadoop & Solr”. It’s a distillation of experience with clients, where we use Hadoop to do off-line pre-processing of data, which then lets us use Solr as a NoSQL solution that provides more…

Strata Web Mining Tutorial

February 28, 2012

I just finished up my web mining tutorial at Strata Santa Clara 2012. It was a three-hour overview of large-scale web mining. The real challenge was including a hands-on lab, in spite of poor wifi connectivity, corrupt files on USB sticks, no tables, and no local file server. In the end it worked, though not without a really late night on Monday and lots of help. The slides more…

A (very) short intro to Hadoop

December 19, 2011

And here are the slides from the short talk on Hadoop I gave at the BigDataCamp event held in Washington DC.

A (very) short history of big data

December 19, 2011

I finally got around to posting slides from the lightning talk I gave at the BigDataCamp event held in Washington, DC this past November.

Bay Area Hadoop User Group talk

September 3, 2011

Last week I gave a talk at the August HUG meetup on my current favorite topic – using search (or rather, Solr as a NoSQL solution) to improve big data analytics. It’s the same general theme I covered at the Basis Technology conference in June – Hadoop is often used to convert petabytes of data into pie charts, but without the ability to poke at the raw data, it’s often more…

Scale Unlimited/Cascading case study posted

September 2, 2011

We’re heavy users of the Cascading open source project, which lets us quickly build Hadoop-based workflows to solve custom data processing problems. Concurrent recently posted a Scale Unlimited Case Study that describes how we use Cascading, and the benefits to us (and thus to our customers). They also listed the various Cascading-related open source projects we sponsor, including the Solr scheme that makes it trivial to generate Solr search indexes more…

Talk on using search with big data analytics

July 8, 2011

A few weeks back I was at the Basis Technology Government Users Conference in Washington, DC. It was an interesting experience, meeting people from agencies responsible for processing lots of important data. One thing I noticed is that in the Bay Area, your name tag at an event tries to convey that you’re working on super-cool stuff. Here in DC, it’s more cool to be classified. For example, name tags more…

Cascading Avro Tap performance

March 18, 2011

Back in January, Matt Pouttu-Clarke posted his results from using the Cascading Avro tap we’d created a while back. The most interesting result was comparing performance between parsing CSV files and reading Avro files: 13.5x faster is a nice improvement over the very common practice of using text files for information exchange. Side note: we recently released the 1.0 version, and pushed it to the Conjars repository.