April 3, 2013
I finally got around to posting the slides from the talk I gave at last year’s Hadoop Summit. The presentation focused on how we used Hadoop & Solr to solve a big data analytics problem for one of our clients. They have a web site that helps advertisers target publishers/networks and improve ad results by analyzing millions of web pages every day. They were able to cut monthly more…
September 11, 2012
We’ve just started a new project, which is to create a “planner” that lets you define & run complex workflows in GigaSpaces’ XAP environment, using the Cascading API. There are lots of interesting challenges, mostly around various impedance mismatches between the Cascading/Hadoop model of data storage and parallel map-reduce execution, versus the in-memory data grid and transactional support provided by GigaSpaces. Step one has been to create a Cascading Tap more…
June 11, 2012
I’ll be speaking at the Hadoop Summit conference on Thursday (2:25pm), about how to replace Oracle (or MySQL, etc.) with Hadoop + Solr. The title is “Faster, cheaper, better – switching a web site from DB queries to Hadoop & Solr”. It’s a distillation of experience with clients, where we use Hadoop to do off-line pre-processing of data, which then lets us use Solr as a NoSQL solution that provides more…
February 28, 2012
I just finished up my web mining tutorial at Strata Santa Clara 2012. It was a three-hour overview of large-scale web mining. The real challenge was including a hands-on lab, in spite of poor wifi connectivity, corrupt files on USB sticks, no tables, and no local file server. In the end it worked, though not without a really late night on Monday and lots of help. The slides more…
December 19, 2011
And here are the slides from the short talk on Hadoop I gave at the BigDataCamp event held in Washington, DC: “A (very) short intro to Hadoop”.
December 19, 2011
I finally got around to posting slides from the lightning talk I gave at the BigDataCamp event held in Washington, DC this past November: “A (very) short history of big data”.
September 3, 2011
Last week I gave a talk at the August HUG meetup on my current favorite topic – using search (or rather, Solr as a NoSQL solution) to improve big data analytics. It’s the same general theme I covered at the Basis Technology conference in June – Hadoop is often used to convert petabytes of data into pie charts, but without the ability to poke at the raw data, it’s often more…
September 2, 2011
We’re heavy users of the Cascading open source project, which lets us quickly build Hadoop-based workflows to solve custom data processing problems. Concurrent recently posted a Scale Unlimited Case Study that describes how we use Cascading, and the benefits to us (and thus to our customers). They also listed the various Cascading-related open source projects we sponsor, including the Solr scheme that makes it trivial to generate Solr search indexes more…
July 8, 2011
A few weeks back I was at the Basis Technology Government Users Conference in Washington, DC. It was an interesting experience, meeting people from agencies responsible for processing lots of important data. One thing I noticed is that in the Bay Area, your name tag at an event tries to convey that you’re working on super-cool stuff. Here in DC, it’s cooler to be classified. For example, name tags more…
March 18, 2011
Back in January, Matt Pouttu-Clarke posted his results from using the Cascading Avro tap we’d created a while back. The most interesting result was comparing performance between parsing CSV files and reading Avro files: 13.5x faster is a nice improvement over the very common practice of using text files for information exchange. Side note: we recently released the 1.0 version, and pushed it to the Conjars repository.