Cascading & GigaSpaces

September 11, 2012

We’ve just started a new project, which is to create a “planner” that lets you define & run complex workflows in GigaSpaces’ XAP environment, using the Cascading API.

There are lots of interesting challenges, mostly around various impedance mismatches between the Cascading/Hadoop model of data storage and parallel map-reduce execution, and the in-memory data grid and transactional support provided by GigaSpaces.

Step one has been to create a Cascading Tap that lets a Hadoop-based workflow read from/write to a GigaSpaces “space”, which means one or more partitions in their data grid.
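To make the tap idea concrete, here’s a minimal sketch. Everything in it (SpaceProxy, InMemorySpace, GigaSpacesTap) is a hypothetical stand-in, not the real Cascading Tap or GigaSpaces client API; it just illustrates the contract a tap fulfills, with an in-memory list playing the role of a space partition.

```java
import java.util.*;

public class GigaSpacesTapSketch {

    // Hypothetical stand-in for the client-side proxy to a space partition.
    interface SpaceProxy {
        void write(Map<String, Object> entry);
        List<Map<String, Object>> readAll();
    }

    // In-memory fake partition, just enough to exercise the tap's contract.
    static class InMemorySpace implements SpaceProxy {
        private final List<Map<String, Object>> entries = new ArrayList<>();
        public void write(Map<String, Object> entry) { entries.add(new HashMap<>(entry)); }
        public List<Map<String, Object>> readAll() { return new ArrayList<>(entries); }
    }

    // In Cascading terms, a tap adapts a data store to a flow: it can act
    // as a sink (write tuples out) or a source (read tuples back in).
    static class GigaSpacesTap {
        private final SpaceProxy space;
        GigaSpacesTap(SpaceProxy space) { this.space = space; }

        // Sink side: each tuple in the flow becomes one entry in the space.
        void sink(Iterable<Map<String, Object>> tuples) {
            for (Map<String, Object> t : tuples) space.write(t);
        }

        // Source side: entries in the space come back into the flow as tuples.
        List<Map<String, Object>> source() {
            return space.readAll();
        }
    }

    public static void main(String[] args) {
        GigaSpacesTap tap = new GigaSpacesTap(new InMemorySpace());
        tap.sink(List.of(Map.of("user", "alice", "clicks", 3)));
        System.out.println(tap.source());
    }
}
```

In the real implementation the tap has to deal with partitioning and Hadoop’s RecordReader/OutputCollector plumbing, but the source/sink split above is the core of it.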

Step two is in progress, and that’s to support running real map-reduce workflows using GigaSpaces XAP.

If we’re successful, we’ll wind up with the ability to run the same workflow in Hadoop (extreme scalability, batch) and GigaSpaces (low latency, incremental) without any changes to the workflow definition.
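The pattern we’re after can be sketched as follows. The class names here (Workflow, BatchPlanner, IncrementalPlanner) are invented for illustration; in Cascading the backend-neutral object is the pipe assembly. The point is that the workflow definition knows nothing about its execution engine, so a batch engine and an event-at-a-time engine can run the same definition and agree on the result.

```java
import java.util.*;
import java.util.function.UnaryOperator;

public class PlannerSketch {

    // The workflow definition: an ordered list of tuple-level operations,
    // with no knowledge of which engine will execute it.
    static class Workflow {
        final List<UnaryOperator<Map<String, Object>>> ops = new ArrayList<>();
        Workflow each(UnaryOperator<Map<String, Object>> op) { ops.add(op); return this; }
    }

    interface Planner {
        List<Map<String, Object>> run(Workflow wf, List<Map<String, Object>> input);
    }

    // Hadoop-style stand-in: one batch pass over the whole data set.
    static class BatchPlanner implements Planner {
        public List<Map<String, Object>> run(Workflow wf, List<Map<String, Object>> input) {
            List<Map<String, Object>> out = new ArrayList<>();
            for (Map<String, Object> tuple : input) out.add(apply(wf, tuple));
            return out;
        }
    }

    // GigaSpaces-style stand-in: tuples arrive one at a time as events,
    // each processed incrementally as it shows up.
    static class IncrementalPlanner implements Planner {
        public List<Map<String, Object>> run(Workflow wf, List<Map<String, Object>> input) {
            List<Map<String, Object>> out = new ArrayList<>();
            for (Map<String, Object> event : input) out.add(apply(wf, event));
            return out;
        }
    }

    static Map<String, Object> apply(Workflow wf, Map<String, Object> tuple) {
        Map<String, Object> t = new HashMap<>(tuple);
        for (UnaryOperator<Map<String, Object>> op : wf.ops) t = op.apply(t);
        return t;
    }

    public static void main(String[] args) {
        Workflow wf = new Workflow().each(t -> {
            t.put("doubled", (Integer) t.get("n") * 2);
            return t;
        });
        List<Map<String, Object>> input = List.of(new HashMap<>(Map.of("n", 21)));
        // Same definition, two engines, same result.
        System.out.println(new BatchPlanner().run(wf, input)
            .equals(new IncrementalPlanner().run(wf, input))); // true
    }
}
```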

One Response to “Cascading & GigaSpaces”

  1. Traditionally, one of XAP’s primary use cases was large-scale event processing, more recently referred to as big data stream processing or real-time big data analytics. Some of our users are reliably and transactionally processing up to 200K events per second, in clusters as large as a few hundred nodes.