Open Source Projects

At Scale Unlimited, we use a lot of open source software, and we contribute back to the community through the following projects:


Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading pipe assembly, you can quickly create specialized web mining applications. We are the primary contributors to the project.


cascading.avro is a Cascading Scheme for the Apache Avro data serialization format. Using this scheme, you can easily use Avro files as both input and output formats for your Hadoop jobs.


cascading.solr is a Cascading Scheme for Solr. Using this scheme, you can easily generate Solr-compatible Lucene indexes from Hadoop jobs.


cascading.simpledb is a Cascading Tap & Scheme for Amazon’s SimpleDB.


cascading.utils is a set of utilities for Cascading workflows. For example, there are classes that wrap Cascading Tuples with “datum” objects, utility classes such as TupleLogger and SplitterAssembly, and classes to help monitor running workflows.


The Apache Tika toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. Ken Krugler is a committer for the Tika project.

One of the ways that we’ve contributed back is by integrating Boilerpipe into Tika, thus making it easy to extract “core text” from HTML pages.