Open Source Projects
Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading pipe assembly, you can quickly create specialized web mining applications. We are the primary contributors to the project.
cascading.simpledb is a Cascading Tap & Scheme for Amazon’s SimpleDB.
cascading.utils is a set of utilities for Cascading workflows. For example, it includes classes that wrap Cascading Tuples with “datum” objects, utility classes such as TupleLogger and SplitterAssembly, and classes that help monitor running workflows.
The Apache Tika toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. Ken Krugler is a committer for the Tika project.
One of the ways that we’ve contributed back is by integrating Boilerpipe into Tika, which makes it easy to extract “core text” from HTML pages.