Open Source Projects

At Scale Unlimited, we use a lot of open source software, and we contribute back to the community through the following projects:

flink-crawler

flink-crawler is an efficient, scalable, continuous web crawler built on top of Flink. It is designed to be used for focused web crawls, without any additional infrastructure requirements.

https://github.com/ScaleUnlimited/flink-crawler

Bixo

Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop. By building a customized Cascading pipe assembly, you can quickly create specialized web mining applications. We are the primary contributors to the project.

https://github.com/bixo/bixo

cascading.avro

cascading.avro is a Cascading Scheme for the Apache Avro data serialization format. Using this scheme, you can easily use Avro files as both input and output formats for your Hadoop jobs.

https://github.com/ScaleUnlimited/cascading.avro

cascading.solr

cascading.solr is a Cascading Scheme for Solr. Using this scheme, you can easily generate Solr-compatible Lucene indexes from Hadoop jobs.

https://github.com/ScaleUnlimited/cascading.solr

cascading.utils

cascading.utils is a set of utilities for Cascading workflows. For example, it includes classes that wrap Cascading Tuples with “datum” objects, utility classes such as TupleLogger and SplitterAssembly, and classes that help monitor running workflows.

https://github.com/ScaleUnlimited/cascading.utils

Tika

The Apache Tika toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. Ken Krugler is a committer for the Tika project.

One of the ways that we’ve contributed back is by integrating Boilerpipe into Tika, which makes it easy to extract “core text” from HTML pages.

http://svn.apache.org/repos/asf/tika/trunk