Please see our Training page for details about all of our courses.
In the last few months I've given two different talks about scalable fuzzy matching.
The first was at Strata in San Jose, titled Similarity at Scale. In that talk I focused mostly on techniques for doing fuzzy matching (or joins) between large data sets, primarily via Cascading workflows.
More recently I presented more...
At Scale Unlimited we participate in a number of open source projects. Many of these have been recently updated...
cascading.utils (2.6.0) - Updated to Hadoop 2.4 & Cascading 2.6. Fixed job naming issue. More flexible tuple logging.
bixo (0.9.2) - Updated to Hadoop 2.4 & Cascading 2.6. Fixed bug with extracted outlink data.
crawler-commons (0.6) - Many sitemap & robots.txt processing fixes and improvements.
Tika (1.9) - Fixes for external parsers, new formats, improved server functionality, and much more.