Fuzzy Matching at Scale

October 18, 2014

In the last few months I’ve given two different talks about scalable fuzzy matching.

The first was at Strata in San Jose, titled Similarity at Scale. In that talk I focused mostly on techniques for doing fuzzy matching (or joins) between large data sets, primarily via Cascading workflows.

More recently I presented at Cassandra Summit 2014, on Fuzzy Entity Matching. This was a different take on the same issue, where the focus was ad hoc queries to match one target against a large corpus. The approach I covered in depth was to use Solr queries to create a reduced set of candidates, after which you could apply typical “match distance” heuristics to re-score/re-rank the results.
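To make the two-stage idea concrete, here is a minimal sketch in Python. It is not the code from the talk: a small in-memory trigram index stands in for the Solr query that produces the reduced candidate set, and the standard library's `difflib.SequenceMatcher` stands in for whatever "match distance" heuristic you would really use. The names `CandidateIndex`, `candidates`, and `rerank` are made up for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


class CandidateIndex:
    """Hypothetical stand-in for a search index such as Solr.

    Stage 1: cheaply narrow a large corpus down to a few candidates
    by counting shared trigrams via an inverted index.
    """

    def __init__(self, records):
        self.records = list(records)
        self.postings = defaultdict(set)  # trigram -> ids of records containing it
        for doc_id, record in enumerate(self.records):
            for gram in ngrams(record):
                self.postings[gram].add(doc_id)

    def candidates(self, query, limit=10):
        counts = defaultdict(int)
        for gram in ngrams(query):
            for doc_id in self.postings.get(gram, ()):
                counts[doc_id] += 1
        # Most shared trigrams first; ties broken by insertion order.
        ranked = sorted(counts, key=lambda d: (-counts[d], d))
        return [self.records[d] for d in ranked[:limit]]


def rerank(query, candidates):
    """Stage 2: re-score the reduced set with a more expensive similarity measure."""
    scored = [
        (SequenceMatcher(None, query.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: -pair[0])


if __name__ == "__main__":
    corpus = ["Scale Unlimited", "Scala Unlimted Inc", "DataStax",
              "Apache Solr", "Cascading"]
    index = CandidateIndex(corpus)
    target = "Scale Unlimted"  # misspelled on purpose
    for score, name in rerank(target, index.candidates(target, limit=3)):
        print(f"{score:.3f}  {name}")
```

The point of the split is that the expensive comparison only runs against the handful of records the index returns, not the whole corpus. In a real system the first stage would be a Solr query and the second stage would use field-specific heuristics, so treat both halves here as placeholders.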

The video for this second talk is freely available (thanks, DataStax!) and you can watch me lead off with an “uhm” right here.
