ApacheCon Big Data 2016

May 28, 2016

Earlier this month I flew to Vancouver, a wonderful city I’d never had the chance to visit. My excuse was that I was giving a talk at this year’s ApacheCon Big Data conference, which took place in Vancouver from May 9th to 12th. Part of the fun of attending a conference like this is the chance to meet people I’d only interacted with via email. For example, Nick Burch is more…

Fuzzy matching at Scale

October 18, 2014

In the last few months I’ve given two different talks about scalable fuzzy matching. The first was at Strata in San Jose, titled Similarity at Scale. In that talk I focused mostly on techniques for doing fuzzy matching (or joins) between large data sets, primarily via Cascading workflows. More recently I presented at Cassandra Summit 2014, on Fuzzy Entity Matching. This was a different take on the same issue, where more…

Text feature selection for machine learning – part 2

July 21, 2013

In my previous blog post on text feature selection, I covered some of the key steps: extract the relevant text from the content, tokenize this text into discrete words, and normalize these words (case-folding, stemming, and a bit of filtering out “bad words”). In this blog post I’m going to talk about improving the quality of the terms. But first I wanted to respond to some questions from part 1, about more…
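
The steps recapped above (tokenize, case-fold, stem, filter) can be sketched in a few lines of Python. This is only an illustrative sketch, not the code from the original posts: the `STOP_WORDS` set, the `crude_stem` suffix stripper, and the function names are all assumptions made up for this example, and a real pipeline would likely use a proper stemmer (such as Porter) and a much larger filter list.

```python
import re

# Hypothetical "bad words" list; a real project would use a larger,
# domain-specific set.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def tokenize(text):
    """Split text into discrete word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Tokenize, case-fold, drop stop words, then stem what remains."""
    terms = []
    for token in tokenize(text):
        word = token.casefold()
        if word in STOP_WORDS:
            continue
        terms.append(crude_stem(word))
    return terms


print(extract_terms("The crawler indexed Crawling pages"))
```

Filtering happens before stemming here so that the stop-word list can hold plain, unstemmed words; swapping the order would require stemming the list as well.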

Text feature selection for machine learning – part 1

July 10, 2013

We do a lot of projects that require extracting text features from documents, for use with recommendation systems, clustering and classification. Often the “document” is an entity like a person, a company, or a web site. In these cases, the text for each document is the aggregation of all text associated with that entity – for example, it could be the text from all pages crawled for a given blog more…

The Durkheim Project goes live!

July 3, 2013

As of today, the Durkheim Project is live. This is a research project involving Patterns and Predictions, the Geisel School of Medicine at Dartmouth, the U.S. Department of Veterans Affairs (VA) and Facebook. See the Durkheim Project launch announcement for full details. The worthy goal of the Durkheim Project is to improve the medical community’s ability to predict suicides. The driving force was originally the military’s concern about increasing more…

Faster Tests with Cascading 2.0 Local Mode

October 22, 2012

For one of our clients, we’d developed a series of complex workflows using Cascading 1.2 that get run multiple times every week, using Amazon’s Elastic MapReduce. These 15 or so higher-level workflows get planned by Cascading into 80+ Hadoop jobs, which Cascading takes care of running for us. That part has been working well, and the end result (a set of Solr indexes) powers the Adbeat web site. But we’ve more…