May 28, 2016
Earlier this month I flew to Vancouver, a wonderful city I’d never had the chance to visit. My excuse was that I was giving a talk at this year’s ApacheCon Big Data conference, which took place in Vancouver from May 9th to 12th. Part of the fun of attending a conference like this is the chance to meet people I’d only interacted with via email. For example, Nick Burch is more…
October 18, 2014
In the last few months I’ve given two different talks about scalable fuzzy matching. The first was at Strata in San Jose, titled Similarity at Scale. In that talk I focused mostly on techniques for doing fuzzy matching (or joins) between large data sets, primarily via Cascading workflows. More recently I presented at Cassandra Summit 2014, on Fuzzy Entity Matching. This was a different take on the same issue, where more…
July 21, 2013
In my previous blog post on text feature selection, I’d covered some of the key steps: Extract the relevant text from the content. Tokenize this text into discrete words. Normalize these words (case-folding, stemming, and a bit of filtering out “bad words”). In this blog post I’m going to talk about improving the quality of the terms. But first I wanted to respond to some questions from part 1, about more…
July 10, 2013
We do a lot of projects that require extracting text features from documents, for use with recommendation systems, clustering and classification. Often the “document” is an entity like a person, a company, or a web site. In these cases, the text for each document is the aggregation of all text associated with that entity – for example, it could be the text from all pages crawled for a given blog more…
July 3, 2013
As of today, the Durkheim Project is live. This is a research project involving Patterns and Predictions, the Geisel School of Medicine at Dartmouth, the U.S. Department of Veterans Affairs (VA) and Facebook. See the Durkheim Project launch announcement for full details. The worthy goal of the Durkheim Project is to improve the medical community’s ability to predict suicides. The driving force was originally the military’s concern about increasing more…
October 22, 2012
For one of our clients, we’d developed a series of complex workflows using Cascading 1.2 that get run multiple times every week, using Amazon’s Elastic MapReduce. These 15 or so higher-level workflows get planned by Cascading into 80+ Hadoop jobs, which Cascading takes care of running for us. That part has been working well, and the end result (a set of Solr indexes) powers the Adbeat web site. But we’ve more…