Text feature selection for machine learning – part 2

July 21, 2013

In my previous blog post on text feature selection, I covered some of the key steps: extract the relevant text from the content, tokenize this text into discrete words, and normalize these words (case-folding, stemming), with a bit of filtering out of “bad words”. In this blog post I’m going to talk about improving the quality of the terms. But first I wanted to respond to some questions from part 1, about more…

Text feature selection for machine learning – part 1

July 10, 2013

We do a lot of projects that require extracting text features from documents, for use with recommendation systems, clustering and classification. Often the “document” is an entity like a person, a company, or a web site. In these cases, the text for each document is the aggregation of all text associated with that entity – for example, it could be the text from all pages crawled for a given blog more…