Open Source Data Mining Tools
Below is a report on the open source data mining tools session at the ACM data mining unconference this past Sunday (01 Nov 2009).
This only covers tools that the panelists had used, so it’s not a survey of the available tools. See Jeff Dalton’s blog post on Java Open Source NLP and Text Mining tools for an example of a more complete list of a closely related group of tools.
Paul O’Rorke talked about Weka, a collection of machine learning algorithms for data mining tasks. There were concerns about whether it’s still viable. One person said that pieces of it are still useful, e.g. for clustering and feature selection.
An attendee mentioned MOA, a framework for data stream mining. It includes tools for evaluation and a collection of machine learning algorithms. It’s related to the WEKA project and also written in Java, but scales to more demanding problems.
David Smith talked about R. It’s possible to quickly get results by using building blocks from other users. Often data is prepared before processing by R. On the back end are presentation tools; Sweave is a report-generation tool that works well with R. Lots of research is going on in out-of-memory modeling, to handle larger data sets, and also in parallel processing. bigmemory is a package for large models. Paul mentioned that R has a steep learning curve. David agreed that R is quirky, especially in terms of memory usage. See David’s blog post about the event.
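Out-of-memory modeling boils down to streaming the data and keeping only summary state in RAM, rather than loading the whole data set. As a minimal sketch of that idea (plain Python for illustration, not the actual bigmemory API):

```python
# Sketch: out-of-core aggregation. A generator stands in for a data
# set too big to hold in memory; only one chunk plus running totals
# are resident at any time.

def chunked_mean(values, chunk_size=1000):
    """Compute the mean of an iterable without materializing it."""
    total = 0.0
    count = 0
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            total += sum(chunk)
            count += len(chunk)
            chunk = []
    total += sum(chunk)  # leftover partial chunk
    count += len(chunk)
    return total / count if count else float("nan")

print(chunked_mean(float(i) for i in range(1_000_000)))
```

The same pattern extends to variances, histograms, and gradient-based model fitting, which is why it maps well onto larger-than-memory data sets.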
Attendee asked about comparing Matlab & R, with respect to viability in a production environment. He’d run into memory problems with Matlab. David said that it was similar, and recommended doing scoring outside of R. He estimates 3-6x more memory is required for R vs. C++.
Attendee said many people use R for prototyping and generating models, but production uses something else, e.g. NumPy and SciPy.
Paul mentioned that R provides a very compact representation of data mining tasks. (Ken – so it’s the APL of data mining?)
Nicolas Cebron talked about KNIME (pronounced “naim”), a modular data exploration platform. Started in 2004. knime.org has full details. He demonstrated the KNIME application, which has a nice GUI for working with data sets. The model can be output as PMML.
Attendee asked about long-term viability of KNIME. Nicolas said that it’s been around for 4 years, has a vibrant community, and there are commercial companies creating modules.
Ted Dunning talked about Mahout, an Apache open source project with the goal of scalable machine learning/data mining. Java is the main language; Hadoop & Lucene are foundation technologies. It currently has good clustering algorithms (e.g. k-means) and reasonably good classifiers (supervised learning). There’s also a recommendation framework called Taste. It’s a very young project. It has support for sparse matrix math, and might pool efforts with the Apache Commons Math project. Mahout is mature enough for some types of machine learning problems.
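For reference, k-means is the serial algorithm that Mahout parallelizes over Hadoop. A minimal sketch of Lloyd's algorithm (plain Python as an illustration of the technique, not Mahout's API):

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Minimal 1-D Lloyd's algorithm: assign each point to the
    nearest centroid, then move each centroid to the mean of its
    assigned points, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Keep an empty cluster's old centroid instead of dividing by zero.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious 1-D clusters, around 0 and around 10.
print([round(c, 3) for c in kmeans([0.0, 0.2, 0.4, 9.8, 10.0, 10.2], k=2)])
```

In the MapReduce formulation, the assignment step maps over points in parallel and the centroid update is a reduce, which is what lets this scale to Hadoop-sized data.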
Attendee asked about comparing Hadoop distributed file system (HDFS) and Sun distributed file system. Chris Wensel from Concurrent explained that HDFS is very specialized, optimized for streaming reads. Can’t do random updates to files. Scales to 1000s of servers. Very fault tolerant. Ted confirmed that it’s very reliable, with a humorous story about a cluster of the world’s worst servers.
Ken Krugler (your faithful scribe) talked about the HECB (Hadoop, EC2, Cascading, Bixo) stack for web mining. The focus is on the collection and initial processing/reduction of the data, not hard-core machine learning & data mining.