Paul O'Rorke summary of elastic web mining talk

November 4, 2009

Paul posted a nice summary of my elastic web mining talk over at his blog. He captured one of the key points I was trying to make when he said:

It was impressive to see how much of the processing was generated by Bixo and Cascading and how only a small fraction of the code needed to be custom coded “by hand.”

That’s a recurring theme when I show workflow graphs (dot files generated by Cascading) for the sample web mining applications I’ve created. The real work is in figuring out what needs to be done (defining the workflow), not the coding to create the workflow or the custom bits that need to be added.

[Figure: Web mining app workflow graph]

In the above graph, the purple ovals represent custom code, and of those six I could have cut out two by using existing Cascading operators with some regular expression juju. Add in the new Bixo utility operator for loading URLs into the workflow plus new Tika support for parsing mbox files, and you’re down to two custom operators – parsing the top-level “mailbox archives” page to find the monthly mailbox archives, and scoring the emails.
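To give a feel for how small such a custom step can be, here is a minimal sketch of the kind of logic the “find the monthly mailbox archives” operator might contain. It is plain Java, not the actual operator and not Cascading or Bixo API. The class name, the method name, and the assumed link shape (a six-digit year and month followed by ".mbox") are all hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MboxLinkSketch {
    // Assumption: archive links look like href="200910.mbox/thread",
    // i.e. six digits (year + month) followed by ".mbox".
    private static final Pattern MBOX_LINK =
        Pattern.compile("href=\"(\\d{6}\\.mbox)[^\"]*\"");

    // Returns the distinct monthly archive names in the order they first appear.
    static List<String> extractMonthlyArchives(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = MBOX_LINK.matcher(html);
        while (m.find()) {
            String name = m.group(1);
            if (!result.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample page content, not taken from a real archive.
        String html = "<a href=\"200910.mbox/thread\">Thread</a> "
            + "<a href=\"200910.mbox/date\">Date</a> "
            + "<a href=\"200909.mbox/thread\">Thread</a> "
            + "<a href=\"about.html\">About</a>";
        System.out.println(extractMonthlyArchives(html));
    }
}
```

In a real workflow this logic would sit inside a custom operator and emit one tuple per archive URL, with everything downstream handled by the pre-built pieces.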

The blue and yellow ovals are pre-defined Cascading & Bixo operators (respectively).

And while the complete workflow looks very complex, it was defined in about a page of Java code.
