Proposals for Big Data web mining talk

November 16, 2009

I’m going to be giving a talk at the Bay Area ACM data mining SIG in December, and I need to finalize my topic soon – like today 🙂

I was going to expand on my Elastic Web Mining talk (“Web mining for SEO keywords”) from the ACM data mining unconference a few weeks back.

But the fact that I’ll have tens to hundreds of millions of web pages’ worth of data to work with, from the public terabyte dataset crawl, makes me want to apply Mahout to the data.

I tossed out one idea on the Mahout list, looking for input:

  • I’d like to automatically generate a timeline of events.
  • I can extract potential dates from web pages, using simple patterns.
  • I can extract 2-to-4 word terms (skipping those that start or end with stop words) from pages that have extracted dates.
  • And then by the miracle of LDA (Latent Dirichlet Allocation), I get clusters of date+terms.
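The extraction steps above could be sketched in Python roughly like this – note the date pattern and the stop-word list are illustrative assumptions, not what the real pipeline would use:

```python
import re

# Illustrative "simple pattern" for dates (assumption): matches
# strings like "January 20, 2009". A real run would need more patterns.
DATE_RE = re.compile(
    r'\b(?:January|February|March|April|May|June|July|August|'
    r'September|October|November|December)\s+\d{1,2},\s+\d{4}\b')

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {'the', 'a', 'an', 'of', 'on', 'in', 'and', 'to', 'was'}

def extract_dates(text):
    """Return potential date strings found by the simple pattern."""
    return DATE_RE.findall(text)

def extract_terms(text, lo=2, hi=4):
    """Return 2-to-4 word terms, skipping any n-gram that starts
    or ends with a stop word."""
    words = re.findall(r'[a-z]+', text.lower())
    terms = []
    for n in range(lo, hi + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] not in STOP_WORDS and gram[-1] not in STOP_WORDS:
                terms.append(' '.join(gram))
    return terms
```

Running both over a page that mentions an event would give you (date, term) pairs to feed downstream.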
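For the LDA step itself, here’s a toy sketch using scikit-learn as a stand-in for Mahout’s implementation (the four-document corpus, vectorizer, and topic count are all made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Each "document" is an extracted date plus the page's extracted terms
# (toy data; the real input would come from the crawl).
docs = [
    "2009-01-20 barack inauguration oath of office",
    "2009-01-20 inauguration crowd national mall",
    "2008-11-04 election night barack obama wins",
    "2008-11-04 election results electoral college",
]

vec = CountVectorizer()
counts = vec.fit_transform(docs)

# Two topics for a two-event toy corpus; real data would need many more.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # rows: docs, cols: topic weights
```

Each row of `doc_topics` gives a document’s mixture over topics, and `lda.components_` gives the term weights per topic – the “clusters of date+terms.”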

But in this example, I don’t actually need LDA – I have my “topic”, which is the date. So it might not be a very good example. And will LDA scale to 100M web pages (which implies many billions of terms)? And how will I handle the same term (e.g. “barack inauguration”) being associated with a cluster of dates, since stories from a range of dates before/after the event will contain that same term?

So it could be a non-starter – I’m hoping for input on feasibility, level of effort, or if somebody else has a suggestion for something simple that could provide interesting/obvious results, I’m all ears.

Thanks!

— Ken

PS – my current fall-back is to just do brute-force map-reduce to come up with lists of terms per unique date, pick the top N, and maybe filter out terms that are associated with too many unique dates. Which unfortunately wouldn’t use Mahout, but would be an example of crunching lots of data.
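That fall-back could be sketched as plain Python over (date, term) pairs – the map-reduce framing is collapsed into in-memory dicts here, and the top-N and date-spread thresholds are made-up knobs:

```python
from collections import Counter, defaultdict

def terms_per_date(pairs, top_n=3, max_dates_per_term=2):
    """Brute-force fall-back: count terms per date, drop terms that
    occur on too many distinct dates, keep the top N per date."""
    # "Map" phase: group term counts under each date,
    # and track which dates each term appears on.
    counts = defaultdict(Counter)
    term_dates = defaultdict(set)
    for date, term in pairs:
        counts[date][term] += 1
        term_dates[term].add(date)
    # Filter out terms spread across too many unique dates.
    noisy = {t for t, ds in term_dates.items()
             if len(ds) > max_dates_per_term}
    # "Reduce" phase: top-N surviving terms per date.
    return {date: [t for t, _ in c.most_common() if t not in noisy][:top_n]
            for date, c in counts.items()}
```

A term like “click here” that shows up across many dates gets filtered, while an event term concentrated on one or two dates survives.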
