What is Web Mining?

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services.

There are three general classes of information that can be discovered by web mining:

  • Web activity, from server logs and Web browser activity tracking.
  • Web graph, from links between pages, people and other data.
  • Web content, from the data found on Web pages and inside documents.

At Scale Unlimited we focus on the last one – extracting value from web pages and other documents found on the web.

Note that there’s no explicit reference to “search” in the description above. While search is by far the biggest web miner and generates the most revenue, there are many other valuable end uses for web mining results. A partial list includes:

  • Business intelligence
  • Competitive intelligence
  • Pricing analysis
  • Event extraction
  • Product data
  • Popularity tracking
  • Reputation monitoring

Four Steps in Content Web Mining

When extracting Web content with web mining, there are four typical steps (sketched in code after the list below).

  1. Collect – fetch the content from the Web
  2. Parse – extract usable data from formatted documents (HTML, PDF, etc.)
  3. Analyze – tokenize, rate, classify, cluster, filter, sort, etc.
  4. Produce – turn the results of analysis into something useful (report, search index, etc.)
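
To make these steps concrete, here is a minimal sketch of the pipeline in Python. It is illustrative only: it assumes the third-party requests and beautifulsoup4 packages, the URL is a placeholder, and the “analysis” is just a trivial word count standing in for real analysis.

    # A minimal four-step pipeline sketch. Assumes the third-party `requests`
    # and `beautifulsoup4` packages; the URL and the trivial word-count
    # "analysis" are placeholders for illustration only.
    from collections import Counter

    import requests
    from bs4 import BeautifulSoup

    def collect(url):
        """Step 1: Collect - fetch the content from the Web."""
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.text

    def parse(html):
        """Step 2: Parse - extract usable text from HTML markup."""
        return BeautifulSoup(html, "html.parser").get_text(separator=" ")

    def analyze(text):
        """Step 3: Analyze - tokenize and count (a stand-in for real analysis)."""
        return Counter(token.lower() for token in text.split() if token.isalpha())

    def produce(counts, top_n=10):
        """Step 4: Produce - turn the analysis into a simple report."""
        return "\n".join(f"{word}\t{n}" for word, n in counts.most_common(top_n))

    if __name__ == "__main__":
        print(produce(analyze(parse(collect("https://example.com/")))))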

Web Mining versus Data Mining

When comparing web mining with traditional data mining, there are three main differences to consider:

  1. Scale – In traditional data mining, processing 1 million records from a database would be a large job. In web mining, even 10 million pages wouldn’t be a big number.
  2. Access – When doing data mining of corporate information, the data is private and often requires access rights to read. For web mining, the data is public and rarely requires access rights. But web mining carries an additional constraint: an implicit agreement with webmasters about automated (non-user) access to their data. A webmaster allows crawlers to access the useful data on the website; in return, the crawler (a) promises not to overload the site, and (b) can drive more traffic to the website once the search index is published. With web mining there often is no such index, so the crawler has to be extra careful and polite during the crawl, to avoid causing problems for the webmaster (see the sketch after this list).
  3. Structure – A traditional data mining task gets its information from a database, which provides some level of explicit structure. A typical web mining task processes unstructured or semi-structured data from web pages. Even when the underlying information for a web page comes from a database, that structure is often obscured by HTML markup (see the example at the end of this section).
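
To show what “extra careful and polite” can mean in practice, here is a minimal sketch of a well-behaved fetcher using only Python’s standard library. The user-agent name and the 15-second fallback delay are illustrative assumptions, not prescriptions.

    # A sketch of polite fetching: honor robots.txt and pause between
    # requests. Standard library only; the user-agent name and the
    # 15-second fallback delay are illustrative assumptions.
    import time
    import urllib.request
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "example-miner/1.0"  # hypothetical crawler name
    FALLBACK_DELAY = 15.0             # seconds, when robots.txt gives no Crawl-delay

    _robots_cache = {}

    def polite_fetch(url):
        parts = urlparse(url)
        site = f"{parts.scheme}://{parts.netloc}"
        # Fetch and cache each site's robots.txt once.
        if site not in _robots_cache:
            parser = RobotFileParser(site + "/robots.txt")
            parser.read()
            _robots_cache[site] = parser
        parser = _robots_cache[site]
        # Skip URLs the webmaster has disallowed.
        if not parser.can_fetch(USER_AGENT, url):
            return None
        # Wait before each request; a real crawler would track the last
        # access time per site instead of sleeping unconditionally.
        time.sleep(parser.crawl_delay(USER_AGENT) or FALLBACK_DELAY)
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()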

Note that by “traditional” data mining we mean the type of analysis supported by most vendor tools, which assumes you’re processing table-oriented data that typically comes from a database.
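
As a concrete illustration of the structure difference, the snippet below recovers a database-like record from HTML markup. The page layout, CSS classes, and fields are made up for this example; every real site needs its own extraction rules.

    # Illustrating the structure difference: the same product record as an
    # explicit database row and buried in hypothetical HTML markup. Assumes
    # the `beautifulsoup4` package; the CSS classes are invented.
    from bs4 import BeautifulSoup

    # Traditional data mining: the record arrives with explicit structure.
    db_row = {"name": "Widget", "price": 19.99}

    # Web mining: the same information, obscured by presentation markup.
    html = """
    <div class="product">
      <h2 class="title">Widget</h2>
      <span class="price">$19.99</span>
    </div>
    """

    soup = BeautifulSoup(html, "html.parser")
    record = {
        "name": soup.select_one("div.product h2.title").get_text(strip=True),
        "price": float(soup.select_one("span.price").get_text(strip=True).lstrip("$")),
    }
    assert record == db_row  # same data, but the structure had to be reverse-engineered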