Strata 2012 – Large scale web mining

Welcome to our Strata 2012 tutorial on “Large Scale Web Mining”.

Below you will find all of the steps needed to ensure that you are properly prepared for the tutorial lab.

Assumptions for Lab

  • You are able to use command line tools such as ant to build Java code
  • You are at least familiar with Java, and general programming
  • You are bringing a laptop suitable for software development
  • This laptop has been pre-configured, using the steps below
  • If your laptop runs Windows, you have installed & configured Cygwin (see instructions below)

Windows Users

IF YOU USE WINDOWS, you will first need to have a valid install of Cygwin, and then use it for all of the command-line steps described below.

Also please pay special attention to the CYGWIN notes that discuss issues you’ll need to be aware of during installation.

Finally, make sure you use Cygwin’s “tar” command to expand all compressed files mentioned below, rather than a Windows desktop application. The command from the Cygwin terminal window is:

% tar -xvzf <archived file>

See http://voxforge.org/home/docs/cygwin-cheat-sheet for a good, concise summary of cygwin setup and commands.
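
Before expanding an archive, it can be useful to see which top-level directory it will create. The following is a minimal sketch, assuming a gzip-compressed tar file; the helper name top_level_dir is our own and not part of tar or Cygwin.

```shell
# Hypothetical helper: print the top-level directory name(s) inside a .tar.gz/.tgz file.
top_level_dir() {
    # -t lists the archive contents without extracting, -z handles gzip, -f names the file.
    # Strip any leading "./", keep the first path component, and remove duplicates.
    tar -tzf "$1" | sed 's|^\./||' | cut -d/ -f1 | sort -u
}

# Example usage (uncomment once you have downloaded an archive):
# top_level_dir hadoop-0.20.2.tar.gz
```

If the helper prints a single name, that is the directory you should wind up with after running the tar command above.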

Java Configuration

You need to ensure that the JAVA_HOME shell environment variable is set properly, and points to either Java 1.5 or Java 1.6. You can check which version of Java is on your path with:

% java -version

The output should look something like:

java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-10M3326)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02-334, mixed mode)

If you need to set up JAVA_HOME, you can do so via this command:

% export JAVA_HOME=<path to java home>

On Mac OS X 10.5 or later you should use:

% export JAVA_HOME=`/usr/libexec/java_home`
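
Note that java -version reports on the java found via your PATH, which is not necessarily the one JAVA_HOME points to. As a rough sketch, you could check the variable itself like this; the function name check_java_home is hypothetical, and the check only looks for an executable at bin/java.

```shell
# Hypothetical helper: report whether a directory looks like a usable JAVA_HOME.
check_java_home() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    # A Java home is expected to contain an executable bin/java.
    if [ -x "$dir/bin/java" ]; then
        echo "JAVA_HOME looks valid: $dir"
        return 0
    fi
    echo "No java executable found under $dir/bin"
    return 1
}

# Check the current shell's setting; "|| true" keeps a failed check from aborting a script.
check_java_home "${JAVA_HOME:-}" || true
```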

Hadoop Installation

You need to have Hadoop 0.20.2 installed. Note that if you have other versions of Hadoop installed, they will not work properly when using the lab to talk to the external Hadoop cluster, so you’ll need to add an installation of 0.20.2.

First, download and expand the hadoop-0.20.2.tar.gz file from an Apache mirror. To find a mirror site, go to http://www.apache.org/dyn/closer.cgi/hadoop/common/

For example, http://www.ecoficial.com/am/hadoop/common/hadoop-0.20.2/hadoop-0.20.2.tar.gz

You should wind up with a directory called “hadoop-0.20.2”.

Next, move the resulting hadoop-0.20.2 directory to a suitable location.

Finally, set up Hadoop shell environment variables via these commands:

% export HADOOP_HOME=/<path to>/hadoop-0.20.2
% export PATH=$HADOOP_HOME/bin:$PATH

Note that the above commands can and probably should be added to your shell startup file (e.g. .bash_profile or .bashrc for the bash shell), so that the variables are always set properly.

CYGWIN NOTE: The path you use *must* be a Cygwin path, not a Windows path. For example, /cygwin/…, not C:\…
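
One way to add the export commands to a startup file without duplicating them on repeated runs is sketched below. The helper name add_line_once is hypothetical, and the usage lines are commented out because you must substitute your own install location first.

```shell
# Hypothetical helper: append a line to a file only if that exact line is not already present.
add_line_once() {
    file="$1"
    line="$2"
    touch "$file"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example usage (uncomment after replacing <your path> with a real location):
# add_line_once "$HOME/.bash_profile" 'export HADOOP_HOME=<your path>/hadoop-0.20.2'
# add_line_once "$HOME/.bash_profile" 'export PATH=$HADOOP_HOME/bin:$PATH'
```

The single quotes matter: they keep $HADOOP_HOME and $PATH from being expanded when the line is written, so they are expanded each time the startup file runs instead.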

Hadoop Validation

To verify that you have installed Hadoop correctly, execute this command:

% hadoop version

The output should look something like this:

Hadoop 0.20.2
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707
Compiled by chrisdo on Fri Feb 19 08:07:34 UTC 2010

You can also verify that hadoop is running from the correct directory with:

% which hadoop

The output should be the path into your Hadoop installation’s bin directory, e.g.:

/Users/kenkrugler/Tools/hadoop/bin/hadoop
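
If you have more than one Hadoop version installed, the comparison between the which output and HADOOP_HOME can be automated. This is only a sketch: the helper name is_under_home is our own, and it assumes the hadoop script lives directly in the bin directory of the installation.

```shell
# Hypothetical helper: succeed if the executable path ($1) sits directly in <home>/bin ($2).
is_under_home() {
    exe="$1"
    home="${2%/}"    # ignore a single trailing slash on the home directory
    [ -n "$exe" ] && [ -n "$home" ] && [ "$(dirname "$exe")" = "$home/bin" ]
}

# "command -v" prints the path the shell would run, much like "which".
if is_under_home "$(command -v hadoop 2>/dev/null)" "${HADOOP_HOME:-}"; then
    echo "hadoop resolves to HADOOP_HOME/bin"
else
    echo "hadoop is missing or resolves outside HADOOP_HOME/bin"
fi
```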

Finally, you can run one of the built-in example jobs that come with Hadoop, e.g.:

% hadoop jar $HADOOP_HOME/hadoop-0.20.2-examples.jar pi 1 1

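This will run the Pi estimator job (1 map, with 1 sample per map), and output text that looks like the following. With a single sample the estimate of Pi is very rough; what matters is that the job runs to completion: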

Number of Maps = 1
Samples per Map = 1
Wrote input for Map #0
Starting Job
11/05/19 13:50:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/05/19 13:50:17 INFO mapred.FileInputFormat: Total input paths to process : 1
...
Job Finished in 2.852 seconds
Estimated value of Pi is 4.00000000000000000000

Ant Installation

You need to have the Ant build tool installed.

First, see if you already have a recent version of Ant installed, by executing:

% ant -version

If the output reports version 1.8 or later, you’re all set. For example:

Apache Ant(TM) version 1.8.2 compiled on February 28 2011

If you need to install Ant, download and expand the apache-ant-1.8.2-bin.tar.gz file from an Apache mirror. To get an appropriate download link, go to http://ant.apache.org/bindownload.cgi.

On this page you’ll find a link to apache-ant-1.8.2-bin.tar.gz. For example, http://www.trieuvan.com/apache/ant/binaries/apache-ant-1.8.2-bin.tar.gz

You should wind up with a directory called “apache-ant-1.8.2”.

Next, move the resulting apache-ant-1.8.2 directory to a suitable location.

Next, either add a symlink from /usr/bin/ant to the bin/ant file in this directory, or add ant to your PATH shell environment variable via these commands:

% export ANT_HOME=/<path to>/apache-ant-1.8.2
% export PATH=$ANT_HOME/bin:$PATH

Finally, (re)verify that ant is installed correctly via:

% ant -version
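
The "1.8 or later" requirement can also be checked mechanically. The sketch below parses a version line in the format shown earlier; the helper name ant_version_ok is hypothetical, and the parsing assumes the word "version" is followed by a major.minor number.

```shell
# Hypothetical helper: succeed if an "ant -version" output line reports 1.8 or later.
ant_version_ok() {
    # Extract the major and minor numbers that follow the word "version".
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version \([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1 \2/p')
    [ -n "$ver" ] || return 1
    # Split "major minor" into $1 and $2 (intentionally unquoted).
    set -- $ver
    [ "$1" -gt 1 ] || { [ "$1" -eq 1 ] && [ "$2" -ge 8 ]; }
}

if ant_version_ok "$(ant -version 2>/dev/null)"; then
    echo "Ant is recent enough"
else
    echo "Ant is missing or older than 1.8"
fi
```

The comparison is numeric rather than textual, so a minor version with two digits is handled correctly.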

Jar Installation

You need to have a large set of 3rd party jar files installed for the lab. These are used by the lab code to handle HTML fetching, parsing, and other related tasks.

Download and expand the strata-web-mining-lib.tgz file from the ScaleUnlimited web site, at https://scaleunlimited.com/downloads/strata2012/strata-web-mining-lib.tgz

You should wind up with a directory called “strata-web-mining-lib”.
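
As a quick sanity check that the archive expanded properly, you can count the jar files it produced. This is a sketch under the assumption that the jars sit somewhere beneath the expanded directory; the helper name count_jars is our own, and no particular count is implied.

```shell
# Hypothetical helper: count the .jar files anywhere beneath a directory.
count_jars() {
    # Errors (e.g. a missing directory) are suppressed, which yields a count of 0.
    find "$1" -type f -name '*.jar' 2>/dev/null | wc -l | tr -d ' '
}

# Run from the directory where you expanded the archive.
echo "jar files found: $(count_jars strata-web-mining-lib)"
```

A result of 0 suggests the archive was expanded somewhere else or the download was incomplete.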

Eclipse on Windows – CYGWIN NOTES

If you are running Eclipse on Windows, please read the following tutorial on installing Cygwin and setting the path properly:

Note that you can ignore everything else (Setup SSH daemon, Download Hadoop, etc.), as that assumes you’re running Hadoop 0.19.1 in pseudo-distributed mode, which we definitely will NOT be doing during this class.