First Sample of Public Terabyte Dataset

April 21, 2010

We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format, instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we’re using Cascading for this project, we have also released a Cascading Avro Scheme to read and write Avro files.

To help you get started with this dataset, we have posted a small sample of it in S3, in the bixolabs-ptd-demo bucket, along with the Avro JSON schema needed to access the file. For those unfamiliar with working with Avro files, here’s a sample snippet that illustrates one way of reading them:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;

Schema schema = Schema.parse(jsonSchemaFile);
DataFileReader<GenericData.Record> reader = new DataFileReader<GenericData.Record>(avroFile, new GenericDatumReader<GenericData.Record>(schema));
while (reader.hasNext()) {
    GenericData.Record obj = reader.next();
    // You can access the fields in this object like this...
    System.out.println(obj.get("AvroDatum-url"));
}
reader.close();
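
If you’re using Cascading, the cascading.avro Scheme mentioned above lets you read the same files through a tap instead of iterating by hand. Here’s a minimal sketch of the idea – the AvroScheme constructor, its import, and the second field name are assumptions on my part, so check the cascading.avro release for the exact API:

import cascading.scheme.Scheme;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;
// ...plus the AvroScheme import from the cascading.avro jar.

// Field names and Java types are illustrative -- take them from the published JSON schema.
Fields fields = new Fields("AvroDatum-url", "AvroDatum-content");
Class[] types = new Class[] { String.class, String.class };

// Assumed constructor: AvroScheme(Fields schemeFields, Class[] schemeTypes).
Scheme avroScheme = new AvroScheme(fields, types);

// Point a source tap at the sample file (local, HDFS, or S3 path).
Tap sourceTap = new Hfs(avroScheme, "path/to/ptd-sample.avro");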

Please take a look, and let us know if there’s any missing raw content that you’d want. We’ve intentionally avoided doing post-processing of the results – this is source data for exactly that type of activity.

SimpleDB Tap for Cascading

March 16, 2010

Recently we’ve been running a number of large, multi-phase web mining applications in Amazon’s EC2 & Elastic MapReduce (EMR), and we needed a better way to maintain state than pushing sequence files back and forth between HDFS and S3.

One option was to set up an HBase cluster, but then we’d be paying 24×7 for servers that we’d only need for a few minutes each day. We could also set up MySQL with persistent storage on an Amazon EBS volume, but then we’d have to configure & launch MySQL on our cluster master for each mining “loop”, and for really big jobs it wouldn’t scale well to 100M+ items.

So we spent some time creating a Cascading tap & scheme that lets us use Amazon’s SimpleDB to maintain the state of web mining & crawling jobs. It’s still pretty rough, but usable. The code is publicly available at GitHub – check out http://github.com/ScaleUnlimited/cascading.simpledb.

There’s also a README to help you get started, which I’ve copied below since it contains useful information about the project.

As to the big question on performance – not sure yet how well it handles a SimpleDB with 100M+ entries, but we’re heading there fast on one project, so more details to follow. Enjoy…

README

Introduction

cascading.simpledb is a Cascading Tap & Scheme for Amazon’s SimpleDB.

This means you can use SimpleDB as the source of tuples for a Cascading
flow, and as a sink for saving results. This is particularly useful when
you need a scalable, persistent store for Cascading jobs being run in
Amazon’s EC2/EMR cloud environment.

Information about SimpleDB is available from http://aws.amazon.com/simpledb/
and also http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/

Note that you will need to be signed up to use both AWS and SimpleDB, and
have valid AWS access key and secret key values before using this code. See
http://docs.amazonwebservices.com/AmazonSimpleDB/2009-04-15/GettingStartedGuide/GettingSetUp.html

Design

In order to get acceptable performance, the cascading.simpledb scheme splits
each virtual “table” of data in SimpleDB across multiple shards. A shard
corresponds to what SimpleDB calls a domain. This allows most requests to
be run in parallel across multiple mappers, without having to worry about
duplicate records being returned for the same request.

Each record (Cascading tuple, or SimpleDB item) has an implicit field called
SimpleDBUtils-itemHash. This is a zero-padded hash of the record’s key (its
SimpleDB item name). The key is another SimpleDB concept – every record has a
unique key, which is used to read the item directly.

Records (items) are split between shards using partitions of this hash value. This
implies that once a table has been created and populated with items, there is no
easy way to change the number of shards; you essentially have to build a new
table and copy all of the values.
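
To make the partitioning idea concrete, here is a conceptual sketch of how an
item key could be mapped to a zero-padded hash and then to a shard. This is an
illustration only – the actual hash function, padding width, and split logic
used by cascading.simpledb may differ:

  // Hypothetical sketch: derive a fixed-width, zero-padded hash from the item key, then
  // assign the item to one of numShards contiguous ranges of the hash space.
  public static String itemHash(String itemKey) {
      long hash = itemKey.hashCode() & 0x7FFFFFFFL;   // mask to a non-negative value
      return String.format("%010d", hash);            // zero-pad so string order matches numeric order
  }

  public static int shardIndex(String paddedHash, int numShards) {
      long hash = Long.parseLong(paddedHash);
      long hashSpace = 0x80000000L;                   // size of the masked hash space (2^31)
      return (int) (hash * numShards / hashSpace);    // 0 .. numShards-1
  }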

The implicit itemHash field could also be used to parallelize search requests within
a single shard, by further partitioning. This performance boost is not yet
implemented by cascading.simpledb, however.

Example

  // Specify fields in SimpleDB item, and which field is to be used as the key.
  Fields itemFields = new Fields("name", "birthday", "occupation", "status");
  SimpleDBScheme sourceScheme = new SimpleDBScheme(itemFields, new Fields("name"));
  // Load people that haven't yet been contacted, up to 1000
  String query = "`status` = \"NOT_CONTACTED\"";
  sourceScheme.setQuery(query);
  sourceScheme.setSelectLimit(1000);
  int numShards = 20;
  String tableName = "people";
  Tap sourceTap = new SimpleDBTap(sourceScheme, accessKey, secretKey, tableName, numShards);
  Pipe processingPipe = new Each("generate email pipe", new SendEmailOperation());
  // Use the same scheme as the source - query & limit are ignored
  Tap sinkTap = new SimpleDBTap(sourceScheme, accessKey, secretKey, tableName, numShards);
  Flow flow = new FlowConnector().connect(sourceTap, sinkTap, processingPipe);
  flow.complete();

Limitations

All limitations of the underlying SimpleDB system obviously apply. That means
things like the maximum number of shards (100), the maximum size of any one
item (1MB), the maximum size of any field value (1024 bytes) and so on. See details
at http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/SDBLimits.html

Given the 1K max field value length, SimpleDB is most useful for storing small
chunks of data, or references to bigger blobs that can be saved in S3.

In addition, all values are stored as strings. This means all fields must be
round-trippable as text.

Finally, SimpleDB can guarantee consistency when reading back the results of a
write (or when running queries against them), but this imposes a significant
performance penalty, so the tap uses the default “eventually consistent” mode.
This typically isn’t a problem, given the batch-oriented nature of most
Cascading workflows, but it could be an issue if you had multiple jobs writing
to and reading from the same table.
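
For reference, here is roughly what a consistent read looks like if you talk
to SimpleDB directly with the AWS SDK for Java (this is independent of the
tap, which currently always uses eventually consistent reads; the domain name
and query below are made up):

  import com.amazonaws.auth.BasicAWSCredentials;
  import com.amazonaws.services.simpledb.AmazonSimpleDB;
  import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
  import com.amazonaws.services.simpledb.model.SelectRequest;
  import com.amazonaws.services.simpledb.model.SelectResult;

  AmazonSimpleDB sdb = new AmazonSimpleDBClient(new BasicAWSCredentials(accessKey, secretKey));

  // consistentRead = true always reflects the latest successful write, at a cost in latency
  // and throughput; false (the default) is the eventually consistent mode the tap relies on.
  SelectRequest request = new SelectRequest("select * from `people_0` where `status` = 'NOT_CONTACTED'")
      .withConsistentRead(true);
  SelectResult result = sdb.select(request);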

Known Issues

Currently you need to correctly specify the number of shards for a table when
you define the tap. This is error prone, and only necessary when creating (or
re-creating) the table from scratch.

Tuple values that are null will not be updated in the table, which means you
can’t delete values, only add or update them.

Some operations are not multi-threaded, and thus take longer than they should.
For example, calculating the splits for a read will make a series of requests
to SimpleDB to get the item counts.

Numeric fields should automatically be stored as zero-padded strings to ensure
proper sort behavior, but currently this is only done for the implicit hash field.
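
Until that is implemented, you can zero-pad numeric values yourself before
writing them, so that SimpleDB’s lexicographic comparisons match numeric
order. For example (the value and padding width are arbitrary):

  // Pad a non-negative value to a fixed width so lexicographic order matches numeric order.
  long fetchCount = 42;
  String padded = String.format("%010d", fetchCount);   // "0000000042"

  // Recover the numeric value when reading the tuple back.
  long restored = Long.parseLong(padded);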

Building

You need Apache Ant 1.7 or higher, and a git client.

1. Download source from GitHub

% git clone git://github.com/ScaleUnlimited/cascading.simpledb.git
% cd cascading.simpledb

2. Set appropriate credentials for testing

% edit src/test/resources/aws-keys.txt

Enter valid AWS access key and secret key values for the two corresponding properties.

3. Build the jar

% ant clean jar

or to build and install the jar in your local Maven repo:

% ant clean install

4. Create Eclipse project files

% ant eclipse

Then, from Eclipse follow the standard procedure to import an existing Java project into your Workspace.

Crawler-commons project gets started

December 3, 2009

Back in November we helped put together a small gathering for web crawler developers at ApacheCon 2009. One of the key topics was how to share development efforts, versus each project independently implementing similar functionality.

Out of this was born the crawler-commons project. As the main page says:

The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components would benefit from collaboration among various existing web crawler projects, and reduce duplication of effort.

There’s a long list of functionality that is identical, or nearly so, between the various projects. The project wiki has a more detailed write-up from the ApacheCon meeting, but a short list includes:

  • robots.txt parsing
  • URL normalization
  • URL filtering
  • Domain name manipulation
  • HTML page cleaning
  • HttpClient configuration
  • Text similarity

It’s still early, but some initial code has been submitted to the Google Code SVN repository. And anybody with an interest in the area of Java web crawlers should use this feed to track project updates.

Public web crawler projects

December 2, 2009

Several people have pointed me to other public/non-profit projects doing large-scale public web crawls, so I thought I’d summarize the ones I now know about below. And if you know of others, please add your comments and I’ll update the list.

  • Wayback Machine – A time-series snapshot of important web pages, from 1996 to present. 150B pages crawled in total as of 2009. The data is searchable, but not available in raw format AFAIK. The work is part of the Internet Archive organization, and uses Heritrix for crawling.
  • CDL Web Archiving Service – The California Digital Library provides the Web Archiving Service to enable librarians and scholars to create archives of captured web sites and publications. Similar to the Wayback Machine, they use Heritrix and other software from the Internet Archive, and the results are searchable but not available in raw format.
  • CommonCrawl – Their goal is to build, maintain and make widely available a comprehensive crawl of the Internet. They use Nutch (useragent is ccBot). I’ve seen Ahad Rana post to the Nutch list. So far I haven’t seen any actual search or raw data results from this project. They do have a cool public “crawl stats” page.
  • UK Web Archive – A “Wayback Machine” for UK web sites. Provided by the British Library. Searchable, but no raw data that I can see. They in turn sponsor the Web Curator Tool, which is an open-source workflow management application for selective web archiving (driver for Heritrix).
  • Isara Search – A project sponsored by Isara Charity Organization to build the world’s first non-profit search engine. Based in Thailand, using Nutch. No search/data available yet.
  • ClueWeb09 – The ClueWeb09 dataset was created by the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The dataset consists of 1 billion web pages, in ten languages, collected in January and February 2009. The data is available to researchers who sign a legal agreement and pay $750 for the hard disks needed to store the data.
  • WebBase – The Stanford WebBase project has been collecting topic-focused snapshots of Web sites. All the resulting archives are available to the public via fast download streams. The useragent is WebVac (was Pita). There’s also a web GUI for fetching specific crawl sets.
  • Laboratory for Web Algorithmics – Uses UbiCrawler to create large-scale link graph datasets that can be freely downloaded.

Proposals for Big Data web mining talk

November 16, 2009

I’m going to be giving a talk at the Bay Area ACM data mining SIG in December, and I need to finalize my topic soon – like today 🙂

I was going to expand on my Elastic Web Mining talk (“Web mining for SEO keywords”) from the ACM data mining unconference a few weeks back.

But the fact that I’ll have 10s to 100s of millions of web pages to work with, from the Public Terabyte Dataset crawl, makes me want to apply Mahout to the data.

I tossed out one idea on the Mahout list, looking for input:

  • I’d like to automatically generate a timeline of events.
  • I can extract potential dates from web pages, using simple patterns.
  • I can extract 2-to-4 word terms (skipping those that start or end with stop words) from pages that have extracted dates (see the sketch after this list).
  • And then by the miracle of LDA (latent Dirichlet allocation), I get clusters of date+terms.
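
Here’s a rough, single-page sketch of that term-extraction step (plain Java; the tokenization and the tiny stop word list are just placeholders):

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;

  public class TermExtractor {
      // Placeholder stop word list - a real one would be much longer.
      private static final Set<String> STOP_WORDS =
          new HashSet<String>(Arrays.asList("a", "an", "and", "for", "in", "of", "on", "the", "to"));

      // Return all 2-to-4 word terms that don't start or end with a stop word.
      public static List<String> extractTerms(String pageText) {
          String[] words = pageText.toLowerCase().split("\\W+");
          List<String> terms = new ArrayList<String>();

          for (int start = 0; start < words.length; start++) {
              if (words[start].length() == 0 || STOP_WORDS.contains(words[start])) {
                  continue;
              }

              for (int len = 2; len <= 4 && start + len <= words.length; len++) {
                  String last = words[start + len - 1];
                  if (STOP_WORDS.contains(last)) {
                      continue;
                  }

                  StringBuilder term = new StringBuilder(words[start]);
                  for (int i = start + 1; i < start + len; i++) {
                      term.append(' ').append(words[i]);
                  }
                  terms.add(term.toString());
              }
          }

          return terms;
      }
  }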

But in this example, I don’t actually need LDA – I have my “topic”, which is the date. So it might not be a very good example. And will LDA scale to 100M web pages (which implies many billions of terms)? And how will I handle the same term (e.g. “barack inauguration”) being associated with a cluster of dates, since stories from a range of dates before/after the event will contain that same term?

So it could be a non-starter – I’m hoping for input on feasibility and level of effort. And if somebody else has a suggestion for something simple that could provide interesting/obvious results, I’m all ears.

Thanks!

— Ken

PS – my current fall-back is to just do brute-force map-reduce to come up with lists of terms per unique date, pick the top N, and maybe do some filtering for top-level terms that have too many associated unique dates. Which unfortunately wouldn’t use Mahout, but would be an example of crunching lots of data.
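
For what it’s worth, the grouping and counting part of that brute-force fallback maps pretty directly onto Cascading. A rough sketch, assuming an upstream pipe that already emits one tuple per (date, term) occurrence – the parsing operations and the final top-N selection are only indicated in comments, and the field names are made up:

  import cascading.operation.aggregator.Count;
  import cascading.pipe.Every;
  import cascading.pipe.GroupBy;
  import cascading.pipe.Pipe;
  import cascading.tuple.Fields;

  // Head pipe; upstream Each() operations (not shown) would parse pages into ("date", "term") tuples.
  Pipe pipe = new Pipe("date terms");

  // Count how many times each term occurs for each date.
  pipe = new GroupBy(pipe, new Fields("date", "term"));
  pipe = new Every(pipe, new Count(new Fields("count")));

  // Re-group by date with terms sorted by descending count; a custom Buffer (or similar)
  // could then keep just the top N terms per date and filter terms that span too many dates.
  pipe = new GroupBy(pipe, new Fields("date"), new Fields("count"), true);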

Web Miners vs Web Masters – An Uneasy Truce

November 11, 2009

The life of a webmaster is hard, and web crawlers make it harder

There’s the daily drama of keeping both web site users and web site developers happy. Now mix in the unpredictable side effects of having automated agents hitting the site, and you can see why webmasters might think many web crawlers are evil.

But web crawlers serve a very real, important role in the life of a successful site, and it’s all about traffic. Without search engines like Google and Yahoo/Bing, most sites would be invisible to most users.

Implicit Contracts

An unwritten agreement exists between webmasters and web crawlers, and it reads something like this: you don’t overload my site, and you bring traffic my way. In return, I’ll give you free access to lots of valuable content that I host.

And that’s worked reasonably well, for the past 15 years. Yes, there are crawlers that ignore the Robots Exclusion Standard. And there are crawlers that overload the site by hammering it with lots of simultaneous requests for hours on end. And sometimes a crawler goes a little crazy and spends hours trying to fetch non-existent pages using bogus URLs that it incorrectly derived from content on the site’s pages. For the most part, though, web crawlers try to do the Right Thing, and webmasters can always block rogue crawlers by IP address.

Web Mining != Search Index

But now you’ve got web miners – automated agents that collect data which often doesn’t wind up in a search index. And that means no traffic from searches. And thus the implicit contract has been broken.

It hasn’t happened yet, but I can see a day when many sites set up their robots.txt to allow the major search engines access, and then block everybody else.

What does this mean for the web eco-system? Three things, one for each participant:

  1. Web miners need to crawl extra-super-politely.
  2. Customers need to work with key sites to pick good crawl times.
  3. Web sites need to offer for-fee APIs for data mining.

The first point is the easiest one to solve – never hit a site with more than one simultaneous request, never fetch more than a handful of pages a minute, and respect all robots.txt restrictions.
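
As a concrete illustration of the first point, here’s a minimal per-domain politeness throttle. This is a hypothetical sketch, not Bixo’s actual fetch policy, and it only covers the rate limits – robots.txt handling would be layered on top:

  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ConcurrentMap;

  // One fetch at a time per domain, with a fixed minimum delay between fetches to a domain.
  public class PoliteThrottle {
      private static final long MIN_DELAY_MS = 12 * 1000L;  // roughly 5 pages/minute per domain

      private static class DomainState {
          long lastFetchTime = 0;
      }

      private final ConcurrentMap<String, DomainState> domains =
          new ConcurrentHashMap<String, DomainState>();

      // Blocks until it's polite to fetch from this domain again. Synchronizing on the
      // per-domain state means at most one request per site at a time, without stalling
      // fetches to other domains.
      public void waitForSlot(String domain) throws InterruptedException {
          domains.putIfAbsent(domain, new DomainState());
          DomainState state = domains.get(domain);

          synchronized (state) {
              long delay = state.lastFetchTime + MIN_DELAY_MS - System.currentTimeMillis();
              if (delay > 0) {
                  Thread.sleep(delay);
              }
              state.lastFetchTime = System.currentTimeMillis();
          }
      }
  }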

The second is a bit harder, as it currently requires person-to-person contact with the web site in question. It’s possible to derive these “good crawl times” by varying the request rate based on response performance, so there are work-arounds. But eventually I expect to see an extension to robots.txt that lets the site owner give web crawlers additional clues about good and bad times for crawling.

The last point, about providing APIs, is the most long-term but also the most powerful. There are many web APIs out there, some of which provide access to valuable web data, but few offer a pay-to-play model. Most are rate limited, where you need to cut special deals if you exceed some relatively low daily threshold. Many have serious terms of use restrictions that limit a caller’s ability to actually mine the response data – often the only option is to republish it, with links/attribution back to the originating site.

What would be great is if everybody had a model like Amazon’s AWIS, where X requests cost N dollars. You can decide how much or how little to spend. There aren’t many restrictions on rate/volume or usage. And as a huge added bonus, the data comes back structured, so you don’t have to waste time hand-crafting some fragile, error-prone HTML scraping code.

And a side-note to companies thinking about the API issue – if you don’t provide one, and you block web miners, then you’ll get crawled anyway, in stealth mode by less scrupulous firms. So then everybody loses, since you’ll still be giving free access while taking a performance hit, while companies that need the data pay more to these “stealth crawlers” and get worse results.

Paul O'Rorke summary of elastic web mining talk

November 4, 2009

Paul posted a nice summary of my elastic web mining talk over at his blog. He captured one of the key points I was trying to make when he said:

It was impressive to see how much of the processing was generated by Bixo and Cascading and how only a small fraction of the code needed to be custom coded “by hand.”

That’s a recurring theme when I show workflow graphs (dot files generated by Cascading) for example web mining applications that I’ve created. The real work is in figuring out what needs to be done (defining the workflow), not the coding to create the workflow or the custom bits that need to be added.

[Workflow graph: web mining app workflow]

In the above graph, the purple ovals represent custom code, and of those six I could have cut out two by using existing Cascading operators with some regular expression juju. Add in the new Bixo utility operator for loading URLs into the workflow plus new Tika support for parsing mbox files, and you’re down to two custom operators – parsing the top-level “mailbox archives” page to find the monthly mailbox archives, and scoring the emails.

The blue and yellow ovals are pre-defined Cascading & Bixo operators (respectively).

And while the total workflow looks very complex, this was defined in about a page of Java code.

Elastic Web Mining Talk

November 2, 2009

Here’s the presentation I gave at the ACM data mining unconference on elastic web mining – how to create scalable, reliable and cost effective web mining solutions using an open source stack (Hadoop, Cascading, Bixo) running in Amazon’s Elastic Compute Cloud (EC2).

[slideshare id=2407600&doc=acmuctalk-091102194640-phpapp02]

But I don’t see my notes showing up, so here’s the PDF version with full notes, which make the resulting slides a lot more meaningful.

[slideshare id=2407818&doc=acmtalk-slideshare-091102203022-phpapp02&type=d]

Session writeups for ACM data mining unconference

November 2, 2009

I wound up being the scribe for two sessions at this past Sunday’s ACM data mining unconference.

The first session was on open/public datasets, which are very useful for people working on data mining algorithms.

The second session (last one of the day) was on open source data mining tools. Lots of people at this one, with a nice demo on KNIME and a good discussion of the R language pros/cons for data mining tasks.

Announcing the Public Terabyte Dataset project

November 1, 2009

We’re very excited to announce the Public Terabyte Dataset project.

This is a high-quality crawl of top web sites, using AWS’s Elastic MapReduce, Concurrent’s Cascading workflow API, and Bixo Labs’ elastic web mining platform.

Hosting for the resulting dataset will be provided by Amazon in S3, and it will be freely available to all EC2 users.

In addition, the code used to create and process the dataset will be available for download from http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263

Questions and input on the project can be submitted at https://scaleunlimited.com/PTD/