Public Terabyte Dataset Project

This page has more details on the Public Terabyte Dataset project, which was recently announced at the ACM data mining unconference.

  • The data comes from a crawl of 50-200M pages from the top 1 million domains by US-based traffic (the original target was the top 100K).
  • The crawl is done by a custom Bixo workflow created by Scale Unlimited, built on top of Cascading/Hadoop and running in EC2 using Amazon’s Elastic MapReduce service.
  • We’ll be trying hard to avoid spam/adult content, though getting totally clean results is of course impossible.
  • We honor the robots nofollow and various “no archive” HTML meta tags, and we’ll comply promptly with requests from website owners to remove any of their content from the datasets. Note that the rel="nofollow" attribute on individual links does not mean that the link shouldn’t be followed; it means that the link shouldn’t be counted when calculating a “PageRank” score.
  • The resulting data will be stored as compressed Avro files in S3 (the original plan was WARC).
  • Hosting for the dataset is being provided by Amazon.
  • Access to the data is free, assuming you’re running code in EC2.
  • The code used to run the crawl, as well as code to access the crawl data, will be available at http://developer.amazonwebservices.com/connect/kbcategory.jspa?categoryID=263.
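
To make the meta tag handling above concrete, here is a minimal Python sketch of how a crawler might read the robots meta tag on a page. It is an illustration only, not the Bixo implementation: the names `RobotsMetaParser` and `crawl_policy` are made up for this example, and treating `none` as implying `nofollow` follows the common convention that `none` is shorthand for `noindex, nofollow`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalize them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def crawl_policy(html):
    """Hypothetical helper: decide what a polite crawler may do with a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    return {
        # Page-level nofollow: do not follow any outlinks from this page.
        "follow_links": not ({"nofollow", "none"} & found),
        # noarchive: do not keep a stored copy of the page content.
        "may_archive": "noarchive" not in found,
    }


page = '<html><head><meta name="robots" content="NOFOLLOW, noarchive"></head></html>'
policy = crawl_policy(page)
```

Note that this only looks at the page-level meta tag. A rel="nofollow" attribute on a single link is deliberately ignored here, since it affects link scoring rather than whether the link may be fetched.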

There’s a form where you can request information and provide input on the crawl.

13 Responses
  1. Dan
    January 25, 2010

    How did you get the top 100K websites by traffic?
    Is the dataset all pages from those websites?
    How did you get Amazon to host the data?
    What do you view as the value of this dataset?

  2. January 25, 2010

    Hi Dan,

    1. Top sites by traffic come from Alexa web services API.
    2. The dataset focuses on these sites, but expands to include others.
    3. Amazon is helping because they are interested in useful public datasets and examples of effectively using EMR.
    4. The dataset has highest value for people working on text processing algorithms, and for doing performance baselining/optimizations.

    – Ken

  3. elhoim
    March 3, 2010

    Alexa and quantcast both have a free “top 1 million” list that is updated daily.

  4. March 3, 2010

    Hi Joel,

    Excellent input, thanks! The Alexa web service API offers a bit more information (e.g. is it “adult”), but it feels out of date (e.g. no walmart.com data). By merging these together, I’ll have a much better seed list.

    – Ken

  5. elhoim
    March 3, 2010

    premiumdrops.com is also offering copies of the .com/.net/.org zones if you want to build a wide seed list.

  6. March 3, 2010

    @elhoim – I hadn’t found the domain lists on premiumdrops.com …good stuff.

    Sounds like you’ve been poking around this space a bit in the past :)

    – Ken

  7. August 17, 2010

    What is the status of the crawl? When will a sample be released?

    This dataset will make a lot of people very happy.

  8. kkrugler
    September 15, 2010

    Hi Joseph,

    I’d love to release the dataset too :)

    We’d run into a cost issue using AWS’s SimpleDB for a large number of links (e.g. 1 billion), so we had to revert to storing the crawl state (aka CrawlDB) in Hadoop SequenceFiles. That, plus wanting to use spot instances for better pricing, meant reworking some of the (apparently abandoned) hadoop-ec2 scripts that come with Hadoop.

    I think we should have a dataset (maybe not the full terabyte, but big) ready in a month, since we’ve dealt with the above two issues.

    Regards,

    – Ken

  9. October 9, 2010

    Ken,

    Great! I am most excited to see the dataset when it is available. Let me know if you would like me to write followup posts to help you announce it.

    I would also be interested in seeing a writeup of the technical challenges you encountered, and how you resolved them. But I’m most eager to play with the actual data!
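
The seed-list merge discussed in the comments above (combining the Alexa and Quantcast top-site lists) could look roughly like the Python sketch below. This is an assumption-laden illustration, not the code used for the crawl: the function names are hypothetical, the input is assumed to be `rank,domain` CSV text, and the sample rows are invented. A real list may use a different layout and would need its own parser.

```python
import csv
import io


def read_ranked_domains(text):
    """Parse 'rank,domain' CSV text (an assumed format) into (rank, domain) pairs."""
    pairs = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2 or not row[0].strip().isdigit():
            continue  # skip header lines, blank lines, and malformed rows
        pairs.append((int(row[0]), row[1].strip().lower()))
    return pairs


def merge_seed_lists(*ranked_lists):
    """Union several ranked lists, keeping each domain's best (lowest) rank."""
    best = {}
    for pairs in ranked_lists:
        for rank, domain in pairs:
            if domain and (domain not in best or rank < best[domain]):
                best[domain] = rank
    # Order by best rank, breaking ties alphabetically for stable output.
    return sorted(best, key=lambda d: (best[d], d))


# Invented sample data, for illustration only.
list_a = "1,google.com\n2,facebook.com\n3,Example.com\n"
list_b = "Rank,Domain\n1,google.com\n2,walmart.com\n"

seeds = merge_seed_lists(read_ranked_domains(list_a), read_ranked_domains(list_b))
```

Taking the union means a domain missing from one source (walmart.com in the comment's example) still makes it into the seed list, while duplicates collapse to a single entry.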

Trackbacks & Pingbacks

  1. Bixolabs goes public « Ken's Techno Tidbits
  2. The web is an endless series of edge cases « Ken's Techno Tidbits
  3. First Sample of Public Terabyte Dataset « Elastic Web Mining | Bixo Labs
  4. Hadoop User Group Meetup Talk « Elastic Web Mining | Bixo Labs
