
First Sample of Public Terabyte Dataset

2010 April 21
by kkrugler

We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format instead of WARC, as Avro is more efficient (its files are easily splittable by Hadoop) and cross-language. Since we’re using Cascading for this project, we have also released a Cascading Avro Scheme for reading and writing Avro files.

To help you get started with this dataset, we have posted a small sample in S3 in the bixolabs-ptd-demo bucket, along with the Avro JSON schema needed to read the file. For those unfamiliar with working with Avro files, here’s a sample snippet that illustrates one way of reading them:

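import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;

// jsonSchemaFile and avroFile are assumed to be java.io.File objects.
Schema schema = Schema.parse(jsonSchemaFile);
DataFileReader<Object> reader = new DataFileReader<Object>(avroFile, new GenericDatumReader<Object>(schema));
while (reader.hasNext()) {
    GenericData.Record obj = (GenericData.Record) reader.next();
    // You can access the fields in this object like this...
    System.out.println(obj.get("AvroDatum-url"));
}
reader.close();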
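
Before running a reader over a downloaded file, it can help to confirm that the file really is an Avro container file. The Avro specification says object container files begin with the four bytes 'O', 'b', 'j', 1. The sketch below checks for that header using only the JDK. The class name AvroMagicCheck and the default file name ptd-sample.avro are assumptions made for illustration; they are not part of the released code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the Public Terabyte Dataset code.
class AvroMagicCheck {
    // Avro object container files begin with the bytes 'O', 'b', 'j', 1.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** Returns true if the given bytes start with the Avro container magic. */
    static boolean hasAvroMagic(byte[] header) {
        if (header == null || header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads the first four bytes of the file and compares them to the magic. */
    static boolean isAvroContainerFile(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            byte[] header = new byte[MAGIC.length];
            int read = 0;
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // file is shorter than the magic
                }
                read += n;
            }
            return read == header.length && hasAvroMagic(header);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed local name for the downloaded sample; pass another path as an argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "ptd-sample.avro");
        if (!Files.exists(path)) {
            System.out.println("File not found: " + path);
            return;
        }
        System.out.println(path + " has Avro container header: " + isAvroContainerFile(path));
    }
}
```

A file that fails this check (for example, an error page saved in place of the data) would also fail in the Avro reader, but with a less obvious message.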

Please take a look, and let us know if there’s any missing raw content that you’d want. We’ve intentionally avoided doing post-processing of the results – this is source data for exactly that type of activity.

7 Responses
  1. elhoim permalink
    April 26, 2010

    Where can I get the sample?

  2. April 26, 2010

    The sample Avro file is in S3 at /bixolabs-ptd-demo.

    The code to read the file is in GitHub at http://github.com/bixolabs/cascading.avro

    Though note that for individual files, you can just use something like:


    Schema s = Schema.parse(new File("ptd-sample.json"));

    // Typed reader and stream, so that in.next() returns a GenericData.Record without a cast.
    DatumReader<GenericData.Record> dr = new GenericDatumReader<GenericData.Record>(s);
    DataFileStream<GenericData.Record> in = new DataFileStream<GenericData.Record>(new FileInputStream(new File("ptd-sample.avro")), dr);

    while (in.hasNext()) {
        GenericData.Record o = in.next();
        Utf8 charset = (Utf8) o.get("AvroDatum-charset");
        if (charset != null && charset.getLength() > 0) {
            System.out.printf("%s\n", charset);
        }
    }

  3. anonymous permalink
    October 17, 2010

    Can someone tell me how to access the actual dataset, not just the sample file?

  4. kkrugler permalink*
    October 18, 2010

    Hi anonymous,

    The actual dataset hasn’t been released yet – see previous comments as to why, and my best guess re timing.

    Regards,

    – Ken

  5. Anonymous permalink
    October 18, 2010

    Thanks. But I can’t find your comment where you explain the reason behind the delay. Anyway, I am desperately waiting for this data. Will there be any restrictions on using this data for commercial purposes when it is released?

  6. kkrugler permalink*
    October 18, 2010

    Hi anonymous,

    My mistake, the “previous comments” reference was for this page: http://bixolabs.com/datasets/public-terabyte-dataset-project/

    As far as restrictions on commercial use – that’s going to be up to Amazon’s lawyers, but in general your use of this publicly available web data is subject to the same conditions as if you crawled it yourself. That means requirements such as respecting copyrights and removing data upon notice by the owner all apply.

    Regards,

    – Ken

Trackbacks & Pingbacks

  1. Hadoop User Group Meetup Talk « Elastic Web Mining | Bixo Labs
