First Sample of Public Terabyte Dataset

April 21, 2010

We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format instead of WARC, because Avro files are easily splittable by Hadoop, which makes them more efficient to process, and because Avro is cross-language. Since we’re using Cascading for this project, we have also released a Cascading Avro Scheme to read and write Avro files.

To help you get started with this dataset, we have posted a small sample in S3 in the bixolabs-ptd-demo bucket, along with the Avro JSON schema needed to read the file. For those unfamiliar with working with Avro files, here’s a sample snippet that illustrates one way of reading them:

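import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;

// jsonSchemaFile and avroFile are java.io.File objects for the schema and the sample data
Schema schema = Schema.parse(jsonSchemaFile);
DataFileReader<Object> reader = new DataFileReader<Object>(avroFile, new GenericDatumReader<Object>(schema));
while (reader.hasNext()) {
    GenericData.Record obj = (GenericData.Record) reader.next();
    // You can access the fields in this object like this...
    System.out.println(obj.get("AvroDatum-url"));
}
reader.close();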
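
Before handing a downloaded file to DataFileReader, it can be useful to confirm that it really is an Avro container file and not, say, an S3 error page. The sketch below is one way to do that with only the Java standard library. It assumes the sample follows the Avro 1.x object container format, which begins with the four bytes 'O', 'b', 'j', 1. The class name AvroMagicCheck and its methods are hypothetical helpers written for this illustration; they are not part of Avro or of our Cascading Avro Scheme.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Avro or cascading.avro.
class AvroMagicCheck {
    // Assumption: Avro 1.x object container files start with 'O', 'b', 'j', 1.
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // Returns true if the given bytes begin with the Avro container magic.
    static boolean hasAvroMagic(byte[] header) {
        return header.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    // Reads the first four bytes of a stream and checks them against the magic.
    static boolean hasAvroMagic(InputStream in) throws IOException {
        byte[] header = new byte[MAGIC.length];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                return false; // the stream ended before four bytes were read
            }
            read += n;
        }
        return hasAvroMagic(header);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AvroMagicCheck <file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(hasAvroMagic(in)
                    ? "looks like an Avro container file"
                    : "does not look like an Avro container file");
        }
    }
}
```

A passing check only says the header looks right. Reading the records with the schema, as in the snippet above, is still the real test.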

Please take a look, and let us know if there’s any missing raw content that you’d want. We’ve intentionally avoided doing post-processing of the results – this is source data for exactly that type of activity.

7 Responses to “First Sample of Public Terabyte Dataset”

  1. […] to Doug & Scott for that). Vivek mentioned this new project in his recent blog post about our first release of PTD data, and we’re looking forward to others using this to read/write Avro […]

  2. Where could I get the sample?

  3. The sample Avro file is in S3 at /bixolabs-ptd-demo.

    The code to read the file is in GitHub at http://github.com/bixolabs/cascading.avro

    Though note that for individual files, you can just use something like:


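    Schema s = Schema.parse(new File("ptd-sample.json"));

    // Typing the reader and the stream lets in.next() return a GenericData.Record directly
    DatumReader<GenericData.Record> dr = new GenericDatumReader<GenericData.Record>(s);
    DataFileStream<GenericData.Record> in = new DataFileStream<GenericData.Record>(new FileInputStream(new File("ptd-sample.avro")), dr);

    while (in.hasNext()) {
        GenericData.Record o = in.next();
        // Print the charset only for records that have a non-empty one
        Utf8 charset = (Utf8) o.get("AvroDatum-charset");
        if (charset != null && charset.getLength() > 0) {
            System.out.printf("%s\n", charset);
        }
    }
    in.close();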

  4. Can someone tell me how to access the actual dataset and not the sample file?

  5. Hi anonymous,

    The actual dataset hasn’t been released yet – see previous comments as to why, and my best guess re timing.

    Regards,

    — Ken

  6. Thanks. But I can’t find your comment where you explain the reason behind the delay. Anyway, I am desperately waiting for this data. Will there be any restrictions on using this data for commercial purposes when it is released?

  7. Hi anonymous,

    My mistake, the “previous comments” reference was for this page: http://bixolabs.com/datasets/public-terabyte-dataset-project/

    As far as restrictions on commercial use – that’s going to be up to Amazon’s lawyers, but in general your use of this publicly available web data is subject to the same conditions as if you crawled it yourself. That means respecting copyrights, removing data upon notice by the owner, and so on.

    Regards,

    — Ken