First Sample of Public Terabyte Dataset

April 21, 2010

Tags: avro, cascading, public terabyte dataset

We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format, instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we’re using Cascading for this project, we have also released a Cascading Avro Scheme to read and write Avro files.
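To make the "splittable" point concrete: an Avro data file stores records in blocks separated by a sync marker, so a reader handed an arbitrary byte range can scan forward to the next marker and start reading from there. The sketch below is a toy illustration of that idea only. It does not use the Avro library or Avro's real file layout. The 4-byte marker, the one-byte length prefix, and the names SyncMarkerSplitDemo, writeBlocks, indexOfSync and readSplit are all assumptions made for this example (real Avro files use a 16-byte marker generated per file).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SyncMarkerSplitDemo {
    // Hypothetical 4-byte marker; real Avro uses a random 16-byte marker per file,
    // which makes an accidental match inside record data very unlikely.
    static final byte[] SYNC = {(byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF};

    // Toy layout: SYNC, then ([1-byte length][payload][SYNC]) once per record.
    // Payloads are assumed to be short ASCII strings (under 128 bytes).
    static byte[] writeBlocks(List<String> records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(SYNC, 0, SYNC.length);
        for (String r : records) {
            byte[] payload = r.getBytes(StandardCharsets.UTF_8);
            out.write(payload.length);
            out.write(payload, 0, payload.length);
            out.write(SYNC, 0, SYNC.length);
        }
        return out.toByteArray();
    }

    // Index of the first SYNC that starts at or after 'from', or -1 if none.
    static int indexOfSync(byte[] data, int from) {
        for (int i = Math.max(from, 0); i + SYNC.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < SYNC.length; j++) {
                if (data[i + j] != SYNC[j]) { match = false; break; }
            }
            if (match) return i;
        }
        return -1;
    }

    // Reads every block whose preceding marker starts inside [start, end).
    // Each block therefore belongs to exactly one split, wherever the cut falls.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> out = new ArrayList<>();
        int syncPos = indexOfSync(data, start);
        while (syncPos >= 0 && syncPos < end) {
            int pos = syncPos + SYNC.length;
            if (pos >= data.length) break;      // trailing marker, no block follows
            int len = data[pos] & 0xFF;
            out.add(new String(data, pos + 1, len, StandardCharsets.UTF_8));
            syncPos = pos + 1 + len;            // the marker that closes this block
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = writeBlocks(Arrays.asList("a.com", "b.org", "c.net"));
        int mid = data.length / 2;              // an arbitrary cut, possibly mid-record
        System.out.println(readSplit(data, 0, mid));
        System.out.println(readSplit(data, mid, data.length));
    }
}
```

The two calls in main stand in for two independent Hadoop tasks: the cut can land in the middle of a record, yet between them the splits return every record exactly once.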

To help you get started with this dataset, we have posted a small sample in S3, in the bixolabs-ptd-demo bucket, along with the Avro JSON schema needed to read the file. For those unfamiliar with working with Avro files, here’s a sample snippet that illustrates one way of reading them:
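
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;

// jsonSchemaFile and avroFile are java.io.File objects for the schema and the data file.
Schema schema = Schema.parse(jsonSchemaFile);
DataFileReader<Object> reader =
    new DataFileReader<Object>(avroFile, new GenericDatumReader<Object>(schema));
while (reader.hasNext()) {
    GenericData.Record obj = (GenericData.Record) reader.next();
    // You can access the fields in this object like this...
    System.out.println(obj.get("AvroDatum-url"));
}
reader.close();
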
Please take a look, and let us know if there’s any missing raw content that you’d want. We’ve intentionally avoided doing post-processing of the results – this is source data for exactly that type of activity.

Filed under:
Uncategorized by kkrugler

7 Responses to “First Sample of Public Terabyte Dataset”

Hadoop User Group Meetup Talk « Elastic Web Mining | Bixo Labs

April 22nd, 2010 at 6:36 pm

[…] to Doug & Scott for that). Vivek mentioned this new project in his recent blog post about our first release of PTD data, and we’re looking forward to others using this to read/write Avro […]
elhoim

April 26th, 2010 at 2:34 am

Where could I get the sample?
kkrugler

April 26th, 2010 at 7:06 am

The sample Avro file is in S3 at /bixolabs-ptd-demo.

The code to read the file is in GitHub at http://github.com/bixolabs/cascading.avro

Though note that for individual files, you can just use something like:

Schema s = Schema.parse(new File("ptd-sample.json"));
DatumReader<GenericData.Record> dr = new GenericDatumReader<GenericData.Record>(s);
DataFileStream<GenericData.Record> in =
    new DataFileStream<GenericData.Record>(new FileInputStream(new File("ptd-sample.avro")), dr);
while (in.hasNext()) {
    GenericData.Record o = in.next();
    // String fields come back as Avro Utf8 objects; the charset may be missing or empty.
    Utf8 charset = (Utf8) o.get("AvroDatum-charset");
    if (charset != null && charset.getLength() > 0) {
        System.out.printf("%s\n", charset);
    }
}
anonymous

October 17th, 2010 at 8:15 pm

Can someone tell me how to access the actual dataset and not the sample file?
kkrugler

October 18th, 2010 at 9:54 am

Hi anonymous,

The actual dataset hasn’t been released yet – see previous comments as to why, and my best guess re timing.

Regards,

— Ken
Anonymous

October 18th, 2010 at 3:32 pm

Thanks, but I can’t find your comment explaining the reason behind the delay. Anyway, I am desperately waiting for this data. Will there be any restrictions on using this data for commercial purposes when it is released?
kkrugler

October 18th, 2010 at 4:17 pm

Hi anonymous,

My mistake, the “previous comments” reference was for this page: http://bixolabs.com/datasets/public-terabyte-dataset-project/

As far as restrictions on commercial use go – that’s going to be up to Amazon’s lawyers, but in general your use of this publicly available web data is subject to the same conditions as if you had crawled it yourself. That means respecting copyrights, removing data upon notice by the owner, and so on.

Regards,

— Ken
