<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Big Data Solutions &#124; Scale Unlimited</title>
	<atom:link href="http://www.scaleunlimited.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.scaleunlimited.com</link>
	<description>Hadoop, Solr and Cascading consulting and training</description>
	<lastBuildDate>Thu, 02 Feb 2012 18:15:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>A (very) short intro to Hadoop</title>
		<link>http://www.scaleunlimited.com/2011/12/19/a-very-short-intro-to-hadoop/</link>
		<comments>http://www.scaleunlimited.com/2011/12/19/a-very-short-intro-to-hadoop/#comments</comments>
		<pubDate>Mon, 19 Dec 2011 14:59:17 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=600</guid>
		<description><![CDATA[And here are the slides from the short talk on Hadoop I gave at the BigDataCamp event held in Washington DC.]]></description>
			<content:encoded><![CDATA[<p>And here are the slides from the short talk on Hadoop I gave at the BigDataCamp event held in Washington DC.</p>
<div style="width:425px" id="__ss_10637386"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/kkrugler/a-very-short-intro-to-hadoop" title="A (very) short intro to Hadoop" target="_blank">A (very) short intro to Hadoop</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/10637386" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/kkrugler" target="_blank">Ken Krugler</a> </div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.scaleunlimited.com/2011/12/19/a-very-short-intro-to-hadoop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A (very) short history of big data</title>
		<link>http://www.scaleunlimited.com/2011/12/19/a-very-short-history-of-big-data/</link>
		<comments>http://www.scaleunlimited.com/2011/12/19/a-very-short-history-of-big-data/#comments</comments>
		<pubDate>Mon, 19 Dec 2011 14:47:51 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=597</guid>
		<description><![CDATA[I finally got around to posting slides from the lightning talk I gave at the BigDataCamp event held in Washington, DC this past November.]]></description>
			<content:encoded><![CDATA[<p>I finally got around to posting slides from the lightning talk I gave at the <a href="http://www.bigdatacamp.org/dc/2011-11-07/" target="_blank">BigDataCamp</a> event held in Washington, DC this past November.</p>
<div style="width:425px" id="__ss_10632926"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/kkrugler/a-very-short-history-of-big-data" title="A (very) short history of big data" target="_blank">A (very) short history of big data</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/10632926" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/kkrugler" target="_blank">Ken Krugler</a> </div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.scaleunlimited.com/2011/12/19/a-very-short-history-of-big-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bay Area Hadoop User Group talk</title>
		<link>http://www.scaleunlimited.com/2011/09/03/bay-area-hadoop-user-group-talk/</link>
		<comments>http://www.scaleunlimited.com/2011/09/03/bay-area-hadoop-user-group-talk/#comments</comments>
		<pubDate>Sat, 03 Sep 2011 14:12:39 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[event]]></category>

		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=454</guid>
		<description><![CDATA[Last week I gave a talk at the August HUG meetup on my current favorite topic &#8211; using search (or rather, Solr as a NoSQL solution) to improve big data analytics. It&#8217;s the same general theme I covered at the Basis Technology conference in June &#8211; Hadoop is often used to convert petabytes of data [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I gave a talk at the <a href="http://www.meetup.com/hadoop/events/16783613/">August HUG meetup</a> on my current favorite topic &#8211; using search (or rather, Solr as a NoSQL solution) to improve big data analytics.</p>
<p>It&#8217;s the same general theme I covered at the Basis Technology conference in June &#8211; Hadoop is often used to convert petabytes of data into pie charts, but without the ability to poke at the raw data, it&#8217;s often hard to understand and validate those results.</p>
<p>In the good old days of small data, you could pull out spreadsheets and dive into the raw data, but that&#8217;s no longer feasible when you&#8217;re processing multi-terabyte datasets.</p>
<p>Solr provides a way to query data efficiently, using it as a poor man&#8217;s NoSQL key-value store. Using something like the <a href="https://github.com/bixolabs/cascading.solr/">Cascading Solr scheme</a> we created, it&#8217;s trivial to generate a Solr index as part of the workflow. And setting up an on-demand Solr instance is also easy, so you once again have the ability to see (query/count/inspect) the data behind the curtain.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scaleunlimited.com/2011/09/03/bay-area-hadoop-user-group-talk/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bixo Labs/Cascading case study posted</title>
		<link>http://www.scaleunlimited.com/2011/09/02/bixo-labscascading-case-study-posted/</link>
		<comments>http://www.scaleunlimited.com/2011/09/02/bixo-labscascading-case-study-posted/#comments</comments>
		<pubDate>Fri, 02 Sep 2011 17:22:15 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cascading]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[solr]]></category>

		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=448</guid>
		<description><![CDATA[We&#8217;re heavy users of the Cascading open source project, which lets us quickly build Hadoop-based workflows to solve custom data processing problems. Concurrent recently posted a Bixo Labs Case Study that describes how we use Cascading, and the benefits to us (and thus to our customers). They also listed the various Cascading-related open source projects [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re heavy users of the <a href="http://www.cascading.org/">Cascading</a> open source project, which lets us quickly build Hadoop-based workflows to solve custom data processing problems.</p>
<p>Concurrent recently posted a <a href="http://www.concurrentinc.com/casestudies/bixo_labs">Bixo Labs Case Study</a> that describes how we use Cascading, and the benefits to us (and thus to our customers). They also listed the various Cascading-related open source projects we sponsor, including the <a href="http://github.com/bixolabs/cascading.solr/">Solr scheme</a> that makes it trivial to generate Solr search indexes from a scalable workflow.</p>
<p>I even had to create one of those classic, vacuous architectural diagrams&#8230;</p>
<p><a href="http://www.scaleunlimited.com/wp-content/uploads/2011/09/Bixo-Labs-Architecture.png"><img src="http://www.scaleunlimited.com/wp-content/uploads/2011/09/Bixo-Labs-Architecture-300x185.png" alt="" title="Bixo Labs Architecture" width="300" height="185" class="alignnone size-medium wp-image-452" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.scaleunlimited.com/2011/09/02/bixo-labscascading-case-study-posted/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Talk on using search with big data analytics</title>
		<link>http://www.scaleunlimited.com/2011/07/08/talk-on-using-search-with-big-data-analytics/</link>
		<comments>http://www.scaleunlimited.com/2011/07/08/talk-on-using-search-with-big-data-analytics/#comments</comments>
		<pubDate>Fri, 08 Jul 2011 17:20:40 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[government]]></category>
		<category><![CDATA[presentation]]></category>

		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=393</guid>
		<description><![CDATA[A few weeks back I was at the Basis Technology Government Users Conference in Washington, DC. It was an interesting experience, meeting people from agencies responsible for processing lots of important data. One thing I noticed is that in the Bay Area, your name tag at an event tries to convey that you&#8217;re working on [...]]]></description>
			<content:encoded><![CDATA[<p>A few weeks back I was at the Basis Technology Government Users Conference in Washington, DC. It was an interesting experience, meeting people from agencies responsible for processing lots of important data. One thing I noticed is that in the Bay Area, your name tag at an event tries to convey that you&#8217;re working on super-cool stuff. Here in DC, it&#8217;s more cool to be classified. For example, name tags that say &#8220;USG&#8221; &#8211; a generic term for &#8220;US Government&#8221;, and a common code term for &#8220;That&#8217;s Classified&#8221;.</p>
<p>My talk was about how search (at scale) is becoming a critical component of big data analytics. Without the ability to poke at the raw data, it&#8217;s very hard to validate and understand the high level results of processing lots and lots of bits down to a few graphs and tables.</p>
<p>Basis has published the slides <a href="http://www.basistech.com/pdf/events/government-users-conference/guc-2011-krugler-seeing-the-forest.pdf" target="_blank">here</a>, for your reading pleasure.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scaleunlimited.com/2011/07/08/talk-on-using-search-with-big-data-analytics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cascading Avro Tap performance</title>
		<link>http://www.scaleunlimited.com/2011/03/18/cascading-avro-tap-performance/</link>
		<comments>http://www.scaleunlimited.com/2011/03/18/cascading-avro-tap-performance/#comments</comments>
		<pubDate>Fri, 18 Mar 2011 20:06:47 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[avro]]></category>
		<category><![CDATA[cascading]]></category>

		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=388</guid>
		<description><![CDATA[Back in January, Matt Pouttu-Clarke posted his results from using the Cascading Avro tap we&#8217;d created a while back. The most interesting result was comparing performance between parsing CSV files and reading Avro files: 13.5x faster is a nice improvement over the very common practice of using text files for information exchange. Side note: we [...]]]></description>
			<content:encoded><![CDATA[<p>Back in January, Matt Pouttu-Clarke <a href="http://mpouttuclarke.wordpress.com/2011/01/13/cascading-avro/">posted his results</a> from using the <a href="https://github.com/bixolabs/cascading.avro">Cascading Avro tap</a> we&#8217;d created a while back.</p>
<p>The most interesting result was comparing performance between parsing CSV files and reading Avro files:</p>
<div id="attachment_389" class="wp-caption alignnone" style="width: 310px"><a href="http://www.scaleunlimited.com/wp-content/uploads/2011/03/avro-parse-sec.png"><img src="http://www.scaleunlimited.com/wp-content/uploads/2011/03/avro-parse-sec-300x180.png" alt="Avro vs CSV parsing time" title="avro-parse-sec" width="300" height="180" class="size-medium wp-image-389" /></a><p class="wp-caption-text">Time to parse files (shorter is better)</p></div>
<p>13.5x faster is a nice improvement over the very common practice of using text files for information exchange.</p>
<p>Side note: we recently released the 1.0 version, and pushed it to the <a href="http://conjars.org/com.bixolabs/cascading.avro">Conjars repository</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scaleunlimited.com/2011/03/18/cascading-avro-tap-performance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Presenting at Strata Conference Tutorial on Hadoop</title>
		<link>http://www.scaleunlimited.com/2011/01/27/presenting-at-strata-conference-tutorial-on-hadoop/</link>
		<comments>http://www.scaleunlimited.com/2011/01/27/presenting-at-strata-conference-tutorial-on-hadoop/#comments</comments>
		<pubDate>Thu, 27 Jan 2011 22:37:00 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AWS]]></category>
		<category><![CDATA[emr]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=383</guid>
		<description><![CDATA[This coming Tuesday, Feb 1st I&#8217;ll be helping at the &#8220;How to Develop Big Data Applications for Hadoop&#8221; tutorial. My specific sections will cover the &#8220;why&#8221; of using Amazon Web Services for Hadoop (hint &#8211; scaling, simplicity, savings) and the &#8220;how&#8221; &#8211; mostly discussing the nuts and bolts of running Hadoop jobs using Elastic MapReduce. [...]]]></description>
			<content:encoded><![CDATA[<table>
<tr>
<td valign="top">
<a href="http://strataconf.com"><br />
<img src="http://assets.en.oreilly.com/1/event/55/strata2011_spkr_210x60.jpg" width="210" height="60"  border="0"  alt="Strata 2011" title="Strata 2011"  /><br />
</a>
</td>
<td width="400" align="center">
This coming Tuesday, Feb 1st I&#8217;ll be helping at the &#8220;<a href="http://strataconf.com/strata2011/public/schedule/detail/17028" target="_blank">How to Develop Big Data Applications for Hadoop</a>&#8221; tutorial.
</td>
</tr>
</table>
<p>My specific sections will cover the &#8220;why&#8221; of using <a href="http://aws.amazon.com/">Amazon Web Services</a> for Hadoop (hint &#8211; scaling, simplicity, savings) and the &#8220;how&#8221; &#8211; mostly discussing the nuts and bolts of running Hadoop jobs using <a href="http://aws.amazon.com/elasticmapreduce/">Elastic MapReduce</a>. I&#8217;ll also be roaming the room during the hands-on section, helping out the attendees.</p>
<p>I&#8217;m looking forward to the tutorial, and also the <a href="http://strataconf.com/strata2011">Strata Conference</a> itself. Lots of interesting topics, and people (like <a href="http://www.readwriteweb.com/hack/author/pete-warden.php">Pete Warden</a>) that I&#8217;ve always wanted to meet.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scaleunlimited.com/2011/01/27/presenting-at-strata-conference-tutorial-on-hadoop/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Focused web crawling</title>
		<link>http://www.scaleunlimited.com/2010/06/18/focused-web-crawling/</link>
		<comments>http://www.scaleunlimited.com/2010/06/18/focused-web-crawling/#comments</comments>
		<pubDate>Fri, 18 Jun 2010 14:50:38 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[web crawler]]></category>

		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=277</guid>
		<description><![CDATA[Recently some customers have been asking for a more concrete description of how we handle &#8220;focused web crawling&#8221; at Bixo Labs. After answering the same questions a few times, it seemed like a good idea to post details to our web site &#8211; thus the new page titled Focused Crawling. The basic concepts are straightforward, [...]]]></description>
			<content:encoded><![CDATA[<p>Recently some customers have been asking for a more concrete description of how we handle &#8220;focused web crawling&#8221; at Bixo Labs.</p>
<p>After answering the same questions a few times, it seemed like a good idea to post details to our web site &#8211; thus the new page titled <a href="/about/focused-crawler" target="_self">Focused Crawling</a>.</p>
<p>The basic concepts are straightforward, and very similar to what we did at Krugle to efficiently find web pages that were likely to be of interest to software developers. In Bixo Labs we&#8217;ve generalized the concept a bit, and implemented it using Bixo and a Cascading workflow. This gives us a lot more flexibility when it comes to customizing the behavior, as well as making it easier for us to work with customer-provided code for extension points such as scoring pages.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scaleunlimited.com/2010/06/18/focused-web-crawling/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop User Group Meetup Talk</title>
		<link>http://www.scaleunlimited.com/2010/04/22/hadoop-user-group-meetup-talk/</link>
		<comments>http://www.scaleunlimited.com/2010/04/22/hadoop-user-group-meetup-talk/#comments</comments>
		<pubDate>Fri, 23 Apr 2010 01:35:34 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[avro]]></category>
		<category><![CDATA[cascading]]></category>
		<category><![CDATA[elastic mapreduce]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[public terabyte dataset]]></category>
		<category><![CDATA[simpledb]]></category>

		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=239</guid>
		<description><![CDATA[Last night I did a presentation at the April Hadoop Bay Area User Group meetup, hosted by Yahoo. 250+ people in attendance, so the interest in Hadoop continues to grow. Dekel has posted the slides of my talk, as well as a (very quiet) video. My talk was on the status of the Public Terabyte [...]]]></description>
			<content:encoded><![CDATA[<p>Last night I did a presentation at the <a href="http://www.meetup.com/hadoop/calendar/13002132/" target="_blank">April Hadoop Bay Area User Group meetup</a>, hosted by Yahoo. 250+ people in attendance, so the interest in Hadoop continues to grow.</p>
<p>Dekel has posted the <a href="http://www.slideshare.net/hadoopusergroup/bixo-hug-talk" target="_blank">slides</a> of my talk, as well as a (very quiet) <a href="http://www.youtube.com/watch?v=VIIi8DjQbzI&amp;feature=channel" target="_blank">video</a>.</p>
<p>My talk was on the status of the <a href="http://www.scaleunlimited.com/datasets/public-terabyte-dataset-project/" target="_blank">Public Terabyte Dataset (PTD) project</a>, and advice on running jobs in Amazon&#8217;s <a href="http://aws.amazon.com/elasticmapreduce/" target="_blank">Elastic MapReduce</a> (EMR) cloud. As part of the PTD architecture, we wound up using Amazon&#8217;s <a href="http://aws.amazon.com/simpledb/" target="_blank">SimpleDB</a> for storing the crawl DB, thus one section of my talk was on what we learned about using that to efficiently and inexpensively save persistent data (crawl state) while still using EMR for bursty processing. I&#8217;d previously blogged about our <a href="http://www.scaleunlimited.com/2010/03/16/simpledb-tap-for-cascading/" target="_blank">SimpleDB tap &amp; scheme for Cascading</a>, and our use of it for PTD has helped shake out some bugs.</p>
<p>As well, we decided to use <a href="http://hadoop.apache.org/avro/" target="_blank">Apache Avro</a> for our output format. This meant creating a Cascading scheme, which would have been pretty painful but for the fortuitous, just-in-time release of Hadoop mapreduce support code in the Avro project (thanks to Doug &amp; Scott for that). Vivek mentioned this new project in his recent blog post about our <a href="http://www.scaleunlimited.com/2010/04/21/first-sample-of-public-terabyte-dataset/" target="_blank">first release of PTD data</a>, and we&#8217;re looking forward to others using this to read/write Avro files.</p>
<p>The real-world use case I described in my talk was analyzing the quality of the <a href="http://lucene.apache.org/tika/" target="_blank">Tika</a> charset detection, using HTML data from our initial crawl dataset. The results showed plenty of room for improvement <img src='http://www.scaleunlimited.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<div id="attachment_242" class="wp-caption alignnone" style="width: 610px"><a href="http://69.89.27.220/~bixolabs/wp-content/uploads/2010/04/charset-analysis.png"><img class="size-full wp-image-242" title="charset-analysis" src="http://69.89.27.220/~bixolabs/wp-content/uploads/2010/04/charset-analysis.png" alt="" width="600" height="388" /></a><p class="wp-caption-text">Tika accuracy detecting character sets</p></div>
<p>The real point of this use case wasn&#8217;t to point out problems with Tika, but rather to demonstrate how easy it is to use the dataset to perform this type of analysis. Which means it&#8217;s also easy to compare alternative algorithms, and improve the Tika support with a large enough dataset to inspire confidence in the end results.</p>
<p>As an aside, Ted Dunning might be using this data &amp; Mahout to train a better charset and/or language classifier, which would be a really nice addition to the Tika project. The same thing could obviously be done for language detection, which currently also suffers from similar accuracy issues, as well as being a CPU cycle hog.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scaleunlimited.com/2010/04/22/hadoop-user-group-meetup-talk/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>First Sample of Public Terabyte Dataset</title>
		<link>http://www.scaleunlimited.com/2010/04/21/first-sample-of-public-terabyte-dataset/</link>
		<comments>http://www.scaleunlimited.com/2010/04/21/first-sample-of-public-terabyte-dataset/#comments</comments>
		<pubDate>Wed, 21 Apr 2010 16:01:35 +0000</pubDate>
		<dc:creator>kkrugler</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[avro]]></category>
		<category><![CDATA[cascading]]></category>
		<category><![CDATA[public terabyte dataset]]></category>

		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=230</guid>
		<description><![CDATA[We are excited that the Public Terabyte Dataset project is starting to release data. We decided to go with the Avro file format, instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we&#8217;re using Cascading for this project, we have also released a Cascading Avro Scheme to read and [...]]]></description>
			<content:encoded><![CDATA[<p>We are excited that the <a href="http://www.scaleunlimited.com/datasets/public-terabyte-dataset-project/">Public Terabyte Dataset</a> project is starting to release data. We decided to go with the <a href="http://hadoop.apache.org/avro/">Avro</a> file format, instead of WARC, as Avro is more efficient (easily splittable by Hadoop) and cross-language. Since we&#8217;re using <a href="http://www.cascading.org/">Cascading</a> for this project, we have also released a Cascading <a href="http://github.com/bixolabs/cascading.avro">Avro Scheme</a> to read and write Avro files.</p>
<p>To help you get started with this dataset, we have posted a small sample of it in S3, in the bixolabs-ptd-demo bucket. Alongside it is the <a href="http://s3.amazonaws.com/bixolabs-ptd-demo/ptd-sample.json">Avro JSON</a> schema needed to read the file. For those unfamiliar with working with Avro files, here&#8217;s a sample snippet that illustrates one way of reading them:<br />
<code><br />
Schema schema = Schema.parse(jsonSchemaFile);<br />
DataFileReader&lt;Object&gt; reader = new DataFileReader&lt;Object&gt;(avroFile, new GenericDatumReader&lt;Object&gt;(schema));<br />
while (reader.hasNext()) {<br />
GenericData.Record obj = (GenericData.Record) reader.next();<br />
// You can access the fields in this object like this...<br />
System.out.println(obj.get("AvroDatum-url"));<br />
}<br />
</code><br />
Please take a look, and let us know if there&#8217;s any missing raw content that you&#8217;d want. We&#8217;ve intentionally avoided doing post-processing of the results &#8211; this is source data for exactly that type of activity.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.scaleunlimited.com/2010/04/21/first-sample-of-public-terabyte-dataset/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

