<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments for Big Data Solutions | Scale Unlimited</title>
	<atom:link href="http://www.scaleunlimited.com/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.scaleunlimited.com</link>
	<description>Hadoop, Solr and Cascading consulting and training</description>
	<lastBuildDate>Fri, 13 May 2011 20:33:48 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>Comment on Public Datasets by kkrugler</title>
		<link>http://www.scaleunlimited.com/datasets/public-datasets/comment-page-1/#comment-629</link>
		<dc:creator>kkrugler</dc:creator>
		<pubDate>Fri, 13 May 2011 20:33:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.scaleunlimited.com/?page_id=92#comment-629</guid>
		<description>Hi Arjun,

For valid domains, I would suggest using the Alexa and Quantcast top 1M domain lists.

I&#039;ve added you to the list of people to update when our dataset is ready.

Regards,

-- Ken</description>
		<content:encoded><![CDATA[<p>Hi Arjun,</p>
<p>For valid domains, I would suggest using the Alexa and Quantcast top 1M domain lists.</p>
<p>I&#8217;ve added you to the list of people to update when our dataset is ready.</p>
<p>Regards,</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Public Datasets by Arjun</title>
		<link>http://www.scaleunlimited.com/datasets/public-datasets/comment-page-1/#comment-628</link>
		<dc:creator>Arjun</dc:creator>
		<pubDate>Fri, 13 May 2011 06:07:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.scaleunlimited.com/?page_id=92#comment-628</guid>
		<description>Hi,

I am doing research on web mining, and for it I need a dataset of available web domains (website addresses updated till 2011/10).

Could you please reply to my email address and let me know where I could get it?

Thanking you,

Regards,
Arjun.</description>
		<content:encoded><![CDATA[<p>Hi,</p>
<p>I am doing research on web mining, and for it I need a dataset of available web domains (website addresses updated till 2011/10).</p>
<p>Could you please reply to my email address and let me know where I could get it?</p>
<p>Thanking you,</p>
<p>Regards,<br />
Arjun.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Announcing the Public Terabyte Dataset project by kkrugler</title>
		<link>http://www.scaleunlimited.com/2009/11/01/announcing-the-public-terabyte-dataset-project/comment-page-1/#comment-622</link>
		<dc:creator>kkrugler</dc:creator>
		<pubDate>Thu, 17 Mar 2011 00:12:03 +0000</pubDate>
		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=88#comment-622</guid>
		<description>Hi Ian,

Unfortunately the crawl is still tied to a pending release of some new functionality by Amazon, so we&#039;re still in a (very long) holding pattern. But there&#039;s some light at the end of the tunnel...

-- Ken</description>
		<content:encoded><![CDATA[<p>Hi Ian,</p>
<p>Unfortunately the crawl is still tied to a pending release of some new functionality by Amazon, so we&#8217;re still in a (very long) holding pattern. But there&#8217;s some light at the end of the tunnel&#8230;</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Announcing the Public Terabyte Dataset project by Ian Upright</title>
		<link>http://www.scaleunlimited.com/2009/11/01/announcing-the-public-terabyte-dataset-project/comment-page-1/#comment-621</link>
		<dc:creator>Ian Upright</dc:creator>
		<pubDate>Fri, 11 Mar 2011 07:27:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=88#comment-621</guid>
		<description>Any news on the status of this crawl? Any update on when it might be available? Thanks.</description>
		<content:encoded><![CDATA[<p>Any news on the status of this crawl? Any update on when it might be available? Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on First Sample of Public Terabyte Dataset by kkrugler</title>
		<link>http://www.scaleunlimited.com/2010/04/21/first-sample-of-public-terabyte-dataset/comment-page-1/#comment-57</link>
		<dc:creator>kkrugler</dc:creator>
		<pubDate>Mon, 18 Oct 2010 23:17:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=230#comment-57</guid>
		<description>Hi anonymous,

My mistake, the &quot;previous comments&quot; reference was for this page: http://bixolabs.com/datasets/public-terabyte-dataset-project/

As for restrictions on commercial use, that&#039;s going to be up to Amazon&#039;s lawyers, but in general your use of this publicly available web data is subject to the same conditions as if you had crawled it yourself. That means respecting copyrights, removing data upon notice by the owner, and so on.

Regards,

-- Ken</description>
		<content:encoded><![CDATA[<p>Hi anonymous,</p>
<p>My mistake, the &#8220;previous comments&#8221; reference was for this page: <a href="http://bixolabs.com/datasets/public-terabyte-dataset-project/" rel="nofollow">http://bixolabs.com/datasets/public-terabyte-dataset-project/</a></p>
<p>As for restrictions on commercial use, that&#8217;s going to be up to Amazon&#8217;s lawyers, but in general your use of this publicly available web data is subject to the same conditions as if you had crawled it yourself. That means respecting copyrights, removing data upon notice by the owner, and so on.</p>
<p>Regards,</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on First Sample of Public Terabyte Dataset by Anonymous</title>
		<link>http://www.scaleunlimited.com/2010/04/21/first-sample-of-public-terabyte-dataset/comment-page-1/#comment-56</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Mon, 18 Oct 2010 22:32:06 +0000</pubDate>
		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=230#comment-56</guid>
		<description>Thanks. But I can&#039;t find your comment where you explain the reason behind the delay. Anyway, I am desperately waiting for this data. Will there be any restrictions on using this data for commercial purposes when it is released?</description>
		<content:encoded><![CDATA[<p>Thanks. But I can&#8217;t find your comment where you explain the reason behind the delay. Anyway, I am desperately waiting for this data. Will there be any restrictions on using this data for commercial purposes when it is released?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on First Sample of Public Terabyte Dataset by kkrugler</title>
		<link>http://www.scaleunlimited.com/2010/04/21/first-sample-of-public-terabyte-dataset/comment-page-1/#comment-55</link>
		<dc:creator>kkrugler</dc:creator>
		<pubDate>Mon, 18 Oct 2010 16:54:56 +0000</pubDate>
		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=230#comment-55</guid>
		<description>Hi anonymous,

The actual dataset hasn&#039;t been released yet - see previous comments as to why, and my best guess re timing.

Regards,

-- Ken</description>
		<content:encoded><![CDATA[<p>Hi anonymous,</p>
<p>The actual dataset hasn&#8217;t been released yet &#8211; see previous comments as to why, and my best guess re timing.</p>
<p>Regards,</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on First Sample of Public Terabyte Dataset by anonymous</title>
		<link>http://www.scaleunlimited.com/2010/04/21/first-sample-of-public-terabyte-dataset/comment-page-1/#comment-52</link>
		<dc:creator>anonymous</dc:creator>
		<pubDate>Mon, 18 Oct 2010 03:15:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.scaleunlimited.com/?p=230#comment-52</guid>
		<description>Can someone tell me how to access the actual dataset rather than the sample file?</description>
		<content:encoded><![CDATA[<p>Can someone tell me how to access the actual dataset rather than the sample file?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Public Terabyte Dataset Project by Joseph Turian</title>
		<link>http://www.scaleunlimited.com/datasets/public-terabyte-dataset-project/comment-page-1/#comment-42</link>
		<dc:creator>Joseph Turian</dc:creator>
		<pubDate>Sat, 09 Oct 2010 15:59:07 +0000</pubDate>
		<guid isPermaLink="false">http://www.scaleunlimited.com/?page_id=101#comment-42</guid>
		<description>Ken,

Great! I am most excited to see the dataset when it is available. Let me know if you would like me to write followup posts to help you announce it.

I would also be interested in seeing a writeup of the technical challenges you encountered, and how you resolved them. But I&#039;m most eager to play with the actual data!</description>
		<content:encoded><![CDATA[<p>Ken,</p>
<p>Great! I am most excited to see the dataset when it is available. Let me know if you would like me to write followup posts to help you announce it.</p>
<p>I would also be interested in seeing a writeup of the technical challenges you encountered, and how you resolved them. But I&#8217;m most eager to play with the actual data!</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Public Terabyte Dataset Project by kkrugler</title>
		<link>http://www.scaleunlimited.com/datasets/public-terabyte-dataset-project/comment-page-1/#comment-41</link>
		<dc:creator>kkrugler</dc:creator>
		<pubDate>Wed, 15 Sep 2010 19:54:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.scaleunlimited.com/?page_id=101#comment-41</guid>
		<description>Hi Joseph,

I&#039;d love to release the dataset too :)

We ran into a cost issue using AWS&#039;s SimpleDB for large numbers of links (e.g. 1 billion), so we had to revert to storing the crawl state (aka CrawlDB) in Hadoop SequenceFiles. That, plus wanting to use spot instances for better pricing, meant re-working some of the (apparently abandoned) hadoop-ec2 scripts that come with Hadoop.

I think we should have a dataset (maybe not the full terabyte, but big) ready in a month, since we&#039;ve dealt with the above two issues.

Regards,

-- Ken</description>
		<content:encoded><![CDATA[<p>Hi Joseph,</p>
<p>I&#8217;d love to release the dataset too <img src='http://www.scaleunlimited.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>We ran into a cost issue using AWS&#8217;s SimpleDB for large numbers of links (e.g. 1 billion), so we had to revert to storing the crawl state (aka CrawlDB) in Hadoop SequenceFiles. That, plus wanting to use spot instances for better pricing, meant re-working some of the (apparently abandoned) hadoop-ec2 scripts that come with Hadoop.</p>
<p>I think we should have a dataset (maybe not the full terabyte, but big) ready in a month, since we&#8217;ve dealt with the above two issues.</p>
<p>Regards,</p>
<p>&#8211; Ken</p>
]]></content:encoded>
	</item>
</channel>
</rss>

