Datasets from ACM Data Mining Unconference

At the ACM data mining unconference a while back (Nov 1st, 2009) there was an interesting session on open/public datasets led by Paul O’Rorke. I wound up being the scribe, so below are some notes from that discussion.

But before diving into the details, I’d like to point out some very complete lists of public datasets that have been compiled by Pete Skomoroch and others.

Now for the list discussed during the unconference…

APIs from Programming Collective Intelligence

Paul presented a partial list of data APIs from the book Programming Collective Intelligence: Building Smart Web 2.0 Applications. Many of these will overlap with Pete’s list mentioned above. Also note that many of these APIs restrict the number of requests per day and how the data may be used.

  • Delicious – social network site for link sharing. A Python wrapper for its XML-based API is available from the pydelicious project.
  • Zillow – real estate information including real and estimated prices. XML-based API.
  • Zebo – lists of things people own and things people want. Useful in developing collaborative filtering/recommendation systems and clustering systems.
  • Kayak – XML-based API.
  • eBay – Online auction data.
  • Yahoo Finance – stock prices and trading volume. Useful in developing financial models predicting stock prices.

Some participants mentioned that several of these APIs seem unreliable – depending on the day of the week, they may or may not work.

Another good source of data APIs is the Programmable Web site.

Data Files

Most of the sources listed below are data files that can be downloaded, though some require sneaker-net (the dataset is shipped on disks).

  • Wikipedia – complete data dump for the site, in MediaWiki data files. Under Creative Commons/GFDL. Lucene has some code for directly reading these files; otherwise, you can set up your own MediaWiki server for a local crawl.
  • Wikimedia – data dumps from all sites.
  • ClueWeb09 – 1 billion page crawl (25TB) in 10 languages. Available to researchers for $750 (disk costs). CMU data license agreement.
  • Public Terabyte Dataset – 1TB compressed data from large scale crawl of top 100K English domains. Work in progress. Hosted by Amazon in S3, freely available to EC2 users.
  • IMDb – Database of movies. Text files available via FTP. There are restrictions on usage, but the web site text does not make them clear.
  • Enron Emails – one attendee mentioned that the Enron email dataset is still available for analysis of social networks in email exchanges, etc.
  • DMOZ – Open Directory Project. XML file with lots of classified domains. Data is getting pretty stale, and there are lots of spam/adult links. Under Open Directory License. Yahoo may have an augmented version, as does Google.
  • MovieLens – Movie recommendations from the GroupLens project @ Univ. of Minnesota. File containing roughly 2K movies rated by roughly 1K users, each rating at least 20 movies. Several other similar datasets are available from the same site.
  • Netflix Challenge – not sure if data files for challenge are still available.
  • OpenStreetMap – User-contributed world map data. Available for download from Planet.osm. Data is under the OpenStreetMap License.
  • SNAP – A general purpose network analysis and graph mining library. It has large network datasets that can be used with their library.
  • WebGraph – A framework to study the web graph. It provides simple ways to manage very large graphs, and sample datasets for use with their framework.
  • Stack Overflow – Dumps of their user-generated content. Under CC license.
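As a rough illustration of working with the Wikipedia dump files without setting up a MediaWiki server, the sketch below streams `<page>` elements out of MediaWiki export XML using only the Python standard library. The `SAMPLE_DUMP` text and its namespace URI are made up for the example, and `iter_pages` and `local_name` are hypothetical helper names. A real dump has more fields per page, so treat this as a starting point rather than a complete reader.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up sample in the general shape of a MediaWiki export file.
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Data mining</title>
    <revision><text>Data mining is ...</text></revision>
  </page>
  <page>
    <title>Web crawler</title>
    <revision><text>A web crawler is ...</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) pairs one page at a time from an export stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            # Release the finished page's children and text.
            elem.clear()


pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP)))
```

For an actual dump, the same generator could be fed a file object such as `bz2.open(path, "rb")`, where `path` points at a downloaded, bz2-compressed dump file. Because pages are processed one at a time, the whole file never needs to be parsed into a single tree.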

Databases

  • Freebase – open database of people, places and things.
  • FLOSSMole – has database of open source projects.
  • ImageNet – an image database organized according to the WordNet hierarchy. URLs are freely available; the actual image data requires a research license.

Web Pages

Below are some web sites that contain significant amounts of data that can be freely used.

  • CIA Factbook – The World Factbook provides information on the history, people, government, economy, geography, communications, transportation, military, and transnational issues for 266 world entities. Data is in the public domain.
  • Creative Commons Search – not the data itself, but a way to find data under a Creative Commons license.

Comments, corrections, and additional data sources are welcome!