Public Datasets

This is a page where we list public datasets that we’ve used or come across. Comments, corrections, and additional data sources are welcome!

We use datasets for consulting projects, and when we need some juicy data for labs that are part of our big data training courses.

There’s also some slightly out-of-date information from an ACM event that you can find here.

We’ve also started a separate list of commercial datasets.

The information below is organized by the type of data – e.g. APIs vs. RSS feeds vs. data files vs. databases and so on.

Some of this information comes from other lists we’ve found, including:

Data Files

  • Wikipedia – complete data dump for the site, in MediaWiki data files. Under Creative Commons/GFDL. Lucene has some code for directly reading these files; otherwise, you can set up your own MediaWiki server for a local crawl.
  • Wikimedia – data dumps from all sites.
  • The Common Crawl web crawl corpus. 2 billion pages and counting!
  • ClueWeb09 – 1 billion page crawl (25TB) in 10 languages. Available to researchers for $750 (disk costs). CMU data license agreement.
  • IMDb – Database of movies. Text files available via FTP. Usage is restricted, but the exact terms are not clear from the web site text.
  • The Enron Email Dataset is available from CMU for analysis of social networks in email exchanges, etc.
  • Google Books Ngram Data, under the Creative Commons Attribution 3.0 Unported License.
  • NOAA Integrated Surface Data weather data.
  • DMOZ – Open Directory Project. XML file with lots of classified domains. Data is getting pretty stale, and lots of spam/adult links. Under Open Directory License. Yahoo has an augmented version? As does Google.
  • MovieLens – Movie recommendations from the GroupLens project @ Univ. of Minnesota. File containing 2K movies rated by 1K users, each rating 20 movies. Several other similar datasets are available from the same site.
  • Netflix Challenge – not sure if data files for challenge are still available.
  • OpenStreetMap – User-contributed world map data. Available for download from Planet.osm. Data is under the OpenStreetMap License.
  • SNAP – A general purpose network analysis and graph mining library. It has large network datasets that can be used with their library.
  • WebGraph – A framework to study the web graph. It provides simple ways to manage very large graphs, and sample datasets for use with their framework.
  • Stack Overflow – Dumps of their user-generated content. Under CC license.
  • Web Data Commons – Hyperlink Graph, generated from the Common Crawl dataset.
  • 53.5 billion clicks dataset. Available via disk drive for researchers only, restrictions apply.
  • Stanford University’s Large Network Dataset Collection.
  • UC Berkeley’s Big Data Benchmark dataset. The input data set consists of a set of unstructured HTML documents and two SQL tables which contain summary information. It was generated using Intel’s Hadoop benchmark tools and data sampled from the Common Crawl document corpus.
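
The Wikipedia and Wikimedia dumps listed above are MediaWiki XML exports, and the full files are too large to load into memory at once. If you don't want to use Lucene or a local MediaWiki server, a streaming parser is one option. The snippet below is a minimal sketch of that approach, not a complete dump reader: the `iter_pages` and `local_name` helpers and the tiny inline sample are our own illustration, and a real dump has more fields (and usually compression) than shown here.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element in a MediaWiki XML export."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # drop the page's children once it has been processed


# Tiny made-up sample standing in for a real dump file (assumption:
# the namespace version in a real dump may differ).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Hello [[world]]</text></revision>
  </page>
  <page>
    <title>Other</title>
    <revision><text>Second page</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

For a real dump you would pass an open file object (for example from `bz2.open`) instead of the in-memory sample. If a page carries several revisions, this sketch keeps only the text of the last one it sees.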

APIs

Note that for many of these, there are restrictions on the number of requests per day and on usage of the data.

  • Delicious – social network site for link sharing. Also XML-based Python API at pydelicious project.
  • Zillow – real estate information including real and estimated prices. XML-based API.
  • Zebo – lists of things people own and things people want. Useful in developing collaborative filtering/recommendation systems and clustering systems.
  • Kayak – XML-based API.
  • eBay – Online auction data.
  • Yahoo Finance – stock prices and trading volume. Useful in developing financial models predicting stock prices.
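
Since many of these APIs cap the number of requests per day, it can help to track your own usage on the client side before a crawl hits the limit. The following is a rough sketch only: `DailyQuota` is a hypothetical helper of ours, the limit of 2 is purely illustrative, and we assume the budget resets at midnight UTC, which may not match what any particular provider does.

```python
import time


class DailyQuota:
    """Track calls against a per-day request limit (hypothetical helper)."""

    def __init__(self, max_per_day, clock=time.time):
        self.max_per_day = max_per_day
        self.clock = clock  # injectable so the example can use a fake clock
        self.day = None
        self.used = 0

    def try_acquire(self):
        """Return True and count the request if today's budget allows it."""
        today = int(self.clock() // 86400)  # day number, assuming UTC reset
        if today != self.day:
            self.day = today
            self.used = 0
        if self.used >= self.max_per_day:
            return False
        self.used += 1
        return True


# Usage with a fake clock so the example is deterministic.
now = [0.0]
quota = DailyQuota(max_per_day=2, clock=lambda: now[0])
results = [quota.try_acquire() for _ in range(3)]
now[0] += 86400  # one day later, the budget resets
results.append(quota.try_acquire())
print(results)
```

In a real client you would call `try_acquire()` before each request and skip or delay the call when it returns False. Check each provider's terms for the actual limits.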

Databases

  • Freebase – open database of people, places and things.
  • FLOSSMole – has database of open source projects.
  • ImageNet – an image database organized according to the WordNet hierarchy. URLs are freely available, actual image data requires research license.

Web Pages

Below are some web sites that contain significant amounts of data that can be freely used.

  • CIA Factbook – The World Factbook provides information on the history, people, government, economy, geography, communications, transportation, military, and transnational issues for 266 world entities. Data is in the Public Domain.
  • Creative Commons Search – not the data itself, but a way to find data under Creative Commons license.