Full Dataset List / Suggest a Dataset
Below is a list of the datasets that we've acquired so far. The list is ever-growing: if you have data that you'd like to see included in this project, please fill out this form.
Artificial Intelligence / Machine Learning
Berkeley Self-Driving Data: Open-source video data from Berkeley's self-driving program, containing 100,000 videos that represent more than 1,000 hours of driving experience and more than 100 million frames.
Multimedia Commons: Collection of audio and visual features computed for the nearly 100 million Creative Commons-licensed Flickr images and videos in the YFCC100M dataset from Yahoo! Labs, along with ground-truth annotations for selected subsets.
NLP fast.ai: Some of the most important datasets for NLP, with a focus on classification, including IMDb, AG-News, Amazon Reviews (polarity and full), Yelp Reviews (polarity and full), Dbpedia, Sogou News (Pinyin), Yahoo Answers, Wikitext 2 and Wikitext 103, and ACL-2010 French-English 10^9 corpus.
Mevadata: Video data of human activity, both scripted and unscripted, collected with roughly 100 actors over several weeks. The data was collected with 29 cameras with overlapping and non-overlapping fields of view.
Google Ngrams: N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters.
FMA: The Free Music Archive (FMA) dataset, an open and easily accessible dataset suitable for evaluating several tasks in music information retrieval (MIR), a field concerned with browsing, searching, and organizing large music collections.
Project Gutenberg: Library of over 60,000 free eBooks. Choose among free epub and Kindle eBooks, download them or read them online.
Wikipedia: Multilingual, online encyclopedia created and maintained as an open collaboration project by a community of volunteer editors using a wiki-based editing system.
Offshore Leaks: Information on more than 785,000 offshore entities that are part of the Paradise Papers, the Panama Papers, the Offshore Leaks and the Bahamas Leaks investigations. The data links to people and companies in more than 200 countries and territories.
GDELT V2: This project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images and events driving our global society every second of every day.
Land Maps (LIDAR): The goal of the USGS 3D Elevation Program (3DEP) is to collect elevation data in the form of light detection and ranging (LiDAR) data over the conterminous United States, Hawaii, and the U.S. territories, with data acquired over an 8-year period.
OpenStreetMap: Built by a community of mappers that contribute and maintain data about roads, trails, cafés, railway stations, and much more, all over the world.
OpenAddress: A global collection of address data sources, open and free-to-use.
Google Landmarks: The second version of the Google Landmarks dataset (GLD-v2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index, and test.
GNOMAD v3 (EXAC): The Genome Aggregation Database (gnomAD) is a coalition of investigators seeking to aggregate and harmonize exome and genome sequencing data from a variety of large-scale sequencing projects, and to make summary data available for the wider scientific community.
ENCODE: The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
OpenNeuro: Database of openly available brain imaging data. The data are shared according to a Creative Commons CC0 license, providing a broad range of brain imaging data to researchers and citizen scientists alike. The database primarily focuses on functional magnetic resonance imaging (fMRI) data, but also includes other imaging modalities, including structural and diffusion MRI, electroencephalography (EEG), and magnetoencephalography (MEG).
1000 Genomes: International collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.
Landsat 8: An ongoing collection of satellite imagery of all land on Earth produced by the Landsat 8 satellite. Landsat 8 is a collaboration between NASA and the United States Geological Survey (USGS).
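As a small illustration of the n-gram definition given in the Google Ngrams entry above, the sketch below builds word n-grams from a short sentence and counts them. It is only an illustration of the concept: the function name `ngrams` and the sample sentence are hypothetical, and the code does not read or reproduce the file format of the Google Books n-gram data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L yields L - n + 1 n-grams, or none if L < n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample text, split on whitespace into words.
words = "the quick brown fox jumps over the quick brown dog".split()

# A 5-gram contains five consecutive words.
five_grams = ngrams(words, 5)

# Counting 2-grams shows which word pairs repeat in the sample.
bigram_counts = Counter(ngrams(words, 2))

if __name__ == "__main__":
    print(five_grams[0])
    print(bigram_counts.most_common(2))
```

Here the items are whole words; passing a string instead of a word list would yield character n-grams, which matches the "words or characters" wording in the entry.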