Datasets Available
General
The HPC team provides a number of public datasets that are commonly used in analysis jobs. The datasets are available read-only under
/scratch/work/public/ml-datasets/
/vast/work/public/ml-datasets/
We recommend using the version stored under /vast (when available) for better read performance.
For some datasets, users must provide a signed usage agreement before access is granted.
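As a minimal sketch of the recommendation above, a job script could pick the /vast copy when it exists and fall back to /scratch otherwise. The function name pick_first_dir and the variable DATA_ROOT are hypothetical names chosen for illustration; only the two directory paths come from this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first argument that is an existing directory.
# Returns non-zero if none of the candidates exists.
pick_first_dir() {
    local candidate
    for candidate in "$@"; do
        if [ -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Prefer the /vast copy for read performance, fall back to /scratch.
DATA_ROOT=$(pick_first_dir \
    /vast/work/public/ml-datasets \
    /scratch/work/public/ml-datasets) \
    || echo "no public dataset directory found on this host" >&2
echo "DATA_ROOT=${DATA_ROOT:-}"
```

The same helper can be called with per-dataset paths (for example the two imagenet directories listed further down), since not every dataset is present on both file systems.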
Format
Many datasets are available as '.sqf' files, which can be used as read-only overlays with Singularity. For example, to use the COCO dataset, one can run the following commands:
$ singularity exec \
--overlay /<path>/pytorch1.8.0-cuda11.1.ext3:ro \
--overlay /vast/work/public/ml-datasets/coco/coco-2014.sqf:ro \
--overlay /vast/work/public/ml-datasets/coco/coco-2015.sqf:ro \
--overlay /vast/work/public/ml-datasets/coco/coco-2017.sqf:ro \
/scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif /bin/bash
$ singularity exec \
--overlay /<path>/pytorch1.8.0-cuda11.1.ext3:ro \
--overlay /vast/work/public/ml-datasets/coco/coco-2014.sqf:ro \
--overlay /vast/work/public/ml-datasets/coco/coco-2015.sqf:ro \
--overlay /vast/work/public/ml-datasets/coco/coco-2017.sqf:ro \
/scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif find /coco | wc -l
532896
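When a dataset is split across several '.sqf' files, as COCO is above, the --overlay flags can be generated instead of typed by hand. The following is only a sketch: the function name collect_overlays and the array OVERLAY_ARGS are hypothetical, and the final command is echoed for inspection rather than executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the global array OVERLAY_ARGS with one
# read-only --overlay flag per .sqf file found in the given directory.
collect_overlays() {
    local dir=$1 sqf
    OVERLAY_ARGS=()
    for sqf in "$dir"/*.sqf; do
        [ -e "$sqf" ] || continue   # the glob matched nothing
        OVERLAY_ARGS+=(--overlay "${sqf}:ro")
    done
}

collect_overlays /vast/work/public/ml-datasets/coco

# Print the resulting command; remove the leading "echo" to run it.
echo singularity exec "${OVERLAY_ARGS[@]}" \
    /scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif /bin/bash
```

Keeping the flags in an array avoids word-splitting problems if a path ever contains spaces. An overlay with your own software environment (such as the pytorch .ext3 file in the example above) would still need to be added separately.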
Data Sets
COCO Dataset
About data set: https://cocodataset.org/
Common Objects in Context (COCO) is a large-scale object detection, segmentation, and captioning dataset.
Dataset is available under
/scratch
/scratch/work/public/ml-datasets/coco/coco-2014.sqf
/scratch/work/public/ml-datasets/coco/coco-2015.sqf
/scratch/work/public/ml-datasets/coco/coco-2017.sqf
/vast
/vast/work/public/ml-datasets/coco/coco-2014.sqf
/vast/work/public/ml-datasets/coco/coco-2015.sqf
/vast/work/public/ml-datasets/coco/coco-2017.sqf
ImageNet and ILSVRC
About data set: ImageNet (image-net.org)
ImageNet is an image dataset organized according to the WordNet hierarchy (Miller, 1995). Each concept in WordNet, possibly described by multiple words or word phrases, is called a “synonym set” or “synset”. ImageNet populates 21,841 synsets of WordNet with an average of 650 manually verified and full resolution images. As a result, ImageNet contains 14,197,122 annotated images organized by the semantic hierarchy of WordNet (as of August 2014). ImageNet is larger in scale and diversity than the other image classification datasets (https://arxiv.org/abs/1409.0575).
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept (https://wordnet.princeton.edu/)
ILSVRC (subset of ImageNet)
ILSVRC uses a subset of ImageNet images for training the algorithms and some of ImageNet’s image collection protocols for annotating additional images for testing the algorithms (https://arxiv.org/abs/1409.0575). The name stands for 'ImageNet Large Scale Visual Recognition Challenge'. The competition has since moved to Kaggle (http://image-net.org/challenges/LSVRC/2017/).
What is included (https://arxiv.org/abs/1409.0575).
- 1000 object classes
- approximately 1.2 million training images
- 50 thousand validation images
- 100 thousand test images
- Size of data is about 150 GB (for train and validation)
Dataset is available under
/scratch/work/public/ml-datasets/imagenet
/vast/work/public/ml-datasets/imagenet
Get access to Data
New York University does not own this dataset.
Please open the ImageNet site, find the terms of use (http://image-net.org/download), copy them, replace the needed parts with your name, and send us an email containing the terms with your name, thereby confirming that you agree to these terms. Once you do this, we can grant you access to the copy of the dataset on the cluster.
Million Song Dataset
About data set: https://labrosa.ee.columbia.edu/millionsong/
Dataset is available under
/scratch/work/public/MillionSongDataset
/vast/work/public/ml-datasets/millionsongdataset/
Twitter Decahose
About data set: https://developer.twitter.com/en/docs/twitter-api/enterprise/decahose-api/overview/decahose
NYU has a subscription to the Twitter Decahose, a 10% random sample of the real-time Twitter Firehose delivered through a streaming connection.
Data are stored in GCP cloud (BigQuery) and on HPC clusters Greene and Peel (Parquet format).
Please contact Megan Brown at The Center for Social Media & Politics to get access to data and learn the tools available to work with it.
On the cluster, the dataset is available under the following path (provided you have permission):
/scratch/work/twitter_decahose/
ProQuest Congressional Record
About data set: ProQuest Congressional Record
The ProQuest Congressional Record text-as-data collection consists of machine-readable files capturing the full text and a small number of metadata fields for a full run of the Congressional Record between 1789 and 2005. Metadata fields include the date of publication, subjects (for issues for which such information exists in the ProQuest system), and URLs linking the full text to the canonical online record for that issue on the ProQuest Congressional platform. A total of 31,952 issues are available.
Dataset is available under:
/scratch/work/public/proquest/
C4
About data set: c4 | TensorFlow Datasets
A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: https://commoncrawl.org
Dataset is available under
/scratch/work/public/ml-datasets/c4
/vast/work/public/ml-datasets/c4
GQA
About data set: GQA: Visual Reasoning in the Real World (stanford.edu)
Question Answering on Image Scene Graphs
Dataset is available under
/scratch/work/public/ml-datasets/gqa
/vast/work/public/ml-datasets/gqa
MJSynth
About data set: Visual Geometry Group - University of Oxford
This is a synthetically generated dataset that was found to be sufficient for training text recognition on real-world images.
This dataset consists of 9 million images covering 90k English words, and includes the training, validation and test splits used in the author's work (archived dataset is about 10 GB)
Dataset is available under
/vast/work/public/ml-datasets/mjsynth
Open Images Dataset
About data set: Open Images Dataset – opensource.google
A dataset of ~9 million varied images with rich annotations
The images are very diverse and often contain complex scenes with several objects (8.4 per image on average). The dataset contains image-level label annotations, object bounding boxes, object segmentations, visual relationships, localized narratives, and more.
Dataset is available under
/scratch/work/public/ml-datasets/open-images-dataset
/vast/work/public/ml-datasets/open-images-dataset
Pile
About data set: The Pile (eleuther.ai)
The Pile is an 825 GiB diverse, open source language modeling dataset that consists of 22 smaller, high-quality datasets combined together.
Dataset is available under
/scratch/work/public/ml-datasets/pile
/vast/work/public/ml-datasets/pile
Waymo Open Dataset
About data set: Open Dataset – Waymo
The field of machine learning is changing rapidly. Waymo is in a unique position to contribute to the research community with some of the largest and most diverse autonomous driving datasets ever released.
Dataset is available under
/vast/work/public/ml-datasets/waymo_open_dataset_scene_flow
/vast/work/public/ml-datasets/waymo_open_dataset_v_1_2_0_individual_files
/vast/work/public/ml-datasets/waymo_open_dataset_v_1_3_2_individual_files
/vast/work/public/ml-datasets/waymo_open_dataset_v_1_4_1_individual_files