
HPC Storage

The NYU HPC clusters are served by a General Parallel File System (GPFS) cluster and an all-flash VAST storage cluster.

The NYU HPC team supports data storage, transfer, and archival needs on the HPC clusters, as well as collaborative research services like the Research Project Space (RPS).

Highlights

  • 9.5 PB Total GPFS Storage
    • Up to 78 GB per second read speeds
    • Up to 650k input/output operations per second (IOPS)
  • Research Project Space (RPS): RPS volumes provide working spaces for sharing data and code amongst project or lab members

Introduction to HPC Data Management

The NYU HPC Environment provides access to a number of file systems to better serve the needs of researchers managing data during the various stages of the research data lifecycle (data capture, analysis, archiving, etc.). Each HPC file system comes with different features, policies, and availability.

In addition, a number of data management tools are available for data transfers and data sharing, along with recommended best practices and various scenarios and use cases for managing data in the HPC Environment.

Multiple public data sets are available to all users of the HPC environment, such as a subset of The Cancer Genome Atlas (TCGA), the Million Song Database, ImageNet, and Reference Genomes.

Below is a list of file systems with their characteristics and a summary table. Reviewing the available file systems and the scenarios/use cases presented below can help you select the right file system for a research project. As always, if you have any questions about data storage in the HPC environment, you can request a consultation with the HPC team by sending email to hpc@nyu.edu.

Data Security Warning

warning

Moderate Risk Data - HPC Approved

  • The HPC Environment has been approved for storing and analyzing Moderate Risk research data, as defined in the NYU Electronic Data and System Risk Classification Policy.
  • High Risk research data, such as data that includes Personally Identifiable Information (PII), electronic Protected Health Information (ePHI), or Controlled Unclassified Information (CUI), should NOT be stored in the HPC Environment.
note

Only the Office of Sponsored Projects (OSP) and the Global Office of Information Security (GOIS) are empowered to classify the risk categories of data.

tip

High Risk Data - Secure Research Data Environments (SRDE) Approved

Because the HPC system is not approved for High Risk data, we recommend using an approved system like the Secure Research Data Environments (SRDE).

Data Storage options in the HPC Environment

User Home Directories

Every individual user has a home directory (under /home/$USER, environment variable $HOME) for permanently storing code and important configuration files. Home directories provide limited storage space (50 GB) and a limited number of inodes (30,000 files) per user. Users can check their quota utilization with the myquota command.
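
For example, on a Greene login node (a minimal sketch; myquota is the supported tool, and du is a generic alternative for a rough manual check):

# Report current disk space and inode usage across the HPC file systems
myquota
# Rough manual check of the total space used under your home directory
du -sh $HOME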

User home directories are backed up daily and old files under $HOME are not purged.

User home directories are available on all HPC clusters (Greene) and on every cluster node (login nodes, compute nodes), as well as on the Data Transfer Node (gDTN).

warning

Avoid changing file and directory permissions in your home directory to allow other users to access files.

User Home Directories are not ideal for sharing files and folders with other users. HPC Scratch or Research Project Space (RPS) are better file systems for sharing data.
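
For example, a project directory on /scratch can be shared with members of a Unix group instead (a minimal sketch; the group name my_lab is hypothetical, and running `groups` lists the groups you belong to):

# Create a shared project directory under your scratch space
mkdir -p /scratch/$USER/shared_project
# Give group members read access (the group name "my_lab" is hypothetical)
chgrp -R my_lab /scratch/$USER/shared_project
chmod -R g+rX /scratch/$USER/shared_project
# Allow group members to traverse your top-level scratch directory
chmod g+x /scratch/$USER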

warning

One of the most common issues users report regarding their home directories is running out of inodes, i.e., the number of files stored under their home directory exceeds the inode limit, which by default is 30,000 files. This typically occurs when users install software under their home directories, for example when working with Conda or Julia environments, which involve many small files.

tip
  • To find out your current space and inode quota utilization and the distribution of files under your home directory, please see: Understanding user quota limits and the myquota command.
  • Working with Conda environments: To avoid running out of inodes in your home directory, the HPC team recommends setting up Conda environments with Singularity overlay images. A quick way to see where your inodes are going is sketched below.
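
To see which directories under your home hold the most files (a minimal sketch using standard tools; hidden directories created by Conda or Julia environments are common culprits):

# Count entries per top-level directory (including hidden ones) under $HOME
find "$HOME" -mindepth 1 -maxdepth 1 -type d | while read -r d; do
    printf "%8d  %s\n" "$(find "$d" | wc -l)" "$d"
done | sort -rn | head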

HPC Scratch

The HPC scratch file system is where most users store research data needed during the analysis phase of their research projects. The scratch file system provides temporary storage for datasets needed for running jobs.

Files stored in the HPC scratch file system are subject to the HPC Scratch old file purging policy: Files on the /scratch file system that have not been accessed for 60 or more days will be purged.
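
To preview which of your files would currently fall under this policy (a minimal sketch; listing files this way reads only their metadata and does not update access times):

# List files under your scratch directory not accessed in more than 60 days
find /scratch/$USER -type f -atime +60 -ls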

Every user has a dedicated scratch directory (/scratch/$USER) with a 5 TB disk quota and a limit of 1,000,000 inodes (files).

The scratch file system is available on all nodes (compute, login, etc.) on Greene, as well as on the Data Transfer Node (gDTN).

warning

There are NO backups of the scratch file system. Files that are deleted accidentally or lost due to storage system failures CANNOT be recovered.

tip
  • Since there are no backups of the HPC Scratch file system, users should not put important source code, scripts, libraries, or executables in /scratch. These important files should be stored in file systems that are backed up, such as /home or Research Project Space (RPS). Code can also be stored in a git repository.
  • Old file purging policy on HPC Scratch: All files on the HPC Scratch file system that have not been accessed for more than 60 days will be removed. It is a policy violation to use scripts to change the file access time. Any user found violating this policy will have their HPC account locked; a second violation may result in the account being disabled.
  • To find out your current disk space and inode quota utilization and the distribution of files under your scratch directory, please see: Understanding user quota limits and the myquota command.
  • Once a research project completes, users should archive their important files in the HPC Archive file system.

HPC Vast

The HPC Vast all-flash file system is where users store research data needed during the analysis phase of their research projects, particularly I/O-intensive data that can bottleneck on the scratch file system. The Vast file system provides temporary storage for datasets needed for running jobs.

Files stored in the HPC Vast file system are subject to the HPC Vast old file purging policy: Files on the /vast file system that have not been accessed for 60 or more days will be purged.

Every user has a dedicated Vast directory (/vast/$USER) with a 2 TB disk quota and a limit of 5,000,000 inodes (files).

The Vast file system is available on all nodes (compute, login, etc.) on Greene, as well as on the Data Transfer Node (gDTN).

warning

There are NO backups of the Vast file system. Files that are deleted accidentally or lost due to storage system failures CANNOT be recovered.

tip
  • Since there are no backups of the HPC Vast file system, users should not put important source code, scripts, libraries, or executables in /vast. These important files should be stored in file systems that are backed up, such as /home or Research Project Space (RPS). Code can also be stored in a git repository.
  • Old file purging policy on HPC Vast: All files on the HPC Vast file system that have not been accessed for more than 60 days will be removed. It is a policy violation to use scripts to change the file access time. Any user found violating this policy will have their HPC account locked; a second violation may result in the account being disabled.
  • To find out your current disk space and inode quota utilization and the distribution of files under your vast directory, please see: Understanding user quota limits and the myquota command.
  • Once a research project completes, users should archive their important files in the HPC Archive file system.

HPC Research Project Space

The HPC Research Project Space (RPS) provides data storage space for research projects that is easily shared amongst collaborators, backed up, and not subject to the old file purging policy. HPC RPS was introduced to ease data management in the HPC environment and eliminate the need to frequently copy files between the Scratch and Archive file systems by keeping all project files in one area. These benefits come at a cost, which is determined by the allocated disk space and the number of files (inodes).

HPC Work

The HPC team makes available a number of public datasets that are commonly used in analysis jobs. The datasets are available read-only under /scratch/work/public.
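
For example (the dataset subdirectory and script names below are hypothetical; see the Datasets page for the actual layout):

# Browse the read-only public dataset area
ls /scratch/work/public
# Reference a dataset directly from a job script instead of copying it
DATASET=/scratch/work/public/some_dataset   # hypothetical path
python my_analysis.py --data "$DATASET"     # hypothetical script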

For some datasets, users must provide a signed usage agreement before access is granted.

Public datasets available on the HPC clusters can be viewed on the Datasets page.

HPC Archive

Once the analysis stage of the research data lifecycle has completed, HPC users should tar their data and code into a single tar (or compressed tar.gz) file and then copy the file to their archive directory (/archive/$USER). The HPC Archive file system is not accessible by running jobs; it is suitable for long-term data storage. Each user has a default disk quota of 2 TB and a limit of 20,000 inodes (files). The rather low limit on the number of inodes per user is intentional. The archive file system is available only on the login nodes of Greene and is backed up daily.

  • Here is an example tar command that combines the data in a directory named my_run_dir under $SCRATCH and outputs the tar file in the user's $ARCHIVE:
# to archive `$SCRATCH/my_run_dir`
tar cvf $ARCHIVE/simulation_01.tar -C $SCRATCH my_run_dir
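
To verify the archive and restore it later (a minimal sketch using the same file names as above):

# List the contents of the archived tar file without extracting it
tar tvf $ARCHIVE/simulation_01.tar
# Restore the directory back to scratch when the data is needed again
tar xvf $ARCHIVE/simulation_01.tar -C $SCRATCH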

NYU (Google) Drive

Google Drive (NYU Drive) is accessible from the NYU HPC environment and provides an option for users who wish to archive data or share data with external collaborators who do not have access to the NYU HPC environment.

Currently (March 2021) there is no limit on the amount of data a user can store on Google Drive and there is no cost associated with storing data on Google Drive (although we hear rumors that free storage on Google Drive may be ending soon).

However, data transfer rates to and from Google Drive are limited, so moving many small files to Google Drive is not efficient.
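
One way to avoid the small-file penalty is to bundle files into a single archive before transferring them (a minimal sketch; the rclone remote name nyu_gdrive is hypothetical, and the instructions linked below describe the supported transfer tools):

# Bundle many small files into one compressed archive first
tar czvf $SCRATCH/results_bundle.tar.gz -C $SCRATCH my_results_dir
# Upload the single archive, assuming an rclone remote named "nyu_gdrive" (hypothetical)
rclone copy $SCRATCH/results_bundle.tar.gz nyu_gdrive:project_archive/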

Please read the Instructions on how to use cloud storage within the NYU HPC Environment.

HPC Storage Mounts Comparison Table

Please see the next page for best practices for data management on NYU HPC systems.