Storage on DASH

What storage is available on DASH?

SCRATCH SPACE:

Each execute node in the DASH cluster has 1.2TB of high-performance local SSD scratch storage mounted at /tmp. This scratch storage is not shared between nodes. Data can be staged into and out of /tmp to achieve higher IO performance during analysis, and users can also direct temporary file output produced during analysis to this scratch storage.
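
For example, a single-node batch job can stage its input into /tmp, run against the fast local copy, and copy results back to primary storage before exiting. The following is a minimal sketch; the paths, job parameters, and my_analysis program are placeholders, not real DASH names:

    #!/bin/bash
    #SBATCH --job-name=stage-example
    #SBATCH --nodes=1

    # Hypothetical input/output locations on primary (Lustre) storage.
    INPUT=/data/mylab/sample.dat
    OUTDIR=/data/mylab/results

    # Per-job scratch directory on the node's local SSD.
    SCRATCH=/tmp/${SLURM_JOB_ID}
    mkdir -p "$SCRATCH"

    # Remove the scratch directory when the job exits, even on failure.
    trap 'rm -rf "$SCRATCH"' EXIT

    # Stage in, compute against the local copy, stage results out.
    cp "$INPUT" "$SCRATCH/"
    ./my_analysis "$SCRATCH/sample.dat" > "$SCRATCH/output.dat"
    cp "$SCRATCH/output.dat" "$OUTDIR/"

The trap ensures the cleanup requested below happens automatically, even if the analysis step fails.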

For jobs that run across multiple nodes, Slurm's sbcast feature (https://slurm.schedmd.com/sbcast.html) can be used to stage data automatically from primary storage onto each node's fast local scratch storage, delivering improved IO performance at every node; see the sketch below.
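
As a sketch (again with placeholder paths, job parameters, and program names), a multi-node job might broadcast a shared input file to every allocated node before launching its tasks:

    #!/bin/bash
    #SBATCH --job-name=sbcast-example
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=1

    # Broadcast the input file from primary storage to /tmp on every
    # node in the allocation.
    sbcast /data/mylab/reference.dat /tmp/reference.dat

    # Each task reads its node-local copy instead of all tasks
    # reading the same file from Lustre.
    srun ./my_analysis /tmp/reference.dat

    # Clean up the node-local copies on every node.
    srun rm -f /tmp/reference.dat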

When making use of the local /tmp scratch storage on execute nodes, please be mindful of file persistence, and *clean up your content on /tmp after use* (the trap in the first sketch above is one way to do this automatically).

PRIMARY (PERMANENT) DATA STORAGE:

The DASH cluster offers 128TB of Lustre storage, housing both the /data and /home filesystem paths. This storage is shared across all nodes and serves as a transparent intermediary cache: data can be accessed with high IO performance on demand, then automatically tiered down to permanent blob storage once enough time has passed since last access. Blob storage is the Azure term for object storage, which is relatively inexpensive (though lower-performing) primary data storage. This cache/blob arrangement allows a relatively small Lustre caching layer to serve the DASH cluster's IO demands while providing an invisible bridge to the permanent blob storage on the back end.
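
To check the capacity and current usage of the Lustre layer from a node, the standard Lustre client utility can be pointed at either path:

    # Summarize capacity and usage of the Lustre filesystem that
    # backs /data and /home.
    lfs df -h /data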

NOTE: access to the backend blob storage is limited to DASH system administrators only.

Individual lab/project Storage Accounts and Blob containers can be requested separately via ServiceNow from the DHTS Storage Team. (These are NOT part of DASH.)


When data are written to Lustre, a policy engine called Robinhood immediately copies them to the permanent, backend blob storage. After copying to blob storage, Robinhood removes the data from Lustre based on rules involving file size and last access time. When Robinhood removes a file from Lustre, it leaves behind a minuscule stub file containing information about the removed content, so that Lustre can retrieve it from blob on demand in a process called "rehydration".
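
Robinhood typically drives this kind of tiering through Lustre's HSM (Hierarchical Storage Management) interface. Assuming DASH follows that common arrangement (an assumption about the implementation, not a documented DASH guarantee), the lfs client tool can report whether a file's data is resident in Lustre or has been released to blob:

    # Check a file's HSM state (the path is a placeholder). A
    # "released" flag means only the stub remains in Lustre, and the
    # data will be rehydrated from blob on first access.
    lfs hsm_state /data/mylab/sample.dat

    # Illustrative output for a tiered-out file (exact format varies
    # by Lustre version):
    #   /data/mylab/sample.dat: (0x0000000d) released exists archived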

Each transaction to rehydrate a file into Lustre incurs some IO wait. For an individual file, rehydration is nearly instantaneous, though small files naturally come back faster than very large ones. When the Lustre filesystem calls for many files to be rehydrated from blob at once, the per-transaction IO waits compound, and depending on the number of transactions, rehydrating many files may be noticeably slower than the near-instantaneous retrieval expected for an individual file.
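
For jobs that will read many tiered-out files, it can therefore help to trigger rehydration ahead of time rather than paying the compounded waits at read time. Under the same Lustre HSM assumption as above, a hedged sketch (paths are placeholders):

    # Queue restore requests for a batch of released files. The
    # requests are asynchronous, so the copies from blob proceed
    # while the script continues.
    lfs hsm_restore /data/mylab/run1/*.dat

    # Wait until no HSM action is still in flight for one of the
    # files (NOOP means Lustre has nothing pending for it).
    until lfs hsm_action /data/mylab/run1/chunk001.dat | grep -q NOOP; do
        sleep 10
    done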

You can think of this as analogous to a library with stacks of books (blob storage), a reading room (Lustre cache), and a card catalog (stub files). When you enter the reading room and ask for a book, the librarian uses the card catalog to determine that book's location in the stacks, then goes to that location to retrieve the book. If it's a very heavy book (very large file), the librarian may take longer to carry it to the reading room than a lighter book (small file). These librarians are pretty athletic though, so there's not much difference in the length of time it takes them to deliver a single heavy book versus a single light book.

When you ask for multiple books at once, multiple librarians appear (one for each book) and each one looks up a book's location in the stacks and goes to retrieve it. The weight of the individual books retrieved still matters, but there's an added consideration of how many librarians can pass through the aisles in the stacks at once. Ten librarians retrieving ten books will deliver those ten books to the reading room more quickly than a thousand librarians retrieving a thousand books.

The policies that determine how long Robinhood leaves a file in Lustre before removing it (leaving behind its stub file) are flexible. Administrators can tailor rules based on combinations of file size and last access time in Lustre. <DASH archive/release rules will be added here when implemented>