Getting Data onto DASH

Data files on the DASH cluster are located under /data, normally referred to as the scratch space, but on DASH this is considered "active" data.  Each lab has a data folder owned by the lab's grouper group.

$ ll /data
total 180
drwxrwxrwx 11 root     root                                          16384 Apr 11 15:57 ./
drwxr-xr-x 30 root     root                                           4096 Apr 20 06:59 ../
drwxrwx---  4 root     grouper-duke-group-manager-roles-allenlab     16384 Mar  2 06:26 allenlab/
drwxrwx---  2 owzar001 grouper-duke-group-manager-roles-owzarlab     16384 Mar 16 18:27 dcibioinformatics/
drwxrwx---  4 root     root                                          16384 Mar  2 06:20 itlab/
drwxrwx---  2 root     grouper-duke-group-manager-roles-mcdonnelllab 16384 Apr 11 15:57 mcdonnelllab/
drwxrwx---  6 owzar001 grouper-duke-group-manager-roles-owzarlab     16384 Mar  2 06:25 owzarlab/
drwxrwx---  5 root     grouper-duke-group-manager-roles-reddylab     16384 Mar 29 20:53 reddylab/
drwxrwx---  2 root     grouper-duke-group-manager-roles-sempowskilab 16384 Mar 18 07:36 sempowskilab/
drwxrwx---  5 root     grouper-duke-group-manager-roles-westlab      16384 Mar  2 06:31 westlab/
drwxrwx---  4 root     grouper-duke-group-manager-roles-wraylab      16384 Mar 30 20:52 wraylab/

Importing Data to DASH from Duke Systems Outside of Azure (e.g., HARDAC, DCC, laptop, server, etc.)

It is currently recommended to use rsync to upload files to DASH.  Use one of the following commands to upload folders/files to DASH:

Copy a folder and its contents:   rsync -vzhrP source-folder/ netid@dash.duhs.duke.edu:/data/labfolder

In the above example, rsync will copy the contents of "source-folder" to DASH at /data/labfolder.

Copy a single file:  rsync -vzhrP singlefile netid@dash.duhs.duke.edu:/data/labfolder

In the above example, the file "singlefile" will be uploaded to DASH at /data/labfolder.

Note the only difference between the two commands is the source argument and its trailing "/". A trailing "/" on a source folder tells rsync to transfer the folder's contents into the destination; without the trailing "/", rsync copies the named item itself (the folder, or a single file as in the second example) into the destination.
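For example, the trailing "/" determines where the folder ends up (same placeholder names as above):

Contents of source-folder land directly in /data/labfolder:   rsync -vzhrP source-folder/ netid@dash.duhs.duke.edu:/data/labfolder

The folder itself is created as /data/labfolder/source-folder:   rsync -vzhrP source-folder netid@dash.duhs.duke.edu:/data/labfolder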

Rsync options explained here:  https://manpages.ubuntu.com/manpages/focal/man1/rsync.1.html


NOTE: you can also scp to <NETID>@dash.duhs.duke.edu:/data/labfolder to move data into DASH.
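For example (using the same placeholder names as above):

Copy a folder and its contents with scp:   scp -r source-folder <NETID>@dash.duhs.duke.edu:/data/labfolder

Copy a single file with scp:   scp singlefile <NETID>@dash.duhs.duke.edu:/data/labfolder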


/data/share is a directory on DASH that can be used to share content and pipelines with other labs.
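For example, to make a pipeline folder available to other labs (a sketch; "my-pipeline" is a placeholder name):

$ cp -r /data/<labname>/my-pipeline /data/share/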

Importing Public Data Sets

(NOTE: DASH is not approved for use with NIH controlled-access data such as dbGaP. Please contact DASH support at groups-systems-hpc-dhts@duke.edu for alternatives.)

  1. Go to the link for the public data set (such as in a publication/journal article, etc.)
  2. Copy the address (URL) to the data.
  3. Create a directory under /data/lab_account into which to import the data; from the command line/terminal, type $mkdir directory_name.
  4. To copy the data into /data/lab_account/directory_name, change into that directory and type $wget followed by the URL of the data file.
  5. To see what is in the directory, type $ls from within /data/lab_account/directory_name.
  6. To see the content of a file, type $cat filename
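Putting the steps together, a minimal sketch (the directory name and URL are placeholders):

$ mkdir /data/lab_account/directory_name
$ cd /data/lab_account/directory_name
$ wget https://example.org/path/to/dataset.txt
$ ls
$ cat dataset.txt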


Importing Data from Amazon (AWS)

Install the AWS command line client:

On DASH, inside your home directory:

1. 'curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" '

2. 'unzip awscliv2.zip'

3. 'cd aws'

4. 'mkdir -p /home/<NetID>/local/aws'

5. './install -i /home/<NetID>/local/aws -b /home/<NetID>/bin'
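The same steps as a single copy-and-paste sequence (replace <NetID> with your NetID):

$ curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
$ unzip awscliv2.zip
$ cd aws
$ mkdir -p /home/<NetID>/local/aws
$ ./install -i /home/<NetID>/local/aws -b /home/<NetID>/bin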

Successful installation will show the path to the aws binary:

---
(base) dss13@dash1-exec-1:~/aws$ ./install -i /home/dss13/local/aws -b /home/dss13/bin
You can now run: /home/dss13/bin/aws --version
---

You can invoke the aws client from that explicit path, or if desired you can add the binary's location to your $PATH environment variable. To do so:

(on HARDAC) 'vim ~/.bash_profile'
(on DASH) 'vim ~/.profile'

At the bottom of the file in a new line, add:

export PATH=/home/<NetID>/bin:$PATH

Then log out and log back in so the edited profile takes effect.

If you've added the aws binary location to your $PATH, the command 'which aws' should return the same path shown at the end of the installation.

'aws --version' (shows the AWS CLI version)

'aws configure' (prompts for your AWS access key ID, secret access key, default region, and output format)
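For reference, 'aws configure' walks through four prompts like the ones below (the values shown are placeholders, not real credentials):

$ aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: ****************************************
Default region name [None]: us-east-1
Default output format [None]: json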


COPY to S3 bucket:

'aws s3 cp /data/<labname> s3://dh-dusom-biochem-hashimi --recursive'
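To copy data in the other direction, i.e., download from a bucket into your lab folder (the bucket prefix here is a placeholder):

'aws s3 cp s3://dh-dusom-biochem-hashimi/<prefix> /data/<labname> --recursive'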

Data Transfer from the Sequencing Genomics Technology (SGT) Core

The SGT Core occasionally delivers data via SFTP. We have opened egress port 22 to dnaseq3.gcb.duke.edu on the cluster login node so users can SFTP to the server:

  • Run the sftp command on the login node and enter the username and password provided in the Sequencing Core's email instructions

    $ sftp <username>@dnaseq3.gcb.duke.edu
    <username> password: 
    Connected to dnaseq3.gcb.duke.edu.
    sftp> get -r <data-source-path>/


    An example of a real use case and a realistic data transfer speed:

    $ sftp nicholson_7731@dnaseq3.gcb.duke.edu
    nicholson_7731@dnaseq3.gcb.duke.edu's password: 
    Connected to dnaseq3.gcb.duke.edu.
    
    sftp> get -r Wray_7731_220502B6/ time
    
    Fetching /sftp/nicholson_7731/Wray_7731_220502B6/ to time
    Retrieving /sftp/nicholson_7731/Wray_7731_220502B6
    /sftp/nicholson_7731/Wray_7731_220502B6/VASeq-00031 100%   96MB  35.3MB/s   00:02    
    /sftp/nicholson_7731/Wray_7731_220502B6/VASeq-00033 100%  102MB  39.1MB/s   00:02    
    /sftp/nicholson_7731/Wray_7731_220502B6/VASeq-00031 100%   48MB  37.5MB/s   00:01    
    /sftp/nicholson_7731/Wray_7731_220502B6/VASeq-00031 100%  104MB  39.0MB/s   00:02    
    /sftp/nicholson_7731/Wray_7731_220502B6/VASeq-00032 100%   95MB  37.3MB/s   00:02    
    /sftp/nicholson_7731/Wray_7731_220502B6/VASeq-00032 100%   56MB  31.3MB/s   00:01