Data Movement

Latest revision as of 15:49, 27 April 2021

NOTE: If you find better or faster methods or tools, please add them to this page as well.

There are various Linux native commands that you can use to move your data within the HPC cluster, such as mv, cp, scp etc. One of the most powerful tools for data movement on Linux is rsync, which we'll be using in our examples below.

rsync and scp can also be used for moving data from local storage to Cheaha.
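
As a rough sketch of the scp case, the function below uploads a file or directory to Cheaha. The function name upload_with_scp, the SCP override variable, and the example paths are hypothetical and only there for illustration; the hostname is the one used elsewhere on this page.

```shell
# Hypothetical helper: upload a file or directory to Cheaha with scp.
# -r copies directories recursively.
# Set SCP="echo scp" to print the command instead of running it.
upload_with_scp() {
    blazerid=$1       # your BlazerID
    local_path=$2     # file or directory on your local system
    remote_path=$3    # destination path on Cheaha
    ${SCP:-scp} -r "$local_path" "$blazerid@cheaha.rc.uab.edu:$remote_path"
}

# Example (placeholders, not real paths):
#   upload_with_scp BLAZERID ./results /data/user/BLAZERID/
```

For anything beyond a quick one-off copy, rsync is usually the better fit because an interrupted transfer can be resumed by running the same command again.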

For moving large volumes of data to and from the cluster we also recommend the Globus service.

Globus

Please see the dedicated page on moving data with Globus. This is a good option if the site or user you are collaborating with has Globus, but it may require installing a transfer agent on the user's computer.

Rclone

Here is an example video (https://www.youtube.com/watch?v=UbFJV9TO4KE) showing how to set up rclone for box.com.

To transfer data from Cheaha to Box, open a terminal (inside the VNC session) and load the rclone module:

module load rclone/1.48.0

The initial setup for Box involves getting a token from Box, which rclone config walks you through. Here is an example of how to make a remote called remote. First run:

 rclone config

This will guide you through an interactive setup process:

No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config
n/s/q> n
name> remote
Type of storage to configure.
Choose a number from below, or type in your own value
[snip]
XX / Box
   \ "box"
[snip]
Storage> box
Box App Client Id - leave blank normally.
client_id> 
Box App Client Secret - leave blank normally.
client_secret> 
Remote config
Use auto config?
 * Say Y if not sure
 * Say N if you are working on a remote or headless machine
y) Yes
n) No
y/n> y
If your browser doesn't open automatically go to the following link: http://127.0.0.1:53682/auth
Log in and authorize rclone for access
Waiting for code...
Got code
--------------------
[remote]
client_id = 
client_secret = 
token = {"access_token":"XXX","token_type":"bearer","refresh_token":"XXX","expiry":"XXX"}
--------------------
y) Yes this is OK
e) Edit this remote
d) Delete this remote
y/e/d> y

rclone ls: lists the objects in the path with their size and path.

rclone ls remote:path [flags]

rclone copy: copies files from source to dest, skipping files that have already been copied. Note: use the -P/--progress flag to view real-time transfer statistics.

rclone copy source:sourcepath dest:destpath
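
As an illustration of the two commands above, the sketch below wraps an upload to Box in a small shell function. It assumes the remote was named remote, as in the configuration walkthrough; the function name box_upload, the RCLONE override variable, and the example paths are hypothetical.

```shell
# Hypothetical helper: copy a local directory into a folder on the Box remote.
# Assumes the remote was named "remote" during `rclone config`.
# Set RCLONE="echo rclone" to print the command instead of running it.
box_upload() {
    local_dir=$1      # directory on Cheaha to upload
    box_folder=$2     # destination folder inside Box
    ${RCLONE:-rclone} copy -P "$local_dir" "remote:$box_folder"
}

# Example (placeholders, not real paths):
#   box_upload /data/user/$USER/results results
#   rclone ls remote:results      # confirm that the files arrived
```

Because rclone copy skips files that are already present at the destination, running the same call again after an interruption only transfers what is missing.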

RSync

To find out more about any of the tools mentioned above, such as their flags and usage, you can use man TOOL_NAME.

[build@c0051 ~]$ man rsync

NAME
       rsync - a fast, versatile, remote (and local) file-copying tool

SYNOPSIS
       Local:  rsync [OPTION...] SRC... [DEST]

       Access via remote shell:
         Pull: rsync [OPTION...] [USER@]HOST:SRC... [DEST]
         Push: rsync [OPTION...] SRC... [USER@]HOST:DEST

       Access via rsync daemon:
         Pull: rsync [OPTION...] [USER@]HOST::SRC... [DEST]
               rsync [OPTION...] rsync://[USER@]HOST[:PORT]/SRC... [DEST]
         Push: rsync [OPTION...] SRC... [USER@]HOST::DEST
               rsync [OPTION...] SRC... rsync://[USER@]HOST[:PORT]/DEST

       Usages with just one SRC arg and no DEST arg will list the source files
       instead of copying.

DESCRIPTION
 .
 .
 .

If you are interested in the various methods and tools for moving data, this page provides a very good guide: How to transfer large amounts of data via network (http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html).

Moving local data to Cheaha

To move data from your local storage to cheaha, using rsync, you can use the following command, on your local system:

rsync -aP PATH_TO_FILE_OR_DIRECTORY BLAZERID@cheaha.rc.uab.edu:PATH_ON_CHEAHA

where,

  • PATH_TO_FILE_OR_DIRECTORY is the full or relative path to the file or directory that you want to move to Cheaha. If transferring a directory, a trailing slash on the source avoids creating an additional directory level at the destination. You can think of a trailing / on a source as meaning "copy the contents of this directory" as opposed to "copy the directory by name".
  • PATH_ON_CHEAHA is the full path on Cheaha where you want to move your data. Remember, you have 20GB of space in your HOME directory, so if you are moving raw data to Cheaha, please move it to either /data/user/$BLAZERID or /data/scratch/$BLAZERID.
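
The trailing-slash rule can be sketched with a small pure-shell function that predicts where the files of a source directory end up. It only manipulates strings and does not call rsync; the function name rsync_landing_dir and the example paths are hypothetical.

```shell
# Hypothetical helper: print the directory in which the files of a source
# directory would land after `rsync -a SRC DEST`, following rsync's
# trailing-slash rule. Pure string handling; nothing is copied.
rsync_landing_dir() {
    src=$1
    dest=${2%/}                      # normalise: drop one trailing slash from DEST
    case "$src" in
        */) printf '%s\n' "$dest" ;;                           # "copy the contents of this directory"
        *)  printf '%s/%s\n' "$dest" "$(basename "$src")" ;;   # "copy the directory by name"
    esac
}

rsync_landing_dir results/ /data/user/BLAZERID   # prints /data/user/BLAZERID
rsync_landing_dir results  /data/user/BLAZERID   # prints /data/user/BLAZERID/results
```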

Moving data on Cheaha to your local storage

To move data from cheaha to your local storage, using rsync, you can use the following command, on your local system:

rsync -aP BLAZERID@cheaha.rc.uab.edu:PATH_TO_FILE_OR_DIRECTORY_ON_CHEAHA PATH_ON_LOCAL_STORAGE

where,

  • PATH_TO_FILE_OR_DIRECTORY_ON_CHEAHA is the full path to the file or directory on Cheaha that you want to move to your local storage. If transferring a directory, a trailing slash on the source avoids creating an additional directory level at the destination. You can think of a trailing / on a source as meaning "copy the contents of this directory" as opposed to "copy the directory by name".
  • PATH_ON_LOCAL_STORAGE is the full or relative path on your local system where you want to move your data.

Privacy

Do not store sensitive information on this filesystem. It is not encrypted. Note that your data will be stored on the cluster filesystem, and while not accessible to ordinary users, it could be accessible to the cluster administrator(s).

Moving large amounts of data on Cheaha

If the data that you are moving is large, then you should always use either an interactive session or a job script for your data movement. This ensures that the process for your data movement isn't using and slowing login nodes for a long time, and instead is performing these operations on a compute node.

Interactive session

  • Start an interactive session using srun
srun --ntasks=1 --mem-per-cpu=1024 --time=08:00:00 --partition=medium --job-name=DATA_TRANSFER --pty /bin/bash

NOTE: Please change the time required and the corresponding partition according to your need.

  • Once you have moved from login001 to a c00XX compute node, start the transfer with rsync:
[build@c0051 Salmon]$ rsync -aP SOURCE_PATH DESTINATION_PATH
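
Before starting a long copy, it can be useful to preview it. The sketch below uses rsync's --dry-run and --stats options to report what would be transferred without copying anything. The function name preview_transfer and the RSYNC override variable are hypothetical.

```shell
# Hypothetical helper: show what `rsync -a` would transfer, without copying.
# --dry-run performs a trial run; --stats prints file counts and total size.
# Set RSYNC="echo rsync" to print the command instead of running it.
preview_transfer() {
    src=$1
    dest=$2
    ${RSYNC:-rsync} -a --dry-run --stats "$src" "$dest"
}

# Example (placeholders, as in the command above):
#   preview_transfer SOURCE_PATH DESTINATION_PATH
```

The reported total size can help when choosing the time limit and partition for the real transfer.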

Job Script

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#SBATCH --ntasks=1
#SBATCH --partition=express
#
# Time format = HH:MM:SS, DD-HH:MM:SS
#
#SBATCH --time=10:00
#
# Minimum memory required per allocated CPU in megabytes.
#
#SBATCH --mem-per-cpu=2048
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=YOUR_EMAIL_ADDRESS

rsync -aP SOURCE_PATH DESTINATION_PATH

NOTE:

  • Please change the time required and the corresponding partition according to your need.
  • After modifications to the given job script, submit it using : sbatch JOB_SCRIPT
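
One way to avoid hand-editing the script for every transfer is to generate it. The sketch below writes a job script with the same #SBATCH settings as the example above to a file of your choice. The function name write_transfer_job and the file name transfer.job are hypothetical, and the partition and time limit still need adjusting to your transfer.

```shell
# Hypothetical helper: write a Slurm job script that runs one rsync transfer.
# The #SBATCH values mirror the example job script; adjust them as needed.
write_transfer_job() {
    src=$1     # source path
    dest=$2    # destination path
    out=$3     # file to write the job script to
    cat > "$out" <<EOF
#!/bin/bash
#SBATCH --job-name=DATA_TRANSFER
#SBATCH --output=res.txt
#SBATCH --ntasks=1
#SBATCH --partition=express
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=2048

rsync -aP "$src" "$dest"
EOF
}

# Example (placeholders, not real paths):
#   write_transfer_job /data/scratch/$USER/run1/ /data/user/$USER/run1 transfer.job
#   sbatch transfer.job
```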

FileZilla

Installation

FileZilla can be downloaded from the website https://filezilla-project.org/ under Quick download links. Download the FileZilla Client version to transfer from local to Cheaha.

A setup wizard window should be launched. Once the wizard is running, simply follow the prompts until the installation process is completed. After the installation procedure has been completed and the setup wizard has terminated, open FileZilla and proceed to connect it to Cheaha.


Connect to Cheaha (SFTP Server)

The first step is to connect to the server. There are two ways to connect to Cheaha:

1. Basic method

Go to File > Site Manager > New Site and enter the following:

  • Hostname: cheaha.rc.uab.edu
  • Username and password: the same as your Cheaha login details
  • Port: 22
  • Logon type: Normal
  • Protocol: SFTP - SSH File Transfer Protocol

2. Quick connect

Enter the hostname into the quickconnect bar's Host: field, the username into the Username: field, and the password into the Password: field. Enter 22 in the Port: field, as in the basic method, then click Quickconnect.


Transferring Data (Upload and Download Files)

1. Upload Data

First, in the local pane, bring into view the directory containing the data to be uploaded. Next, navigate to the desired target directory on the server using the server pane's file listings. To upload, select the files or directories and drag them from the local pane to the remote pane. The files are added to the transfer queue at the bottom of the window and are removed from it once they have been uploaded. The uploaded files and directories should then appear in the server content listing on the right side of the window.

2. Download Data

Downloading files, or complete directories, works essentially the same way as uploading - you just drag the files/directories from the remote pane to the local pane this time, instead of the other way round.

Further Reading

A good primer on data movement for researchers is Harry Mangalam's HOW-TO on moving data (http://moo.nac.uci.edu/~hjm/HOWTO_move_data.html).