Winter 2018 Maintenance



Attention: Research Computing Documentation Has Moved

Please use the new documentation site, https://docs.rc.uab.edu/, for all Research Computing documentation needs.


As a result of this move, this wiki is deprecated and now read-only. The content remains available to ease migration of bookmarks and to serve as a historical record. All content updates should be made at the new documentation site; the original wiki will not receive further updates.

Thank you,

The Research Computing Team

The HPC 2018 Winter Maintenance is scheduled from Sunday, December 16 through Saturday, December 22, 2018. This maintenance requires user action to preserve files in /data/scratch.

Please review the details below to determine if this affects your data. During the maintenance, job execution will be suspended and any jobs remaining in the queue at the start of the maintenance will be removed to allow for service and upgrades to the cluster.


This maintenance involves service to the cluster storage to add capacity and increase performance. The /data/user and /data/project storage locations will be preserved. However, DATA IN /data/scratch WILL NOT BE PRESERVED. Please arrange to move any data you wish to preserve to /data/user, /data/project, or other off-cluster storage.
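
A minimal sketch of moving data off /data/scratch before the window, assuming a per-user directory under /data/scratch and a "results" subdirectory (both illustrative; substitute the paths you actually use):

  # Copy data you want to keep from scratch to permanent storage.
  # Paths are illustrative; adjust to match your actual directories.
  rsync -av --progress /data/scratch/$USER/results/ /data/user/$USER/results/

  # Verify the copy before relying on it; scratch will not be preserved.
  diff -r /data/scratch/$USER/results /data/user/$USER/results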


Users are reminded that /data/scratch is a location for temporary file storage during computation and provides no assurances of long-term availability. Data that must remain available beyond job execution time frames should be moved to /data/user or /data/project.


As always, we will work to maintain access to the login node and file system so that data access operations are minimally impacted. If possible, we will also reduce the period of time that compute nodes are unavailable. Our goal is to complete these updates with minimal disruption. Unfortunately, some steps still require user-visible restarts to systems and services.


As the maintenance window approaches, only jobs that can complete before the window begins will be queued and initiated. This is intended to ensure that no pending jobs remain in the queue when the maintenance window opens.
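
In practice, whether a job can start is determined by its requested walltime. A sketch of checking and shortening that request (the job script name and time limit are illustrative):

  # Show job ID, name, remaining time, and time limit for your jobs.
  squeue -u $USER -o "%.10i %.20j %.10L %.10l"

  # Submit with a walltime short enough to finish before the window opens.
  sbatch --time=08:00:00 myjob.sh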


When: Sunday, December 16 through Saturday, December 22, 2018


What:

  * Update the cluster management stack from Bright Cluster Manager 8.0 to 8.1
  * Upgrade the operating system to RHEL 7.6
  * Update the Slurm job scheduler from 17.02.2 to 17.11.8 (or possibly 18.02.x)
  * Enable pam_slurm.so to limit SSH access to compute nodes to users with an active job on the node (see the SSH sketch after this list)
  * Add a Slurm epilog script to report job resource utilization (see the sacct sketch after this list)
  * BETA - Roll out the Open OnDemand portal - <LINK TO MORE INFORMATION>
  * Possibly move from tmod to lmod, if we can confirm a seamless transition (see the module sketch after this list)
  * Update CUDA versions (see the version-check sketch after this list)
  * Update Mellanox OFED versions
  * Migrate /data/user and /data/project to new GPFS storage
  * Upgrade firmware on hardware
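
Once pam_slurm.so is in place, SSH to a compute node succeeds only while you have a job running on that node. A sketch of the expected behavior (the node name and job parameters are illustrative):

  # Start an interactive job; Slurm allocates a node, e.g. c0123.
  srun --pty --time=01:00:00 /bin/bash

  # While the job is active, SSH to the allocated node is permitted.
  ssh c0123

  # Once you have no active job on c0123, the same login is denied by PAM.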
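
Independent of the new epilog report, resource utilization for completed jobs can already be queried with sacct; a sketch (the job ID and field list are illustrative, not the epilog's exact format):

  # Summarize elapsed time, CPU time, and peak memory for a finished job.
  sacct -j 123456 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State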
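
If the move from tmod to lmod goes ahead, everyday module commands should behave the same; lmod additionally provides "module spider" for searching the module tree. A sketch (the module names are hypothetical):

  # These commands work identically under tmod and lmod.
  module avail
  module load example/1.0    # hypothetical module name
  module list

  # lmod-specific: search across the full module hierarchy.
  module spider cuda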
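
After the maintenance, the upgrades above can be verified from the command line; a sketch of common checks (assumes a CUDA module is loaded and the OFED tools are installed on the node):

  cat /etc/redhat-release   # operating system release; expect RHEL 7.6
  sinfo --version           # Slurm scheduler version
  nvcc --version            # CUDA toolkit version
  ofed_info -s              # Mellanox OFED version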


Please contact support@listserv.uab.edu with any questions or concerns.


Thank you,


Research Computing