Winter2018Maintenance

The HPC 2018 Winter Maintenance is scheduled from Sunday, December 16, through Saturday, December 22, 2018. This maintenance requires user action to preserve files in /data/scratch.

Please review the details below to determine if this affects your data. During the maintenance, job execution will be suspended and any jobs remaining in the queue at the start of the maintenance will be removed to allow for service and upgrades to the cluster.


This maintenance involves service to the cluster storage to add capacity and increase performance. The /data/user and /data/project storage locations will be preserved. However, DATA IN /data/scratch WILL NOT BE PRESERVED. Please move any data you wish to keep to your /data/user, /data/project, or other off-cluster storage before the maintenance begins.
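
A minimal sketch of moving data out of /data/scratch before the maintenance is shown below. The directory names and the choice of /data/user as the destination are illustrative placeholders; substitute your own paths, and verify the copy before deleting anything from scratch.

 # Illustrative sketch: copy a scratch directory to user storage before the maintenance.
 # "my_results" and the destination path are placeholders, not actual directories.
 SRC=/data/scratch/$USER/my_results
 DEST=/data/user/$USER/my_results

 # rsync preserves timestamps and permissions and can safely be re-run if interrupted.
 mkdir -p "$DEST"
 rsync -av "$SRC/" "$DEST/"

 # Compare the two trees before removing anything from /data/scratch.
 diff -r "$SRC" "$DEST" && echo "copy verified"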


Users are reminded that /data/scratch is a location for temporary file storage during computation and provides no assurances of long-term availability. Data that must remain available beyond job execution time frames should be moved to /data/user or /data/project.


As always, we will work to maintain access to the login node and file system so that data access operations are minimally impacted. If possible, we will also reduce the period of time that compute nodes are unavailable. Our goal is to complete these updates with minimal disruption. Unfortunately, some steps still require user-visible restarts to systems and services.


As the maintenance window approaches, only jobs that can complete before it begins will be queued and started. This is intended to ensure that no pending jobs remain in the queue when the maintenance window opens.
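
Under Slurm this typically means a job's requested wall time is what determines whether it fits before the window, so a shorter --time request may allow a job to start that would otherwise wait until after maintenance. A rough sketch, with an illustrative script name and time limit:

 # Request a wall time short enough for the job to finish before the window opens.
 # "my_job.sh" and the 4-hour limit are placeholders.
 sbatch --time=04:00:00 my_job.sh

 # Jobs that cannot fit will stay pending; squeue shows their state and reason.
 squeue -u $USER -l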


When: December 16 through 22


What:

  * Cluster management stack will be updated from Bright Cluster Manager 8.0 to 8.1
  * Operating system will be upgraded to RHEL 7.6
  * Slurm job scheduler will be updated from 17.02.2 to 17.11.8 (or possibly 18.02.x)
  * pam_slurm.so will be enabled so that SSH access to a compute node is limited to users with an active job on that node (see the example after this list)
  * A Slurm epilog script will be added to report job resource utilization
  * BETA - Rollout of the Open OnDemand portal - <LINK TO MORE INFORMATION>
  * Possible move from tmod to lmod if we can confirm a seamless transition
  * CUDA version updates
  * Mellanox OFED version updates
  * Migrate /data/user and /data/project to new GPFS storage
  * Upgrade firmware on hardware
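
As a rough illustration of the pam_slurm.so change listed above: once it is enabled, SSH to a compute node is expected to work only while you have an active job on that node. The node name below is a placeholder; use squeue to see where your jobs are actually running.

 # List the nodes where your jobs are currently running.
 squeue -u $USER -t RUNNING

 # SSH to a node that appears in that list should still succeed:
 ssh c0001    # "c0001" is an illustrative node name, not a specific host

 # SSH to a node where you have no active job is expected to be rejected by pam_slurm.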


Please contact support@listserv.uab.edu with any questions or concerns.


Thank you,


Research Computing