From UABgrid Documentation
Revision as of 10:29, 17 December 2018 by (Talk | contribs)


The HPC 2018 Winter Maintenance is scheduled from Sunday, December 16, through Saturday, December 22, 2018. This maintenance requires user action to preserve files in /data/scratch. Please review the details below to determine whether this affects your data.

During the maintenance, job execution will be suspended and any jobs remaining in the queue at the start of the maintenance will be removed to allow for service and upgrades to the cluster.

This maintenance involves service to the cluster storage to add capacity and increase performance. The /data/user and /data/project storage locations will be preserved. However, DATA IN /data/scratch WILL NOT BE PRESERVED. Please arrange to move any data you wish to keep to /data/user, /data/project, or off-cluster storage before the maintenance begins.

Users are reminded that /data/scratch is a location for temporary file storage during computation and provides no assurances of long-term availability. Data that must remain available beyond job execution time frames should be moved to /data/user or /data/project.
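As an illustration of moving data out of scratch, the following is a minimal Python sketch, not an official Research Computing tool. The function name `preserve_directory` is hypothetical, and the layout of directories under /data/scratch and /data/user is an assumption; substitute the paths that apply to your account. It copies a directory tree and then reports any source file that is missing or has a different size at the destination. It requires Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
from pathlib import Path


def preserve_directory(source: Path, destination: Path) -> list:
    """Copy `source` into `destination` and return the source files that
    are missing or differ in size at the destination (empty if all match)."""
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")

    # dirs_exist_ok=True lets the copy be re-run after an interruption.
    shutil.copytree(source, destination, dirs_exist_ok=True)

    # Basic check: every source file must exist at the destination
    # with the same size. This is not a checksum comparison.
    problems = []
    for src_file in source.rglob("*"):
        if src_file.is_file():
            dst_file = destination / src_file.relative_to(source)
            if (not dst_file.is_file()
                    or dst_file.stat().st_size != src_file.stat().st_size):
                problems.append(src_file)
    return problems
```

For example, assuming a hypothetical per-user layout, `preserve_directory(Path("/data/scratch/myuser/results"), Path("/data/user/myuser/results"))` would copy the results folder and return an empty list if every file arrived intact. The size comparison is only a sanity check; for critical data, verify the copy independently before the maintenance starts.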

As always, we will work to maintain access to the login node and file system so that data access operations are minimally impacted. If possible, we will also reduce the period of time that compute nodes are unavailable. Our goal is to complete these updates with minimal disruption. Unfortunately, some steps still require user-visible restarts to systems and services.

As the maintenance time approaches, only jobs that can complete before the maintenance time will be queued and initiated. This is intended to ensure no pending jobs can remain in the queue during the maintenance window.
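The rule described above can be sketched as a simple check: a job may start only if its start time plus its requested time limit falls on or before the start of the maintenance. The Python below is an illustration of that rule, not the scheduler's actual logic. The function name is hypothetical, and the assumption that the window opens at 00:00 on December 16, 2018 is ours; the announcement does not state an hour.

```python
from datetime import datetime, timedelta

# Assumption: the maintenance window opens at midnight on December 16, 2018.
MAINTENANCE_START = datetime(2018, 12, 16, 0, 0)


def can_start_before_maintenance(start, time_limit,
                                 maintenance_start=MAINTENANCE_START):
    """Return True if a job starting at `start` with the requested
    `time_limit` would finish on or before the maintenance start."""
    return start + time_limit <= maintenance_start
```

Under this rule, a job that could start at noon on December 14 with a 24-hour time limit would be allowed to run, while the same job with a 48-hour limit would not be started. Requesting a time limit no longer than the job actually needs therefore makes it more likely to run before the maintenance.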

When: December 16 through 22, 2018


   Cluster management stack will be updated from Bright Cluster Manager 8.0 to 8.1
   Operating system will be upgraded to RHEL 7.6
   Slurm job scheduler will be updated from 17.02.2 to 17.11.8 (or possibly 18.02.x)
   SSH access to compute nodes will be limited to users with active job(s) on the node
   Slurm epilog script will be added to report job resource utilization
   BETA - Rollout of Open OnDemand portal - <LINK TO MORE INFORMATION>
   Possible move from Tmod to Lmod, if we can confirm a seamless transition
   CUDA version updates
   Mellanox OFED version updates
   /data/user and /data/project will be migrated to new GPFS storage
   Firmware will be upgraded on hardware

Please contact Research Computing with any questions or concerns.

Thank you,

Research Computing
