Winter2018Maintenance

From UABgrid Documentation

The HPC 2018 Winter Maintenance is scheduled from Sunday, December 16, through Saturday, December 22, 2018. This maintenance requires user action to preserve files in /data/scratch.

Please review the details below to determine if this affects your data. During the maintenance, job execution will be suspended and any jobs remaining in the queue at the start of the maintenance will be removed to allow for service and upgrades to the cluster.

This maintenance involves service to the cluster storage to add capacity and increase performance. The /data/user and /data/project storage locations will be preserved. However, DATA IN /data/scratch WILL NOT BE PRESERVED. Please arrange to move any data you wish to preserve to /data/user, /data/project, or off-cluster storage.

Users are reminded that /data/scratch is a location for temporary file storage during computation and provides no assurances of long-term availability. Data that must remain available beyond job execution time frames should be moved to /data/user or /data/project.
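One way to move a results directory out of /data/scratch before the maintenance is sketched below. This is a minimal illustration and not an official tool: the function name copy_out_of_scratch and the example paths in the comments are hypothetical, and the check only compares file sizes rather than checksums. It needs Python 3.8 or later because of the dirs_exist_ok argument.

```python
import shutil
from pathlib import Path
from typing import List


def copy_out_of_scratch(src, dst) -> List[Path]:
    """Copy the directory tree at src into dst.

    Returns the relative paths of all files that were copied and
    size-checked. Raises if the source is missing or a file in the
    destination is absent or differs in size.
    """
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise NotADirectoryError(f"source directory not found: {src}")

    # copy2 keeps timestamps and permission bits; dirs_exist_ok lets the
    # copy be re-run after an interruption without failing on existing dirs.
    shutil.copytree(src, dst, copy_function=shutil.copy2, dirs_exist_ok=True)

    copied = []
    for path in sorted(src.rglob("*")):
        if path.is_file():
            rel = path.relative_to(src)
            target = dst / rel
            # Basic sanity check only: the file exists and the sizes match.
            if not target.is_file() or target.stat().st_size != path.stat().st_size:
                raise RuntimeError(f"copy verification failed for {rel}")
            copied.append(rel)
    return copied


# Hypothetical usage (the per-user directory layout is an assumption):
# copy_out_of_scratch("/data/scratch/<username>/results",
#                     "/data/user/<username>/results")
```

The source directory is left in place, so nothing is deleted from /data/scratch by this sketch; the copy can be inspected before the maintenance begins.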

As always, we will work to maintain access to the login node and file system so that data access operations are minimally impacted. If possible, we will also reduce the period of time that compute nodes are unavailable. Our goal is to complete these updates with minimal disruption. Unfortunately, some steps still require user-visible restarts to systems and services.

As the maintenance approaches, only jobs that can complete before it begins will be queued and started. This is intended to ensure that no pending jobs remain in the queue during the maintenance window.
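The arithmetic behind this rule can be sketched as follows. This is an illustration only, under two stated assumptions: that the maintenance window opens at 00:00 on December 16, 2018 (the announcement gives no time of day), and that a job counts as able to complete when the current time plus its requested time limit does not pass that point. The function names are hypothetical and are not part of Slurm.

```python
from datetime import datetime, timedelta

# Assumption: the window opens at midnight on December 16, 2018.
# The announcement states the date but not the time of day.
MAINTENANCE_START = datetime(2018, 12, 16, 0, 0)


def can_start_before_maintenance(now, time_limit,
                                 maintenance_start=MAINTENANCE_START):
    """Return True if a job started at `now` with the requested
    `time_limit` would end no later than the maintenance start."""
    return now + time_limit <= maintenance_start


def max_time_limit(now, maintenance_start=MAINTENANCE_START):
    """Return the longest time limit that still fits before the
    maintenance start, or zero if the window has already opened."""
    return max(maintenance_start - now, timedelta(0))
```

For example, under these assumptions a job submitted at noon on December 14 with a 24-hour time limit would fit, while the same job with a 48-hour limit would not; requesting a shorter time limit is the way to keep a job eligible as the window approaches.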

When: December 16 through 22, 2018

What:

  • Cluster management stack will be updated from Bright Cluster Manager 8.0 to 8.1
  • Operating system will be upgraded to RHEL 7.6
  • Slurm job scheduler will be updated from 17.02.2 to 17.11.8 (or possibly 18.02.x)
  • Enable pam_slurm.so to restrict SSH access on compute nodes to users with active job(s) on that node
  • Add Slurm epilog script to report job resource utilization
  • BETA - Roll out Open OnDemand portal - <LINK TO MORE INFORMATION>
  • Possibly move from tmod to Lmod if we can confirm a seamless transition
  • CUDA version updates
  • Mellanox OFED version updates
  • Migrate /data/user and /data/project to new GPFS storage
  • Upgrade firmware on hardware

Please contact support@listserv.uab.edu with any questions or concerns.
