Maintenance
Upcoming maintenance on the VACC cluster
Schedule
- May 4th:
    - Cluster will be reduced to 50% capacity starting at 6:30AM to complete electrical work on PSR4B. We expect to return to 100% capacity around 2PM.
Due to an equipment failure on April 21, we were not able to maintain our original schedule for UPS B replacement on April 22. Please come back to this page later for updates, or stay tuned to VACC-USERS@LIST.UVM.EDU.
During May/June, we expect one day of capacity reduced to 50% for the shutdown of UPS B, followed by two weeks during which 50% of the cluster will have no backup power, then one day at 50% capacity while the new UPS is installed.
- May/June (one day, to be scheduled):
    - UPS B shutdown for removal (5 hours: 7:00AM-12:00PM).
    - VACC again reduced to 50% capacity.
    - We may take this opportunity to perform storage firmware maintenance, which has a small possibility of causing filesystem unavailability. However, the risk is deemed low enough that we are not taking the cluster down.
- Following two weeks:
    - VACC at 100% capacity, including new equipment, but B-side circuits not UPS backed.
- Two weeks after UPS B initial shutdown:
    - New UPS B installation (8 hours of load testing).
    - VACC at 50% capacity.
- Day after installation:
    - VACC returns to normal operation, with new equipment and cooling.
Note: This schedule depends on the timely delivery of needed parts and the availability of personnel. Additional delays could affect the schedule above. We will update the schedule as changes occur.
Data Center Upgrades
As of February 23, 2026:
The VACC has acquired new compute hardware for IceCore:
- 5 (five) HPE Cray XD670 with 100Gb Eth, NDR (400Gb) IB, 1 TB RAM, 8x NVIDIA H200 SXM 700W TDP GPUs with 141 GB HBM each
- 6 (six) HPE DL380a GPU nodes with 8x NVIDIA RTX 6000 Server Edition GPUs with 96 GB VRAM, 100Gb Eth, NDR (400Gb) IB, 1 TB RAM
- 16 (sixteen) HPE DL380a GPU nodes with 100Gb Eth, NDR (400Gb) IB, 1 TB RAM, 4x H200 NVL 600W TDP GPUs
    - Eleven of these are already in production, in temporary racks. Three of them comprise GoldenMaple.
- 17 (seventeen) HPE DL365 compute nodes with 2x AMD EPYC 9655 CPUs, NDR200 IB, 100Gb Eth, 1.5 TB RAM
To support this new hardware for IceCore, we will need new power circuits and additional cooling capacity in the data center. We are currently at the limit of what we can safely cool without overheating.
For cooling, UVM has purchased additional cooling distribution (Motivair MCDU-40) and 3 additional Motivair ChilledDoors to cool the new VACC hardware.
Several maintenance windows will require us to temporarily reduce compute capacity in the VACC and, in some cases, schedule downtime. Parts and personnel availability will determine actual dates; we will strive to provide ample notice of outages and keep the number of outages and reduced capacity periods to a minimum.
Circuit upgrades
New power circuits need to be added. This work will be done in two phases: A-side circuits and B-side circuits.
A-side circuit upgrades:
- One 4-hour window during which the entire VACC is down.
- Eight hours of the VACC at 50% capacity. DeepGreen (V100 GPUs) will be entirely offline.
There will be a pause of two days between the A-side and B-side work.
B-side circuit upgrades:
- Eight hours of the VACC at 50% capacity. DeepGreen (V100 GPUs) will be entirely offline.
Current estimate for the start of A-side circuit upgrades is March 9th.
Secondary cooling upgrades
During secondary cooling upgrades, we will need to reduce load on the VACC by 50% so that the data center does not overheat.
Secondary cooling upgrades:
- Five to seven business days of the VACC at 50% capacity.
- At the end of the secondary cooling upgrades, commissioning tests will be performed on the new secondary cooling infrastructure. For the last two days, GPU nodes will be unavailable to users, since we need to run them at 100% load to stress-test the infrastructure.
Current estimate for the start of reduced capacity due to the secondary cooling shutdown is March 9th. Ideally, these reduced-capacity days will overlap with the circuit upgrade work.
After secondary cooling is upgraded, DeepGreen will be retired; its V100 GPUs will be replaced by nodes providing newer H200 GPUs.
UPS replacement
Our data center's UPSes are 20 years old and increasingly difficult to maintain. We plan to replace them in the coming months.
Many compute nodes are covered by only a single UPS, so they must be powered down during electrical work.
A-side UPS replacement:
- A 4-hour window with the VACC at 50% capacity.
- Four to five business days with the VACC at 100% capacity; however, 50% of the cluster will not be UPS backed, so disruptions in utility power could cause node failures and loss of jobs.
- An 8-hour window with the VACC at 50% capacity while load tests are performed and the new UPS is connected.
B-side UPS replacement:
- Essentially a repeat of the A-side replacement.
A-side UPS replacement is estimated to begin in March. B-side UPS replacement is estimated to begin in late April.
We plan to pause data center work for Research Week (April 13-17). Remaining UPS, cooling, and power work will continue after April 20.
Completed maintenance
April, 2026
- April 1-April 3:
    - VACC continued at 50-100% capacity (A-side circuits/nodes not UPS backed; "not UPS backed" means an electrical outage or disturbance could cause some compute nodes/jobs to fail).
    - IceCore (H200) nodes migrated into new racks, including the GoldenMaple (H200) nodes.
    - Substantial completion of the secondary cooling loop. Commissioning still to be completed, but cooling was essentially online at this point.
- April 4 (Saturday):
    - UPS A shutdown for installation (8 hours of load testing).
    - VACC at 50% capacity.
- April 5-12:
    - VACC at 50-100% capacity, and beyond 100% as new hardware was brought into production.
    - New scheduler configuration in place (GPU features; see the sketch after this list).
- April 13-17:
    - UVM Research Week: VACC/IceCore available beyond 100%.
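For users adapting job scripts to the new GPU features: the sketch below shows one way to submit a job pinned to a particular GPU type via Slurm's feature constraints. The partition name (`gpu`), feature tag (`h200`), and script name (`job.sh`) are illustrative assumptions rather than confirmed VACC values; list the features actually configured with `sinfo -o "%N %G %f"`.

```python
# Minimal sketch: submit a Slurm job constrained to a GPU feature tag.
# Partition ("gpu"), feature ("h200"), and script ("job.sh") are
# hypothetical placeholders; check `sinfo` for your cluster's names.
import subprocess

result = subprocess.run(
    [
        "sbatch",
        "--partition=gpu",     # assumed partition name
        "--gres=gpu:1",        # request one GPU
        "--constraint=h200",   # assumed feature tag for H200 nodes
        "job.sh",              # your batch script
    ],
    check=True,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())   # e.g. "Submitted batch job 12345"
```

The same constraint can also be placed directly in the batch script as an `#SBATCH --constraint=h200` directive.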
March 9-25, 2026
- March 9:
    - Secondary cooling loop was shut down. VACC reduced to 50% capacity.
    - DeepGreen was shut down permanently. No more V100 GPUs will be available.
- March 10:
    - VACC cluster was shut down at 5:30AM for the "A" side electrical panel shutdown.
    - During the day, the cluster was brought back to 50% capacity.
- March 10-12:
    - VACC continued at 50% capacity while the secondary cooling loop was offline.
- March 13:
    - VACC cluster resources were unavailable from 5:00AM-12:00PM due to the electrical panel shutdown.
    - After power was restored, VACC came back at 50% capacity.
    - Initial startup and configuration of rear door heat exchangers for IceCore.
- March 14-17:
    - VACC continued at 50% capacity.
    - New secondary cooling installation.
- March 18:
    - UPS A was shut down for removal (6 hours: 6:00AM-12:00PM).
    - Nodes supported by UPS A were unavailable.
- March 19:
    - VACC continued at 50-100% capacity (A-side circuits/nodes not UPS backed; "not UPS backed" means an electrical outage or disturbance could cause some compute nodes/jobs to fail).
    - During this time, IceCore (H200) nodes began migrating into new racks, including the GoldenMaple (H200) nodes.
January 7-8, 2026
The cluster was down for scheduled maintenance to upgrade the operating system and scheduler. We moved:
- RHEL from 9.4 to 9.6, including many bugfixes.
- Slurm from 25.05 to 25.11.
GPFS3 rebuild
All files on /gpfs3 were deleted on January 7th so that we could rebuild the filesystem. A new policy of automatically deleting files that have not been accessed within 60 days was implemented. To emphasize the new policy, /gpfs3 was renamed /gpfs3tmp.
Details about /gpfs3tmp
To improve service to VACC users, we rebuilt the /gpfs3 filesystem on Jan 7, 2026. This filesystem was originally intended only for temporary files. After the rebuild, it was renamed /gpfs3tmp, and automatic purging of files that are not being accessed was implemented. Directories on it are created only per PI group. There are two main changes to be aware of:
- Files untouched for sixty (60) days will be automatically deleted. Since this is scratch (temporary) storage, there is no backup. A warning email will be sent at the forty (40) day mark, but no notification will be sent when files are deleted on day sixty (see the sketch after this list for one way to spot files at risk).
- No per-user directories are automatically created. Group members will be able to create subdirectories under their group's PI directory.
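Since no notification is sent when files are purged on day sixty, it can be useful to scan your group's directory yourself for files whose last access time is past the warning threshold. Below is a minimal sketch, assuming a hypothetical group directory `/gpfs3tmp/my-pi-group`; substitute your PI group's actual path.

```python
# Minimal sketch: list files under a /gpfs3tmp group directory whose
# last access time is older than WARN_DAYS, i.e. files approaching
# the 60-day automatic purge. GROUP_DIR is a hypothetical placeholder.
import os
import time
from pathlib import Path

GROUP_DIR = Path("/gpfs3tmp/my-pi-group")  # assumed; use your group's directory
WARN_DAYS = 40                             # warning emails go out at day 40
cutoff = time.time() - WARN_DAYS * 86400   # atimes older than this are at risk

for dirpath, _dirs, files in os.walk(GROUP_DIR):
    for name in files:
        path = Path(dirpath, name)
        try:
            if path.stat().st_atime < cutoff:
                print(path)
        except OSError:
            continue  # file vanished or is unreadable; skip it
```

Because the policy is based on access time, reading or copying a file resets its purge clock, so treat the output as a to-do list of data to move elsewhere rather than a guarantee.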
Regarding the previously existing gpfs3: a snapshot was taken of the filesystem before it was deleted and rebuilt. However, this backup (which we do not normally perform for scratch storage) will be retained for only 60 days (until March 8, 2026).