Posts

Firebird VPN testing

Note: This test, previously scheduled for April 9, has been postponed. We will provide an update when it has been rescheduled.

The OIT Networking team will be testing updates to the PACE VPN portal, which controls access to Firebird. No impact to existing or new connections is expected during the test. The test should conclude in a single evening, with the prior configuration restored afterward.

In case of unexpected impact, Firebird researchers may temporarily be unable to connect to the PACE VPN. Please report PACE VPN errors during the test to the GT Administrative Services Center at https://gatech.service-now.com/technology

This test is necessary to prepare for upcoming changes to VPN authentication procedures and ensure a smooth transition. 

Please email pace-support@oit.gatech.edu with any questions.

Lustre project storage outage

We appreciate your patience. We identified an issue in which a user was running an excessive number of file-operation jobs on Phoenix Lustre project storage. We canceled those jobs, reached out to the user, and are now observing a return to normal system performance.

Lustre issues: Open OnDemand

Update: The Phoenix OnDemand server has become unresponsive, so all users are currently unable to reach it. The PACE team is investigating. Access to the Lustre project storage filesystem has been restored, and it is now accessible from login and compute nodes. PACE continues to monitor the filesystem.

Lustre project file system issues on Phoenix cluster

Dear Phoenix Users,

Summary: /storage/coda1 is only partially available from the Phoenix cluster, affecting users with project directories in that file system (Lustre project storage).

Details: One of the metadata servers for the Phoenix Lustre project storage system rebooted overnight and has not yet returned to service. We are working to restore access.

Impact: Login via ssh to Phoenix may hang temporarily for all users but should eventually succeed. Researchers who have not migrated from Lustre project storage to VAST may not be able to access Phoenix OnDemand, and access to Lustre project storage may not be available from any Phoenix system. VAST project storage is not affected, nor is Lustre scratch storage.

Thank you for your patience as we work to resolve this issue. Please contact pace-support@oit.gatech.edu with any questions or concerns.

Best,

-The PACE Team

PACE Unplanned Downtime – March 16th

Dear PACE Users, 

A critical security vulnerability has been identified in an authentication tool required for our systems. These types of issues need to be resolved within 30 days of the fix being available, following GT security guidelines. All PACE clusters (ICE, Phoenix and Firebird) will undergo a brief maintenance period starting at 6:00am on March 16th, to apply this fix. 

A reservation has been set on all schedulers to prevent any jobs from starting that would not complete before 6:00am on March 16th. We expect to complete this update within the day and will notify users of each cluster when they can resume work. During this time, any job that is submitted will be held until the patching work and testing are complete and normal operations have resumed. Users will be able to log in and access data.
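For readers wondering whether a given job will start before the downtime, the rule described above can be sketched as a simple time comparison. This is only an illustration of the idea, not the scheduler's actual implementation: the function name is hypothetical, and the year in the example is an assumption, since the announcement gives only the month and day.

```python
from datetime import datetime, timedelta


def can_start_before_maintenance(now, walltime, maintenance_start):
    """Return True if a job starting at `now` with the requested
    `walltime` would finish no later than `maintenance_start`."""
    return now + walltime <= maintenance_start


# Assumption: the year is illustrative; the announcement states only March 16th, 6:00am.
maintenance_start = datetime(2026, 3, 16, 6, 0)
now = datetime(2026, 3, 15, 20, 0)

# An 8-hour request would end at 04:00, before the reservation begins.
print(can_start_before_maintenance(now, timedelta(hours=8), maintenance_start))

# A 12-hour request would end at 08:00, so the job would be held.
print(can_start_before_maintenance(now, timedelta(hours=12), maintenance_start))
```

In practice, this suggests that requesting a shorter walltime (for example, with Slurm's `--time` option) may allow a job to run before the maintenance begins, provided resources are available.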

Reminder emails will be shared as we approach the time of the maintenance. 

Best, 

The PACE Team

PACE Maintenance Period – 01/12/26 to 01/16/26

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00AM on Monday, January 12th, and is scheduled to end no later than 11:59PM on Thursday, January 15th; ICE will open to Spring 2026 courses on Friday, January 16th. The additional day is needed to install a second cooling pump at the data center to provide redundancy for PACE clusters. PACE will release each cluster (Phoenix, Firebird, ICE, and Buzzard) as soon as maintenance work and testing are completed.

 
WHAT DO YOU NEED TO DO?   

As usual, jobs with resource requests that would be running during the Maintenance Period will be held until after the maintenance by the scheduler. During this Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected. 

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • None 

ITEMS NOT REQUIRING USER ACTION: 

  • [all] DataBank will install a second cooling pump into the research hall cooling loop, providing redundancy. 
  • [all] Apply maintenance updates to all compute nodes 
  • [Phoenix, ICE, Firebird] Upgrade clusters to Slurm 25.05.5 
  • [Storage] Enable Write Back on all VAST storage for performance improvements 
  • [all] Replace some PDU and IB network switches with new equipment 
  • [Storage] Apply maintenance upgrades to Lustre file system appliances 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Globus Connectors for cloud storage on PACE

We’d like to highlight Globus Connectors for cloud storage as the best way to transfer files between PACE and cloud storage services, including Dropbox, Box, and OneDrive. Globus Connectors make it easy to move files between PACE storage (on Phoenix, ICE, or CEDAR) and cloud storage through Globus’s web interface.

Please avoid large transfers to/from the cloud via rclone or other services on the login node (such as the Dropbox API), as these can cause heavy load on the campus network and impact other researchers. PACE has purchased the cloud connectors to provide a better option, for easier use and less network strain.

You can learn more about how to use Globus, including cloud connectors, on PACE in our documentation. Please contact us with questions, or to suggest other cloud storage services for which the connectors could be installed if they would enable your research.

Complete: PACE Maintenance Period (Oct 6 – Oct 8, 2025)

Dear PACE Users,  

Maintenance on the Phoenix, ICE, Firebird, Buzzard, and Hive clusters is complete. All clusters are back in production. As a reminder, Hive is only available for data retrieval until November 1st.  

A message to Phoenix users:  

Phoenix login nodes were upgraded from Intel Cascade Lake to Granite Rapids as a part of PACE’s ongoing efforts to provide cutting-edge research infrastructure – in addition, FIVE new H200 GPU nodes have been added to the system.  

Researchers can leverage Granite Rapids’ Advanced Matrix Extensions (AMX) instructions to improve performance of linear algebra operations. AMX optimization requires intentional code changes. Importantly, AMX-enabled code is likely not backwards compatible with older hardware (e.g., Cascade Lake). Instructions for compiling software for different CPU architectures are available here.

Researchers not utilizing AMX should not be impacted by the login node upgrade; however, some code/software may be sensitive to an underlying change in compute architecture. Prior to the upgrade, both login and default compute nodes were Cascade Lake architecture. Post-upgrade, login nodes are Granite Rapids, while the default compute nodes remain Cascade Lake. If your jobs encounter errors such as “Illegal instruction”, compile your code on the architecture that you intend to run it on.
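One way to see the difference between login nodes (Granite Rapids) and default compute nodes (Cascade Lake) is to compare the CPU feature flags each node reports. The following is a minimal sketch, assuming a Linux node that exposes `/proc/cpuinfo` and that the kernel reports AMX support with the `amx_tile` flag; the helper names are hypothetical and this is not a PACE-provided tool.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_amx(flags):
    """Assumption: the Linux kernel lists AMX support as the 'amx_tile' flag."""
    return "amx_tile" in flags


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            node_flags = cpu_flags(f.read())
    except OSError:
        # Not a Linux system, or /proc is unavailable.
        node_flags = set()
    print("AMX available on this node:", has_amx(node_flags))
```

Running the script on a login node and again inside a job on a compute node should show whether a binary built with AMX instructions on one could be expected to run on the other.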

Best,  

The PACE Team 

1-Week Reminder – PACE Maintenance Period (Oct 6 – Oct 8, 2025)

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00AM on Monday, 10/06/2025, and is tentatively scheduled to conclude by 11:59PM on Wednesday, 10/08/2025. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO? 

As usual, jobs with resource requests that would be running during the Maintenance Period will be held until after the maintenance by the scheduler. During this Maintenance Period, access to all the PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. 

WHAT IS HAPPENING? 

  • All Systems: Cooling tower maintenance and cleanup
  • All Systems: Updating to RHEL 9.6 Operating System
  • Phoenix: New GNR (Granite Rapids) login nodes coming online!
  • Phoenix and ICE: Filesystem checks for project and scratch
  • Phoenix: Updating load balancer for login nodes
  • IDEaS Storage: Updating LDAP configuration

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. 

WHO IS AFFECTED?

All users across all PACE clusters. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns. You may also read this message on our blog.

Thank you, 

The PACE Team

OnDemand Web Portal Access Outage (Campus-Wide)

[Update 10/1/25 12:15 PM]

At this time, PACE expects that all PACE users have restored access to Phoenix/Hive OnDemand. Anyone still receiving an unexpected “unauthorized” message on Phoenix or Hive OnDemand should email pace-support@oit.gatech.edu.

Efforts towards a long-term fix are still in progress. 

Please continue to visit status.gatech.edu for updates. 

[Update 9/30/25 10:40 AM]

The issue appears to be resolved for GT students, faculty, and staff at this time. Troubleshooting continues for impacted external guest accounts while a long-term fix is being developed. 
GT students and employees should email pace-support@oit.gatech.edu if they are still receiving an unexpected “unauthorized” message on Phoenix or Hive OnDemand. 

Please continue to visit status.gatech.edu for updates. 

[Original post 9/29/25 12:30 PM]

Summary: Some researchers are intermittently unable to access Phoenix/Hive OnDemand web portals due to a campus-wide access management outage.

Details: Due to an access management outage affecting various campus services, some users’ identity status became invalid this morning. Campus IT staff are currently investigating and will continue to post updates on the outage at status.gatech.edu.

Impact: Some Phoenix and Hive users may receive an “Unauthorized” message when attempting to reach the Phoenix/Hive OnDemand website. There is no available workaround at this time.