Cluster Status

From ACENET
This page is maintained manually. It gets updated as soon as we learn new information.

Clusters

Cluster    Status  Planned Outage  Notes
Argo       Online  -
Siku       Online  -               reduced capacity due to A/C issues
Placentia  Online  -               Restricted since March 2019


For national clusters (Arbutus, Fir, Narval, Nibi, Rorqual, Trillium) see status.alliancecan.ca

Services

Service                 Status  Planned Outage  Notes
Globus at Argo          Online  -
Globus at Siku          Online  -               Academic users only
Account creation        Manual  No outages      Write support
PGI and Intel licenses  Online  No outages
Legend:
Online: cluster is up and running
Offline: no users can log in or submit jobs, or the service is not working
Online: some users can log in and/or there are problems affecting your work

Outage schedule

Jobs will not be scheduled with a run time (--time=) that extends into the beginning of a planned outage period. This is so the job will not be terminated prematurely when the system goes down.
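To illustrate the rule above, here is a minimal sketch of how one might work out the longest --time value that still finishes before an outage begins. The helper names (max_walltime, format_slurm_time), the submission time, and the 10-minute safety margin are assumptions made for this example; the outage start is the Argo reservation boundary listed further down this page (10:00 NST on Nov 24, 2025). The scheduler applies its own logic, so treat this only as a planning aid.

```python
from datetime import datetime, timedelta, timezone

# Newfoundland Standard Time is UTC-3:30 (NDT, in summer, is UTC-2:30).
NST = timezone(timedelta(hours=-3, minutes=-30), "NST")


def max_walltime(now, outage_start, margin=timedelta(minutes=10)):
    """Longest run time that still ends `margin` before the outage starts.

    Hypothetical helper: the margin is an assumption, not a scheduler setting.
    """
    remaining = outage_start - now - margin
    return max(remaining, timedelta(0))


def format_slurm_time(td):
    """Format a timedelta as D-HH:MM:SS, one of the forms --time accepts."""
    total = int(td.total_seconds())
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Hypothetical submission time, the evening before the outage.
now = datetime(2025, 11, 23, 18, 0, tzinfo=NST)
# Outage start taken from the Argo entry on this page (13:30 UTC).
outage_start = datetime(2025, 11, 24, 10, 0, tzinfo=NST)

limit = max_walltime(now, outage_start)
print(f"--time={format_slurm_time(limit)}")  # prints --time=0-15:50:00
```

A job submitted with a run time at or below the printed value would end before the outage starts, so it would remain eligible to be scheduled.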

  • Siku is still running at reduced capacity due to ongoing A/C issues. See below for details.


Siku

2025 (September to December)

  • We are experiencing issues with the cooling in Siku's data centre. Therefore we will have to terminate all running jobs later today in order to reduce the load on the air-conditioning unit. We will post updates here once we know more.
Monday, Nov 3, 12h15 NST
UPDATE #1 Mon Nov 3, 16h00: We have terminated most of the running jobs. Affected users have been contacted individually. Since the temperature in the data centre has stabilized, we are not planning on cancelling any more jobs, though that could change.
We are now waiting for the A/C technicians to diagnose and fix the cooling unit, though we don't yet have a timeline for that.
UPDATE #2 Tue Nov 4, 14h15 NST: We have continuing problems with Siku's A/C unit. Work is under way to assess and mitigate them. More news will follow tomorrow afternoon.
UPDATE #3 Fri Nov 7, 14h30 NST: The A/C has been partially fixed and Siku now operates at reduced capacity with the following limitations:
* CPU jobs only, because GPU jobs could exceed our current cooling capacity.
* Only jobs up to 72 hours (24 hours for def-* accounts), to reduce the impact in case cooling issues arise again and to improve turnover.
* We will monitor A/C capacity and reassess whether we can increase Siku's capacity further.
* We don't yet have a timeline for Siku's full return to service, since we are waiting for replacement parts for the A/C.
UPDATE #4 Fri Dec 5, 09h40 NST: Parts have been received; repair of the A/C is expected to begin Monday, December 8.
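
As a rough illustration of the temporary limits listed in UPDATE #3, the sketch below checks a planned job against them before submission. The function name check_job and the account names are hypothetical, and the check only mirrors the limits as worded on this page (CPU jobs only, 72 hours at most, 24 hours for def-* accounts); the scheduler's actual configuration may differ.

```python
from datetime import timedelta

# Limits as stated in UPDATE #3 on this page.
MAX_HOURS = 72              # accounts other than def-*
MAX_HOURS_DEF_ACCOUNT = 24  # def-* accounts


def check_job(account, walltime, gpus=0):
    """Return a list of problems; an empty list means the job fits the limits.

    Hypothetical helper for illustration, not part of any ACENET tooling.
    """
    problems = []
    if gpus > 0:
        problems.append("GPU jobs are not accepted at reduced capacity")
    limit_hours = MAX_HOURS_DEF_ACCOUNT if account.startswith("def-") else MAX_HOURS
    if walltime > timedelta(hours=limit_hours):
        problems.append(f"run time exceeds the {limit_hours}-hour limit")
    return problems


# Hypothetical account names, used only to exercise both limits.
print(check_job("def-someuser", timedelta(hours=48)))
print(check_job("rrg-someuser", timedelta(hours=48)))
print(check_job("rrg-someuser", timedelta(hours=12), gpus=1))
```

The first call reports a run-time problem because def-* accounts are capped at 24 hours, the second passes, and the third is flagged for requesting a GPU.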

For older outages see: Previous outages

Argo

2025

  • Argo: This morning at about 8:30 am NST (12h00 UTC), IT Services implemented a network change that caused a short interruption of Argo's external network connection. Running jobs were not affected, though some active SSH sessions and file transfers may have dropped. The work was completed in less than five minutes.
Monday, Dec 8, 09:00 NST
  • Partial power outage at Argo. MUN Facilities Management will be powering down individual power rails, causing some or all nodes to reboot. A reservation will prevent jobs from starting unless they finish before 10am Nfld (13h30 UTC) on that day.
Monday, November 24, 2025
1st UPDATE Mon Nov 24, 17h15 NST: Work has completed for today, but we expect more work to be carried out tomorrow. We've allowed jobs to run overnight as long as they finish by 7h30 NST (11h00 UTC) Tuesday, Nov 25.
2nd UPDATE Tue Nov 25, 16h15 NST: Power work in the Argo data centre was completed and Argo has returned to full capacity. We don't expect any similar outages in the near future.
  • At about 11:30 a.m. NST at Argo we noticed a sudden loss of all networked filesystems (/home, /project and /scratch). To resolve the issue, all nodes had to be rebooted.
Wednesday Nov 5, 12:30 NST
UPDATE Wed Nov 5, 17h10 NST: Almost all nodes of Argo have returned to normal service. Remaining nodes will be re-enabled tomorrow.
  • Memorial University is experiencing network connectivity issues that may prevent access to Argo and Siku. MUN IT Services are investigating and working towards resolving the issues.
Wednesday Aug 27, 10:30 NDT
UPDATE Thu Aug 28, 10h30: Memorial University announced that all network issues had been resolved by Wednesday 18h00 NDT. Since the issues mostly affected connections from MUN's network to resources outside of the university, the impact on users of ACENET systems was minimal.
  • Both Siku and Argo started experiencing network connectivity issues around 09h30 NDT. The systems team is on site to investigate and is working on resolving the issue.
Tue Jun 3 2025 10:24 NDT
UPDATE June 3, 12h00 NDT: External network activity was restored to both Siku and Argo about 30 minutes ago. Running and queued jobs were unaffected by this issue.
  • Both Siku and Argo were offline from March 18 to 20 for network and system maintenance.
    During the outage, the public IP addresses of both clusters changed and moved to a different subnet, and software updates were installed.
    Storage quotas are now also being enforced at Argo.
Wed Mar 12 2025 11:30 NDT
UPDATE March 18, 08h30 NDT: The planned maintenance has started. We will continue to post updates here.
UPDATE March 20, 10h10 NDT: The planned maintenance has been completed and job scheduling has been resumed.
UPDATE March 27, 09h30 NDT: Globus file transfer at Argo has been restored.


  • Due to a critical cooling failure in the data centre, we had to perform an emergency shutdown of Argo on the morning of Saturday, February 15. We expect Argo to become available again sometime on Monday, February 17.
12:30, Feb 15, 2025 (NST)
Update #1: Argo's login nodes and filesystems are available again; however, the compute nodes will remain offline until next week.
14:30, Feb 15, 2025 (NST)
Update #2: Over the course of today we have released about half of Argo's CPU nodes and all GPU nodes back into production. We continue to work on the remaining nodes.
16:30, Feb 17, 2025 (NST)
Update #3: Most of Argo's compute nodes are back in production and we will continue enabling the remaining ones as soon as they are available.
13:30, Feb 19, 2025 (NST)


  • Argo suffered an electrical power event on Friday evening (Jan 17) around 18h00 NST (21h30 UTC) which brought down some components. The cluster is back in production at this hour. Some compute nodes have not yet recovered; we are working to bring them back.
10:30, Jan 20, 2025 (NST)

2024

  • Argo suffered an electrical power event last night (Nov 19-20) which brought down some components. The cluster is back in production at this hour. Some compute nodes have not yet recovered; sysadmins are working to bring them back.
12:10, Nov 20, 2024 (NST)
  • Argo was offline from October 28 to 30, 2024 for electrical power work, some upgrades of infrastructure machines, and some software and firmware updates. Service was resumed on Thursday, October 31 at around 14h00 NDT with about 75% of its CPU capacity, while the remaining nodes are being worked on.
14:40, Oct 31, 2024 (NDT)
Update: The GPU nodes argo[72-73] have been returned to service
17:00, Nov 1, 2024 (NDT)

Placentia

  • Placentia was retired from general service as of 2019 Mar 31. A reduced number of compute nodes remain in service, with access restricted to MUN users who have made suitable arrangements. Contact support@ace-net.ca if you believe you should have access.

Nefelibata

  • Nefelibata has been retired from service
2025-10-01