Cluster Status
Clusters
| Cluster | Status | Planned Outage | Notes |
|---|---|---|---|
| Argo | Online | - | |
| Siku | Online | - | reduced capacity due to A/C issues |
| Placentia | Online | - | Restricted since March 2019 |
For national clusters (Arbutus, Fir, Narval, Nibi, Rorqual, Trillium) see status.alliancecan.ca
Services
| Service | Status | Planned Outage | Notes |
|---|---|---|---|
| Globus at Argo | Online | - | |
| Globus at Siku | Online | - | Academic users only |
| Account creation | Manual | No outages | Write to support |
| PGI and Intel licenses | Online | No outages | |
Legend:
| Status | Meaning |
|---|---|
| Online | cluster is up and running |
| Offline | no users can log in or submit jobs, or the service is not working |
| Online | some users can log in and/or there are problems affecting your work |
Outage schedule
Jobs will not be scheduled with a run time (--time=) that extends into the beginning of a planned outage period. This is so the job will not be terminated prematurely when the system goes down.
- Siku is still running at reduced capacity due to ongoing A/C issues. See below for details.
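The scheduling rule above can be checked by hand before submitting a job. The following is a minimal sketch, assuming the scheduler is Slurm (as the --time= option suggests) and that a job is acceptable when it ends at or before the outage start; the scheduler may apply its own margin. The function names, dates, and times are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Newfoundland Standard Time is UTC-3:30 (a fixed offset; NDT would be UTC-2:30).
NST = timezone(timedelta(hours=-3, minutes=-30), "NST")


def fits_before_outage(now, run_time, outage_start):
    """Return True if a job starting at `now` with limit `run_time`
    would end at or before `outage_start` (assumed boundary rule)."""
    return now + run_time <= outage_start


def longest_time_limit(now, outage_start):
    """Return the longest limit that still ends before the outage,
    formatted as D-HH:MM:SS, or None if the outage has already begun."""
    remaining = outage_start - now
    if remaining <= timedelta(0):
        return None
    total_seconds = int(remaining.total_seconds())
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Illustrative values only: an outage starting at 10:00 NST, checked 26 hours earlier.
outage_start = datetime(2025, 11, 24, 10, 0, tzinfo=NST)
now = datetime(2025, 11, 23, 8, 0, tzinfo=NST)

print(fits_before_outage(now, timedelta(hours=24), outage_start))
print(fits_before_outage(now, timedelta(hours=48), outage_start))
print(longest_time_limit(now, outage_start))
```

The check assumes the job starts immediately. A job that waits in the queue starts later, so its time limit would need to be shorter still to end before the outage.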
Siku
2025 (September to December)
- We are experiencing issues with the cooling in Siku's data centre. Therefore we will have to terminate all running jobs later today in order to reduce the load on the air-conditioning unit. We will post updates here once we know more.
- Monday, Nov 3, 12h15 NST
- UPDATE #1 Mon Nov 3, 16h00: We have terminated most of the running jobs. Affected users have been contacted individually. Since the temperature in the data centre has stabilized, we are not planning on cancelling any more jobs, though that could change. We are now waiting for the A/C technicians to identify and fix the problem with the cooling unit, though we don't yet have a timeline for that.
- UPDATE #2 Tue Nov 4, 14h15 NST: We have continuing problems with Siku's A/C unit. Work is being done to assess and mitigate. More news will follow tomorrow afternoon.
- UPDATE #3 Fri Nov 7, 14h30 NST: The A/C has been partially fixed and Siku now operates at a reduced capacity with the following limitations:
  - CPU jobs only, because GPU jobs could exceed our current cooling capacity.
  - Only jobs up to 72 hours (24 hours for def-* accounts), to reduce impact in case cooling issues arise again and to improve turnover.
  - We will monitor A/C capacity and reassess whether we can increase Siku's capacity further.
  - We don't yet have a timeline for Siku's full return to service, since we are waiting for replacement parts for the A/C.
- UPDATE #4 Fri Dec 5, 09h40 NST: Parts have been received, and repair of the A/C is expected to begin Monday, December 8.
For older outages see: Previous outages
Argo
2025
- Argo: This morning at about 8:30 am NST (12h00 UTC), IT Services implemented a network change that caused a short interruption of Argo's external network connection. Running jobs were not affected, though some active SSH sessions and file transfers may have dropped. The work was completed in less than five minutes.
- Monday, Dec 8, 09:00 NST
- Partial power outage at Argo. MUN Facilities Management will be powering down individual power rails, causing some or all nodes to reboot. A reservation will prevent jobs from starting unless they finish before 10am Nfld (13h30 UTC) on that day.
- Monday, November 24, 2025
- 1st UPDATE Mon Nov 24, 17h15 NST: Work has been completed for today, but we expect more work to be carried out tomorrow. We've allowed jobs to run overnight as long as they finish by 7h30 NST (11h00 UTC) Tuesday, Nov 25.
- 2nd UPDATE Tue Nov 25, 16h15 NST: Power work in the Argo data centre was completed and Argo has returned to full capacity. We don't expect any similar outages in the near future.
- At about 11:30 a.m. NST we noticed a sudden loss of all networked filesystems (/home, /project and /scratch) at Argo. To resolve the issue, all nodes had to be rebooted.
- Wednesday Nov 5, 12:30 NST
- UPDATE Wed Nov 5, 17h10 NST: Almost all nodes of Argo have returned to normal service. Remaining nodes will be re-enabled tomorrow.
- Memorial University is experiencing network connectivity issues that may prevent access to Argo and Siku. MUN IT Services are investigating and working towards resolving the issues.
- Wednesday Aug 27, 10:30 NDT
- UPDATE Thu Aug 28, 10h30: Memorial University announced that all network issues had been resolved by Wednesday 18h00 NDT. Since the issues mostly affected connections from MUN's network to resources outside of the university, the impact on users of ACENET systems was minimal.
- Both Siku and Argo started experiencing network connectivity issues around 09h30 NDT. The systems team is on-site to investigate and is working on resolving the issue.
- Tue Jun 3 2025 10:24 NDT
- UPDATE June 3, 12h00 NDT: External network connectivity was restored to both Siku and Argo about 30 minutes ago. Running and queued jobs were unaffected by this issue.
- Both Siku and Argo were offline from March 18 to 20 for network and system maintenance.
During the outage, the public IP addresses of both clusters were changed and moved to a different subnet, and software updates were installed.
Also, storage quotas are now being enforced at Argo.
- Wed Mar 12 2025 11:30 NDT
- UPDATE March 18, 08h30 NDT: The planned maintenance has started. We will continue to post updates here.
- UPDATE March 20, 10h10 NDT: The planned maintenance has been completed and job scheduling has been resumed.
- UPDATE March 27, 09h30 NDT: Globus file transfer at Argo has been restored.
- Due to a critical cooling failure in the data-centre we had to perform an emergency shutdown of Argo on the morning of Saturday, February 15th. We expect Argo to become available again sometime on Monday, February 17.
- 12:30, Feb 15, 2025 (NST)
- Update #1: Argo's login nodes and filesystems are available again; however, the compute nodes will remain offline until next week.
- 14:30, Feb 15, 2025 (NST)
- Update #2: Over the course of today we have released about half of Argo's CPU nodes and all GPU nodes back into production. We continue to work on the remaining nodes.
- 16:30, Feb 17, 2025 (NST)
- Update #3: Most of Argo's compute nodes are back in production and we will continue enabling the remaining ones as soon as they are available.
- 13:30, Feb 19, 2025 (NST)
- Argo suffered an electrical power event on Friday evening (Jan 17) around 18h00 NST (21h30 UTC) which brought down some components. The cluster is back in production at this hour. Some compute nodes have not yet recovered; we are working to bring them back.
- 10:30, Jan 20, 2025 (NST)
2024
- Argo suffered an electrical power event last night (Nov 19-20) which brought down some components. The cluster is back in production at this hour. Some compute nodes have not yet recovered; sysadmins are working to bring them back.
- 12:10, Nov 20, 2024 (NST)
- Argo was offline from October 28 to 30, 2024 for electrical power work, some upgrades of infrastructure machines, and some software and firmware updates. Service was resumed on Thursday, October 31 at around 14h00 NDT with about 75% of its CPU capacity; the remaining nodes are still being worked on.
- 14:40, Oct 31, 2024 (NDT)
- Update: The GPU nodes argo[72-73] have been returned to service.
- 17:00, Nov 1, 2024 (NDT)
Placentia
- Placentia was retired from general service as of 2019 Mar 31. A reduced number of compute nodes remain in service, with access restricted to MUN users who have made suitable arrangements. Contact support@ace-net.ca if you believe you should have access.
Nefelibata
- Nefelibata has been retired from service.
- 2025-10-01