On Sunday 2025-03-09, Gandi experienced a major incident on its platform caused by a filer storage system outage affecting multiple services including mailboxes.
Summary:
- Multiple services were severely disrupted from Sunday, March 9th 00:31:10 until 16:49:15 (UTC), including 39% of all mailboxes.
- Some mailboxes (~15%) remained unavailable until Monday, March 10th 10:29. However, all users had recovered all of their emails by Wednesday, March 12th 17:00.
- Importantly, this incident did not result in the loss or corruption of any data.
What was the root cause of the incident?
The main cause was the failure of an SSD storage filer. However, several additional factors contributed to the severity of the impact:
- Some systems, including internal monitoring, lacked effective redundancy measures to cope with the storage disruption
- Some systems that did have redundancy at the VM level were architected incorrectly, so that all of their VMs relied on the single impacted filer (see the placement sketch after this list)
- Some systems that were redundant at both the VM and storage level were not provisioned with enough capacity to handle the increased load when one of the instances failed.
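As a concrete illustration of the second factor above, here is a minimal sketch of a check that flags "redundant" services whose VMs all depend on the same filer. The inventory format, service names, and filer names are hypothetical, not our actual tooling:

```python
# Hypothetical placement audit: flag services that have several VMs but whose
# VMs all depend on a single storage filer, so a filer outage would still take
# the whole service down. The inventory data is illustrative only.
from collections import defaultdict

inventory = [
    {"service": "auth", "vm": "auth-01", "filer": "filer-a"},
    {"service": "auth", "vm": "auth-02", "filer": "filer-a"},  # redundant VMs, same filer
    {"service": "mail-front", "vm": "mf-01", "filer": "filer-a"},
    {"service": "mail-front", "vm": "mf-02", "filer": "filer-b"},
]

filers_per_service = defaultdict(set)
for entry in inventory:
    filers_per_service[entry["service"]].add(entry["filer"])

for service, filers in sorted(filers_per_service.items()):
    if len(filers) < 2:
        print(f"WARNING: all VMs of {service!r} depend on a single filer: {sorted(filers)}")
```

Run periodically against the real VM inventory, a check of this kind surfaces single-filer dependencies before an outage does.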
Full timeline:
Timestamp (UTC) | Event |
---|---|
2025-03-09 00:31:10 | Incident started; on-call responders began investigating over 1,500 alerts. Identifying the root cause was difficult, and the monitoring bot was unavailable |
2025-03-09 01:11:19 | Incident was escalated and CTO responded |
2025-03-09 01:21:51 | Public status published on status.gandi.net with the first impacted services identified |
2025-03-09 01:23:31 | Attempt to declare incident via ChatOps tooling |
2025-03-09 01:25:15 | VPN outage identified for employees outside the Ops team |
2025-03-09 01:33:03 | Problem identified: a filer has crashed |
2025-03-09 01:34:46 | Filer restart attempted |
2025-03-09 01:47:09 | Filer restart failed |
2025-03-09 02:16:21 | Responder dispatched to datacenter |
2025-03-09 03:31:11 | First report from datacenter – filer restarted manually after power disconnection |
2025-03-09 04:03:05 | Attempted restart failed to resolve the issue |
2025-03-09 04:15:51 | Started failover of service storage to a different filer |
2025-03-09 05:37:27 | All impacted systems identified; we confirmed that all emails were correctly queued and that no data loss was possible |
2025-03-09 06:40:04 | Additional responders arrive on site |
2025-03-09 07:01:41 | Firmware update started |
2025-03-09 07:15:07 | First critical service to respawn identified |
2025-03-09 07:20:55 | Firmware update failed |
2025-03-09 07:30:40 | Firmware update successful, but the problem persisted |
2025-03-09 07:41:11 | We identified that the firmware issue may be related to a PCI device, so we had to unrack the filer and remove all PCI devices |
2025-03-09 09:15:57 | We managed to get our monitoring bot back online |
2025-03-09 10:25:00 | We managed to recover the VPN so the support team could work correctly |
2025-03-09 16:49:15 | We managed to recover all the services except mailboxes |
2025-03-09 16:50:10 | We started recovering the mailboxes |
2025-03-10 10:29:06 | The filer was back online after multiple hardware changes, and VMs were also back online |
2025-03-10 11:30:15 | We identified that in some cases mail servers had not mounted the mailbox NFS share and were storing incoming email locally. This missing mount made all older emails disappear from the affected mailboxes (see the mount check sketch after the timeline) |
2025-03-10 13:30:00 | We started restoring the appropriate partition on impacted mailboxes. As a result, customers could see their old emails, but not the emails received during the incident. We started a procedure to correctly recover the emails that had been stored on the wrong partition; no mail was leaked, and each mailbox's email remained correctly separated |
2025-03-12 17:00:00 | We managed to restore all the emails on each mailbox to a dedicated folder |
2025-03-13 09:00:15 | Replication issues identified on the quota DB, the database that stores the used space of each mailbox. We needed to recreate the database as the error was unresolvable; the decision was made to take this opportunity to spawn a new database that meets our new standards. |
2025-03-14 14:30:00 | We recreated the database. This required reinjecting all the quotas from scratch, which meant recomputing the used space of every mailbox. Because an index was missing, the quota updates caused a lock issue, which made Postfix crash and impacted the email service again. |
2025-03-14 16:30:00 | All mailboxes were operational again, and we decided to postpone the remaining operations on the quota database and migrate it after the weekend. |
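Regarding the missing NFS mount noted at 2025-03-10 11:30:15, a minimal sketch of a startup guard is shown below. The mount point and the way such a guard is hooked into the mail stack are assumptions for illustration, not our actual setup:

```python
# Sketch of a startup guard for a mail server: refuse to start delivery if the
# mailbox directory is not backed by the expected NFS mount, so email is never
# written to the local disk by mistake. The path is a hypothetical example.
import sys

MAILBOX_ROOT = "/srv/mailboxes"  # hypothetical NFS mount point

def is_nfs_mount(path: str) -> bool:
    """Return True if `path` is listed as an NFS mount in /proc/mounts."""
    with open("/proc/mounts") as mounts:
        for line in mounts:
            _device, mount_point, fs_type, *_rest = line.split()
            if mount_point == path and fs_type.startswith("nfs"):
                return True
    return False

if not is_nfs_mount(MAILBOX_ROOT):
    sys.exit(f"{MAILBOX_ROOT} is not an NFS mount; refusing to start delivery")
```

The same protection can also be expressed declaratively, for example with a `RequiresMountsFor=` dependency in the systemd unit of the delivery service.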
Analysis
Identification of the root cause of the disruption was complicated by several factors:
- The internal authentication system was impacted, so multiple internal services and bots were unable to work properly. This service is redundant, with a keepalive mechanism that automatically switches services to the appropriate machine. However, because only the filer storage was unavailable, the keepalive did not trigger: the services were still reachable over the network, merely degraded and without access to their disks (see the health check sketch after this list).
- Adding to the complexity of the situation, the customer support team was also unable to operate, as all of their tools rely on either internal authentication or IP restrictions requiring a connection to the VPN.
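A sketch of the kind of health check that would catch this failure mode is shown below; the data path and the way it would be wired into keepalived are assumptions for illustration, not the actual configuration:

```python
# Sketch of a failover health check that fails when the backing storage is
# unusable even though the host is still reachable over the network.
# A tool such as keepalived can run a script like this as a track script and
# demote the node when it exits non-zero. The data path is hypothetical.
# Note: I/O against a dead filer can hang rather than fail, so in practice the
# check also needs an external timeout.
import os
import sys
import tempfile

DATA_PATH = "/srv/service-data"  # hypothetical mount served by the filer

def storage_is_healthy(path: str) -> bool:
    """Attempt a small write and fsync on the data path; any I/O error means unhealthy."""
    try:
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"healthcheck")
            probe.flush()
            os.fsync(probe.fileno())
        return True
    except OSError:
        return False

sys.exit(0 if storage_is_healthy(DATA_PATH) else 1)
```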
Remediation actions
After this incident, multiple decisions were taken to minimize the likelihood of a recurrence:
- First, improve redundancy for all of our monitoring bots: without monitoring, our visibility is limited, which clearly delayed our reaction time and made decisions more challenging.
- Improve redundancy mechanisms by configuring automatic shutdown of VMs on impacted filers
- Make sure that all redundant services are distributed across several filers.
- Update documentation and exercise workaround procedures for outages of critical infrastructure systems such as authentication and networking.
- Increase the number of VMs of some services to provide enough headroom to absorb traffic fluctuations and the extra load when a subset of instances is unavailable (a rough sizing illustration follows this list).
- Increase redundancy of the VMs exposing the mailboxes to the customers
- We are working on a switch from ZFS to Ceph, which will make us less exposed to hardware issues.
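As a rough illustration of the sizing point above (the figures are made up for the example, not measurements from this incident): with N identical instances at average utilization u, losing k of them pushes the survivors to roughly u × N / (N − k), so fleets need to be sized so that this value stays below a safe ceiling.

```python
# Illustrative capacity headroom calculation (made-up numbers, not measurements):
# losing `failed` of `n_instances` identical instances pushes the survivors'
# utilization to roughly u * N / (N - k).
def utilization_after_failure(n_instances: int, avg_utilization: float, failed: int) -> float:
    return avg_utilization * n_instances / (n_instances - failed)

# 3 instances at 60% load: losing one pushes the other two to 90%, leaving almost no margin.
print(utilization_after_failure(3, 0.60, 1))  # ~0.90
# 4 instances at 60% load: losing one lands the rest at 80%, leaving real headroom.
print(utilization_after_failure(4, 0.60, 1))  # ~0.80
```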
The incident was quite exceptional for many reasons, including the time frame. We would like to acknowledge the professionalism of our teams, as many were not on-call and volunteered to help on a Sunday.