Alerts and incidents

Gandi incident on March 9, 2025


On Sunday 2025-03-09, Gandi experienced a major incident on its platform caused by a filer storage system outage affecting multiple services including mailboxes.

What was the root cause of the incident?

The main cause was the failure of an SSD storage filer. However, several additional factors contributed to the severity of the impact:

  • Some systems, including internal monitoring, lacked effective redundancy measures to cope with the storage disruption
  • Some systems which did have redundancy at the VM level were incorrectly architected so that all the VMs relied on the single impacted filer (see the placement-check sketch after this list)
  • Some systems that were redundant at both the VM and storage level were not provisioned with enough capacity to handle the increased load when one of the instances failed.
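
As a concrete illustration of the second point, a periodic placement audit can flag services whose "redundant" VMs all depend on the same storage filer. The sketch below is illustrative only: the inventory mapping and the names it uses (VM_INVENTORY, filer-a, and so on) are assumptions for the example, not Gandi's actual tooling.

```python
from collections import defaultdict

# Hypothetical inventory: VM name -> (service it belongs to, filer hosting its disks).
# In a real setup this mapping would come from the virtualization platform or CMDB.
VM_INVENTORY = {
    "auth-01": ("auth", "filer-a"),
    "auth-02": ("auth", "filer-a"),   # redundant in name only: same filer as auth-01
    "mail-01": ("mail", "filer-a"),
    "mail-02": ("mail", "filer-b"),
}


def services_on_single_filer(inventory: dict[str, tuple[str, str]]) -> list[str]:
    """Return services that have several VMs but whose disks all sit on one filer."""
    filers_by_service: dict[str, set[str]] = defaultdict(set)
    vm_count: dict[str, int] = defaultdict(int)
    for service, filer in inventory.values():
        filers_by_service[service].add(filer)
        vm_count[service] += 1
    return [
        service
        for service, filers in filers_by_service.items()
        if vm_count[service] > 1 and len(filers) == 1
    ]


if __name__ == "__main__":
    for service in services_on_single_filer(VM_INVENTORY):
        print(f"WARNING: all VMs of '{service}' share a single filer")
```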

Full timeline (all timestamps UTC):

  • 2025-03-09 00:31:10 – Incident started; on-call responders began investigating more than 1,500 alerts. It was difficult to identify the root cause, and the monitoring bot was unavailable
  • 2025-03-09 01:11:19 – Incident escalated and the CTO responded
  • 2025-03-09 01:21:51 – Public status published on status.gandi.net with the first impacted services identified
  • 2025-03-09 01:23:31 – Attempt to declare the incident via ChatOps tooling
  • 2025-03-09 01:25:15 – VPN outage identified for non-Ops employees
  • 2025-03-09 01:33:03 – Problem identified: a filer had crashed
  • 2025-03-09 01:34:46 – Filer restart attempted
  • 2025-03-09 01:47:09 – Filer restart failed
  • 2025-03-09 02:16:21 – Responder dispatched to the datacenter
  • 2025-03-09 03:31:11 – First report from the datacenter: the filer was restarted manually after a power disconnection
  • 2025-03-09 04:03:05 – The attempted restart failed to resolve the issue
  • 2025-03-09 04:15:51 – Started failing over service storage to a different filer
  • 2025-03-09 05:37:27 – All impacted systems identified; we confirmed that all emails were correctly queued and that no data loss was possible
  • 2025-03-09 06:40:04 – Additional responders arrived on site
  • 2025-03-09 07:01:41 – Firmware update started
  • 2025-03-09 07:15:07 – First critical service to respawn identified
  • 2025-03-09 07:20:55 – Firmware update failed
  • 2025-03-09 07:30:40 – Firmware update successful, but the problem persisted
  • 2025-03-09 07:41:11 – We identified that the firmware issue might be related to a PCI device, so we had to unrack the filer and remove all PCI devices
  • 2025-03-09 09:15:57 – Monitoring bot back online
  • 2025-03-09 10:25:00 – VPN recovered so that the support team could work normally
  • 2025-03-09 16:49:15 – All services recovered except mailboxes
  • 2025-03-09 16:50:10 – Started recovering the mailboxes
  • 2025-03-10 10:29:06 – The filer was back online after multiple hardware changes, and the VMs were back online as well
  • 2025-03-10 11:30:15 – We identified that in some cases mail servers had not mounted the mailbox NFS share and were storing email locally. This incorrect mount made all older emails disappear from the affected mailboxes (see the mount-check sketch after this timeline)
  • 2025-03-10 13:30:00 – We started restoring the appropriate partition on impacted mailboxes. As a result, customers could see their old emails but not the emails received during the incident; we started a procedure to correctly recover the emails stored on the wrong partition. No mail was leaked, and each mailbox had its email correctly segmented
  • 2025-03-12 17:00:00 – We restored all the emails of each mailbox to a dedicated folder
  • 2025-03-13 09:00:15 – Replication issues identified on the quota database, which stores the used space of each mailbox. The error was unresolvable, so we needed to recreate the database, and we decided to take the opportunity to spawn a new database that meets our new standards
  • 2025-03-14 14:30:00 – We recreated the database. Creating it required injecting all quotas from scratch, which meant recomputing the used space of every mailbox. Because of a missing index, the quota updates caused a lock issue that made Postfix crash and impacted the email service again
  • 2025-03-14 16:30:00 – All mailboxes were operational again, and we decided to postpone the remaining operations on the quota database and migrate it after the weekend
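
For the incorrect-mount failure noted at 2025-03-10 11:30:15, one common safeguard is a pre-flight check that refuses to start the mail service unless the mailbox path really is a mountpoint. This is a hedged sketch under that assumption, not the procedure Gandi used; the path /var/mail/storage is a placeholder.

```python
#!/usr/bin/env python3
"""Pre-flight sketch: refuse to serve mail if the mailbox store is not mounted.

Illustrative only; the path below is a placeholder, not Gandi's layout.
"""
import os
import sys

MAILBOX_ROOT = "/var/mail/storage"  # placeholder for the NFS-backed mailbox path


def mailbox_store_is_mounted(path: str) -> bool:
    # os.path.ismount() is True only when `path` is itself a mountpoint,
    # so a missing NFS mount (a plain local directory) is detected.
    return os.path.ismount(path)


if __name__ == "__main__":
    if not mailbox_store_is_mounted(MAILBOX_ROOT):
        print(f"refusing to start: {MAILBOX_ROOT} is not a mountpoint", file=sys.stderr)
        sys.exit(1)
    print("mailbox store mounted; safe to start the mail service")
```

Such a check could run before the mail server starts (for example as a systemd ExecStartPre step) so that a missing NFS mount stops the service instead of silently falling back to local storage.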

Analysis

Identification of the root cause of the disruption was complicated by several factors:

  • The internal authentication system was impacted, so multiple internal services and bots were unable to work properly. This service is redundant, with a keepalive mechanism that automatically switches services to the appropriate machine. However, because only the filer storage was unavailable, the keepalive did not trigger: the services were still reachable over the network, only in a degraded state without access to their disks (see the health-check sketch after this list)
  • Adding to the complexity of the situation, the customer support team was also unable to operate, as all of their tools use either internal authentication or IP restrictions requiring a VPN connection
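
A failover check that probes the storage itself, rather than plain network reachability, would have caught this degraded state. Below is a minimal sketch, not Gandi's actual configuration: the data directory (/srv/data) and the timeout are assumptions, and the script is only meant to be wired into a failover mechanism such as keepalived's track_script.

```python
#!/usr/bin/env python3
"""Minimal health-check sketch: fail when the backing storage is unwritable.

Illustrative only; probes the disk rather than network reachability.
"""
import os
import signal
import sys
import tempfile

DATA_DIR = "/srv/data"   # placeholder: directory served from the filer
TIMEOUT_S = 5            # a hung filer/NFS mount often blocks instead of erroring


def _timeout(signum, frame):
    raise TimeoutError("storage probe timed out")


def storage_is_writable(path: str) -> bool:
    signal.signal(signal.SIGALRM, _timeout)
    signal.alarm(TIMEOUT_S)
    try:
        # Write, fsync and remove a small probe file on the monitored volume.
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"ok")
            probe.flush()
            os.fsync(probe.fileno())
        return True
    except OSError:  # includes TimeoutError raised by the alarm handler
        return False
    finally:
        signal.alarm(0)


if __name__ == "__main__":
    # Exit non-zero so the failover mechanism can demote this node
    # when its storage is degraded, even though the host is still reachable.
    sys.exit(0 if storage_is_writable(DATA_DIR) else 1)
```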

Remediation actions 

After this incident, multiple decisions were taken to minimize the likelihood of a recurrence:

  • First, improve redundancy for all of our monitoring bots: without monitoring we are restricted in our ability to see what is going on, which clearly delayed our reaction time and made our decisions more challenging.
  • Improve redundancy mechanisms by configuring automatic shutdown of VMs on impacted filers.
  • Make sure that all redundant services are distributed across several filers.
  • Update documentation and exercise workaround procedures for outages of critical infrastructure systems such as authentication and networking.
  • Increase the number of VMs for some services to provide enough headroom to absorb traffic fluctuations if a subset of instances is unavailable (see the capacity sketch after this list).
  • Increase the redundancy of the VMs exposing the mailboxes to customers.
  • We are working on switching from ZFS to Ceph, which will make us less exposed to hardware issues.
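
The capacity point above comes down to simple N-1 arithmetic: after losing one instance, the remaining ones must still absorb peak load. A minimal sketch, with made-up numbers rather than real Gandi figures:

```python
def survives_instance_loss(peak_load: float, instances: int,
                           per_instance_capacity: float, lost: int = 1) -> bool:
    """True if the remaining instances can still carry the peak load."""
    remaining = instances - lost
    return remaining > 0 and remaining * per_instance_capacity >= peak_load


if __name__ == "__main__":
    # Made-up example: 3 VMs of 100 req/s each against a 250 req/s peak.
    # Losing one VM leaves 200 req/s of capacity, which is not enough.
    print(survives_instance_loss(peak_load=250, instances=3,
                                 per_instance_capacity=100))  # False
    # With 4 VMs, the N-1 capacity is 300 req/s and the service holds.
    print(survives_instance_loss(peak_load=250, instances=4,
                                 per_instance_capacity=100))  # True
```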

The incident was quite exceptional for many reasons, including its time frame. We would like to acknowledge the professionalism of our teams: many people who were not on-call volunteered to help on a Sunday.
