On Sunday 2025-03-09, Gandi experienced a major incident on its platform caused by a filer storage system outage affecting multiple services including mailboxes.
Summary:
- Multiple services were severely disrupted from Sunday, March 9th 00:31:10 until 16:49:15 (UTC), including 39% of all mailboxes.
- Some mailboxes (~15%) remained unavailable until Monday, March 10th 10:29. However, all users had recovered all of their emails by Wednesday, March 12th 17:00.
- Importantly, this incident did not result in the loss or corruption of any data.
What was the root cause of the incident?
The main cause was the failure of an SSD storage filer. However, several additional factors contributed to the severity of the impact:
- Some systems, including internal monitoring, lacked effective redundancy measures to cope with the storage disruption
- Some systems that did have redundancy at the VM level were architected incorrectly, so that all of their VMs relied on the single impacted filer (see the placement sketch after this list)
- Some systems that were redundant at both the VM and storage level were not provisioned with enough capacity to handle the increased load when one of the instances failed.
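As a concrete illustration of the second factor above, here is a minimal sketch of a check that flags "redundant" services whose VMs all depend on the same filer. The inventory format, service names, and filer names are hypothetical, not our actual tooling:

```python
# Hypothetical placement audit: flag services that have several VMs but whose
# VMs all depend on a single storage filer, so a filer outage would still take
# the whole service down. The inventory data is illustrative only.
from collections import defaultdict

inventory = [
    {"service": "auth", "vm": "auth-01", "filer": "filer-a"},
    {"service": "auth", "vm": "auth-02", "filer": "filer-a"},  # redundant VMs, same filer
    {"service": "mail-front", "vm": "mf-01", "filer": "filer-a"},
    {"service": "mail-front", "vm": "mf-02", "filer": "filer-b"},
]

filers_per_service = defaultdict(set)
for entry in inventory:
    filers_per_service[entry["service"]].add(entry["filer"])

for service, filers in sorted(filers_per_service.items()):
    if len(filers) < 2:
        print(f"WARNING: all VMs of {service!r} depend on a single filer: {sorted(filers)}")
```

Run periodically against the real VM inventory, a check of this kind surfaces single-filer dependencies before an outage does.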
Full timeline:
Timestamp (UTC) | Event |
---|---|
2025-03-09 00:31:10 | Incident started; on-call responders began investigating over 1,500 alerts. Identifying the root cause was difficult, and the monitoring bot was unavailable |
2025-03-09 01:11:19 | Incident was escalated and CTO responded |
2025-03-09 01:21:51 | Public status published on status.gandi.net with the first impacted services identified |
2025-03-09 01:23:31 | Attempt to declare incident via ChatOps tooling |
2025-03-09 01:25:15 | VPN outage identified for employees outside the Ops team |
2025-03-09 01:33:03 | Problem identified: a filer has crashed |
2025-03-09 01:34:46 | Filer restart attempted |
2025-03-09 01:47:09 | Filer restart failed |
2025-03-09 02:16:21 | Responder dispatched to datacenter |
2025-03-09 03:31:11 | First report from datacenter – filer restarted manually after power disconnection |
2025-03-09 04:03:05 | Attempted restart failed to resolve the issue |
2025-03-09 04:15:51 | Started failover of service storage to a different filer |
2025-03-09 05:37:27 | All impacted systems identified; we confirmed that all emails were correctly queued and that no data loss was possible |
2025-03-09 06:40:04 | Additional responders arrive on site |
2025-03-09 07:01:41 | Firmware update started |
2025-03-09 07:15:07 | First critical service to respawn identified |
2025-03-09 07:20:55 | Firmware update failed |
2025-03-09 07:30:40 | Firmware update successful, but the problem persisted |
2025-03-09 07:41:11 | We identified that the firmware issue may be related to a PCI device, so we had to unrack the filer and remove all PCI devices |
2025-03-09 09:15:57 | We managed to get our monitoring bot back online |
2025-03-09 10:25:00 | We managed to recover the VPN so the support team could work correctly |
2025-03-09 16:49:15 | We managed to recover all the services except mailboxes |
2025-03-09 16:50:10 | We started recovering the mailboxes |
2025-03-10 10:29:06 | The filer was back online after multiple hardware changes, and VMs were also back online |
2025-03-10 11:30:15 | We identified that in some cases mail servers had not mounted the mailbox NFS share and were storing incoming email locally. This missing mount made all older emails disappear from the affected mailboxes (see the mount check sketch after the timeline) |
2025-03-10 13:30:00 | We started restoring the appropriate partition on impacted mailboxes. As a result, customers could see their old emails, but not the emails received during the incident. We started a procedure to correctly recover the emails that had been stored on the wrong partition; no mail was leaked, and each mailbox's email remained correctly separated |
2025-03-12 17:00:00 | We managed to restore all the emails on each mailbox to a dedicated folder |
2025-03-13 09:00:15 | Replication issues identified on the quota DB, the database that stores the used space of each mailbox. We needed to recreate the database as the error was unresolvable; the decision was made to take this opportunity to spawn a new database that meets our new standards. |
2025-03-14 14:30:00 | We recreated the database. This required reinjecting all the quotas from scratch, which meant recomputing the used space of every mailbox. Because an index was missing, the quota updates caused a lock issue, which made Postfix crash and impacted the email service again. |
2025-03-14 16:30:00 | All mailboxes were operational again, and we decided to postpone the remaining operations on the quota database and migrate it after the weekend. |
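Regarding the missing NFS mount noted at 2025-03-10 11:30:15, a minimal sketch of a startup guard is shown below. The mount point and the way such a guard is hooked into the mail stack are assumptions for illustration, not our actual setup:

```python
# Sketch of a startup guard for a mail server: refuse to start delivery if the
# mailbox directory is not backed by the expected NFS mount, so email is never
# written to the local disk by mistake. The path is a hypothetical example.
import sys

MAILBOX_ROOT = "/srv/mailboxes"  # hypothetical NFS mount point

def is_nfs_mount(path: str) -> bool:
    """Return True if `path` is listed as an NFS mount in /proc/mounts."""
    with open("/proc/mounts") as mounts:
        for line in mounts:
            _device, mount_point, fs_type, *_rest = line.split()
            if mount_point == path and fs_type.startswith("nfs"):
                return True
    return False

if not is_nfs_mount(MAILBOX_ROOT):
    sys.exit(f"{MAILBOX_ROOT} is not an NFS mount; refusing to start delivery")
```

The same protection can also be expressed declaratively, for example with a `RequiresMountsFor=` dependency in the systemd unit of the delivery service.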
Analysis
Identification of the root cause of the disruption was complicated by several factors:
- The internal authentication system was impacted, so multiple internal services and bots were unable to work properly. This service is redundant, with a keepalive mechanism that automatically switches services to the appropriate machine. However, because only the filer storage was unavailable, the keepalive did not trigger: the services were still reachable over the network, merely degraded and without access to their disks (see the health check sketch after this list).
- Adding to the complexity of the situation, the customer support team was also unable to operate, as all of their tools rely on either internal authentication or IP restrictions requiring a connection to the VPN.
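A sketch of the kind of health check that would catch this failure mode is shown below; the data path and the way it would be wired into keepalived are assumptions for illustration, not the actual configuration:

```python
# Sketch of a failover health check that fails when the backing storage is
# unusable even though the host is still reachable over the network.
# A tool such as keepalived can run a script like this as a track script and
# demote the node when it exits non-zero. The data path is hypothetical.
# Note: I/O against a dead filer can hang rather than fail, so in practice the
# check also needs an external timeout.
import os
import sys
import tempfile

DATA_PATH = "/srv/service-data"  # hypothetical mount served by the filer

def storage_is_healthy(path: str) -> bool:
    """Attempt a small write and fsync on the data path; any I/O error means unhealthy."""
    try:
        with tempfile.NamedTemporaryFile(dir=path) as probe:
            probe.write(b"healthcheck")
            probe.flush()
            os.fsync(probe.fileno())
        return True
    except OSError:
        return False

sys.exit(0 if storage_is_healthy(DATA_PATH) else 1)
```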
Remediation actions
After this incident, multiple decisions were taken to minimize the likelihood of a recurrence:
- First, improve redundancy for all of our monitoring bots: without monitoring, our visibility is limited, which clearly delayed our reaction time and made decisions more challenging.
- Improve redundancy mechanisms by configuring automatic shutdown of VMs on impacted filers
- Make sure that all redundant services are distributed across several filers.
- Update documentation and exercise workaround procedures for outages of critical infrastructure systems such as authentication and networking.
- Increase the number of VMs of some services to provide enough headroom to absorb traffic fluctuations and the extra load when a subset of instances is unavailable (a rough sizing illustration follows this list).
- Increase redundancy of the VMs exposing the mailboxes to the customers
- We are working on a switch from ZFS to Ceph, which will make us less exposed to hardware issues.
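As a rough illustration of the sizing point above (the figures are made up for the example, not measurements from this incident): with N identical instances at average utilization u, losing k of them pushes the survivors to roughly u × N / (N − k), so fleets need to be sized so that this value stays below a safe ceiling.

```python
# Illustrative capacity headroom calculation (made-up numbers, not measurements):
# losing `failed` of `n_instances` identical instances pushes the survivors'
# utilization to roughly u * N / (N - k).
def utilization_after_failure(n_instances: int, avg_utilization: float, failed: int) -> float:
    return avg_utilization * n_instances / (n_instances - failed)

# 3 instances at 60% load: losing one pushes the other two to 90%, leaving almost no margin.
print(utilization_after_failure(3, 0.60, 1))  # ~0.90
# 4 instances at 60% load: losing one lands the rest at 80%, leaving real headroom.
print(utilization_after_failure(4, 0.60, 1))  # ~0.80
```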
The incident was quite exceptional for many reasons, including the time frame. We would like to acknowledge the professionalism of our teams, as many were not on-call and volunteered to help on a Sunday.