Postmortem: October 10, 2021 incident
On Sunday, October 10th, from 10:13 UTC to 18:16 UTC, an incident impacted our main platform, affecting the following services:
- www.gandi.net: offline until 16:39 UTC
- shop.gandi.net: offline until 16:39 UTC
- admin.gandi.net: offline until 16:39 UTC
- APIs: offline until 13:30 UTC
- Gandimail: unable to connect to email accounts or to read or send emails until 17:30 UTC
During this window, customers were not able to place orders or to buy or manage products through our website or APIs. Gandimail customers were also not able to access their emails.
Postmortem: what happened?
An electrical problem caused the loss of one electrical feed, resulting in an overcurrent on the remaining feed.
The loss of a single electrical feed should not have had an impact on our production, but we were in the middle of a migration to replace old servers, which led us to temporarily concentrate too many servers in one rack.
In a normal situation, the electrical problem would not have impacted us, because server redundancy is enforced across several racks, each with two electrical feeds. We do have a Disaster Recovery Plan (DRP) that lets us fail over to another datacenter, but we engage it only in the event of a major disaster, such as the full loss of a datacenter and its data.
We should have avoided such a concentration during the migration, and we will put new procedures in place to avoid such cases in the future.
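To illustrate the failure mode described above, here is a minimal sketch of a check that asks whether a rack's total load would still fit on a single feed if the other feed were lost. It does not describe our actual tooling or procedures: the `Rack` class, the `survives_feed_loss` function, the 80% headroom, and all amperage figures are hypothetical values chosen only for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rack:
    name: str
    feed_capacity_amps: float       # rated capacity of EACH feed (hypothetical)
    server_draw_amps: List[float]   # per-server current draw (hypothetical)


def survives_feed_loss(rack: Rack, headroom: float = 0.8) -> bool:
    """Return True if the whole rack load fits on one feed.

    `headroom` is an assumed safety margin: only this fraction of the
    feed's rated capacity is treated as usable.
    """
    total_draw = sum(rack.server_draw_amps)
    return total_draw <= rack.feed_capacity_amps * headroom


# Hypothetical figures: 16 servers at 1.5 A each fit on one 32 A feed,
# while 24 servers packed into the same rack during a migration do not.
normal = Rack("rack-a", 32.0, [1.5] * 16)
crowded = Rack("rack-b", 32.0, [1.5] * 24)

for rack in (normal, crowded):
    status = "ok" if survives_feed_loss(rack) else "overcurrent risk on one feed"
    print(f"{rack.name}: {status}")
```

Running a check of this kind before moving servers into a rack would flag a temporary concentration that two healthy feeds hide but a single feed cannot carry.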