Postmortem: Paris DC hosting outage

10.09.2014 - written by  in Cloud

On Tuesday, 7 October, we experienced a series of serious incidents affecting some of the storage units in our Parisian datacenter. These incidents caused two interruptions in service for some of our customers, affecting both Simple Hosting instances and IaaS servers.

The combined effect of these interruptions represents the most serious hosting outage we’ve had in three years.

First and foremost, we want to apologize. We understand how disruptive this was for many of you, and we want to make it right.

In accordance with our Service Level Agreement, we will be issuing compensation to those whose services were unavailable.

Here’s what happened:

On Tuesday, October 7, shortly before 8:00 p.m. Paris time (11:00 a.m. PDT), a storage unit in our Parisian datacenter housing a part of the disks of our IaaS servers and Simple Hosting instances became unresponsive.

At 8:00 p.m., after ruling out the most likely causes, we made the decision to switch to the backup equipment.

At 9:00 p.m., after one hour of importing data, the operation was interrupted, leading to a lengthy investigation that resulted in eventually falling back to the original storage unit. Our team, having determined the culprit to be the caching equipment, proceeded to change the disk of the write journal.

At 2:00 a.m., the storage unit whose disk had been replaced was rebooted.

Between 3:00 and 5:30 a.m., the recovery from a 6-hour outage caused a heavy overload, both on the network level and on the storage unit itself. The storage unit became unresponsive, and we were forced to restart the VMs in waves.

At 8:30 a.m., all the VMs and instances were once again functional, with a few exceptions which were handled manually.

We inspected our other storage units that were using the same model of disk, replacing one of them as a precaution.

At 12:30 p.m., we began investigating some slight misbehavior exhibited by the storage unit whose drive we had replaced as a precaution.

At 3:50 p.m., three virtual disks and a dozen VMs became unresponsive. We investigated and identified the cause, and proceeded to update the storage unit while our engineers worked on the fix.

Unfortunately, this update caused an unexpected automatic reboot, causing another interruption for the other Simple Hosting instances and IaaS servers on that storage unit.

By 4:15 p.m., all Simple Hosting instances were functional again, but there were problems remounting IaaS disks. By 5:30 p.m., 80% of the disks were accessible again, with the rest following by 5:45 p.m.

This latter incident lasted about two hours (4:00 to 6:00 p.m.). During this time, all hosting operations (creating, starting, or stopping servers) were queued.

Due to the large number of queued operations, it took until 7:30 p.m. for all of them to complete.

These incidents have seriously impacted the quality of our service, and for this we are truly sorry. We have already begun taking steps to minimize the consequences of such incidents in the future, and are working on tools to more accurately predict the risk of such hardware failures.

We are also working on a customer-facing tool for incident tracking which will be announced in the coming days. 

Thank you for using Gandi, and please accept our sincere apologies. If you have any questions, please do not hesitate to contact us.

The Gandi team