Postmortem: September 30 storage incident

On September 30, 2020 at 05:38 UTC, one of our storage units used by our hosting services in FR-SD3 went down.

At 11:52 UTC we managed to bring the storage unit back online.

After that, our team worked to recover all the IaaS and PaaS services impacted by the storage unit incident.

Timeline

All times listed are in UTC.

September 30 01:12 – A ZFS pool is degraded on a storage unit at FR-SD3

This is a routine situation: a degraded pool means ZFS has removed a disk from service because it returned too many errors.
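For reference, a minimal sketch of how such a state shows up; the pool name below ("tank") is a placeholder, not our actual layout:

    # Report pools that are not healthy; a pool with a faulted disk shows
    # "state: DEGRADED" and lists the failed device in its vdev tree.
    zpool status -x
    zpool status tank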

September 30 01:30 – The On-duty SysOps replaces the failed disk with a spare one to allow the RAID to rebuild (resilver).
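As an illustration only, the routine replacement step looks roughly like this; pool and device names are placeholders:

    # Swap the faulted disk for the spare; ZFS starts resilvering onto the new device.
    zpool replace tank /dev/sdf /dev/sdk
    # The resilver progress is visible in the pool status.
    zpool status tank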

September 30 02:42 – Alarms begin to ring

The situation is not clear at this point. Compute nodes are under heavy load, and customers' instances are not responding well.

September 30 03:12 – The On-duty Ops calls the On-call Ops for reinforcement

September 30 03:29 – After performing a diagnosis:

The failed disk is a ZFS Intent Log (ZIL) device. Since the ZIL is a mirror of two devices, the situation was stable before the disk was replaced: when one device of the mirror fails, performance is not impacted because the remaining device keeps working normally.
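For context, a mirrored ZIL (a separate log device, or "slog") is typically set up like the sketch below; the pool and device names are placeholders:

    # Two SSDs added as a mirrored log vdev: the ZIL keeps its redundancy
    # even if one of the log devices fails.
    zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1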

However, the faulty ZIL device is a specific SSD designed to cache writes, and the On-duty Ops mistakenly replaced it with a mechanical drive.

Because the two device types do not offer the same performance, and because a mirror is only as fast as its slowest device, all write operations on this filer slowed down.
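A slowdown like this can usually be spotted from per-device I/O statistics; again, the pool name is a placeholder:

    # Per-vdev I/O statistics refreshed every 5 seconds; the log devices appear
    # under the "logs" section, where the mechanical drive's low write throughput
    # stands out next to the SSD.
    zpool iostat -v tank 5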

September 30 03:47 – After several attempts to remove the incorrectly selected ZIL disk, zpool commands stop responding. The On-call Ops decides to go on site to physically remove the bad device from the storage unit.
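For reference, dropping a device from a log mirror is normally a one-line operation (placeholder names below); in our case these commands hung:

    # Detach the slow drive from the log mirror, leaving the remaining SSD
    # as the only log device.
    zpool detach tank /dev/sdq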

September 30 04:47 – The On-call Ops is now on site.

September 30 04:50 – The bad disk has been physically removed from the storage unit.

September 30 05:00 – The situation for customers is improving, but the state of the storage unit is not healthy.

September 30 05:38 – We are rebooting the unit to reset it and be sure everything is fine.

September 30 05:45 – The storage unit is back online, but the pool import does not succeed because the second ZIL device is also marked as faulty.

Our priority is now to get the pool back online.

We force the import, ignoring the bad ZIL, but the import is very slow.

We remove all the ZIL devices and add a new one as a standalone log device, outside the mirror, since it is not the same size as the original devices.
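A rough sketch of this recovery sequence, with placeholder pool and device names (not a literal transcript of what was run):

    # Import the pool even though log devices are missing or faulty.
    zpool import -m tank
    # Drop the faulty log devices from the pool...
    zpool remove tank /dev/sdq /dev/sdr
    # ...then add the replacement SSD as a single, non-mirrored log device.
    zpool add tank log /dev/nvme2n1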

We don’t have any more spares for this specific device, so we have to fetch one from our stock at the office in Paris.

September 30 08:12 – Another SysOps is making their way to the datacenter with a spare ZIL device.

September 30 09:30 – The drives are now on site.

September 30 09:51 – The new ZIL slogs are in place. We have to wait for a resilver before we can take any further actions.
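A minimal way to watch for completion, assuming the same placeholder pool name:

    # "scan: resilver in progress" switches to "resilvered ... with 0 errors"
    # once the operation completes.
    zpool status tank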

September 30 11:13 – The resilver is done. We reboot the unit, the pool imports correctly, and service is back online.

Analysis

The problem was mainly due to human error, not to a flaw in the storage design.

The routine disk replacement procedure was not clear enough, which triggered the problem.

After that, recovering from the problem was not straightforward, and we took our time to avoid further mistakes and the risk of data loss. Since most of the recovery process is based on automation, we avoided any unusual actions to get back to normal.

We preferred to stick to the procedure rather than try out risky workarounds, but this took time.

We will update our procedures based on the problem we faced.