Postmortem of the failure of one hosting storage unit at LU-BI1 on January 8, 2020

On January 8, 2020 at 14:51 UTC, one of our storage units used for our hosting services hosted at LU-BI1 went down.

We managed to restore the data and bring services back online the morning of January 13.

The incident impacted, at most, 414 customers hosted at our Luxembourg facility.

First we’ll explain the situation with a bit more context.
Next, we’ll give you the technical timeline.
And third, we’ll provide the full postmortem.

1. The situation and our reflections on it

Gandi provides IaaS and PaaS services.

To store your data, we use a file system called ZFS, running on either Nexenta or FreeBSD as the operating system. Nexenta is used on the older version of our storage system, which we will be scheduling for migration.

To protect your data against hard drive failure, we use ZFS with triple mirroring: each mirror is made of three disks, so we can lose up to two thirds of the disks in the entire pool, as long as every mirror keeps at least one healthy disk.
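
To make the "two thirds" figure concrete, here is a minimal sketch in shell. It assumes a pool built only from three-way mirrors; the helper name pool_survives and the example layouts are hypothetical and do not describe the actual pool. It shows that the figure is a best case that depends on how the failures are distributed.

```shell
#!/bin/sh
# Hypothetical helper: each argument is the number of failed disks in
# one three-way mirror (illustrative values, not real pool data).
# Prints "ok" if every mirror still has a healthy disk, "lost" otherwise.
pool_survives() {
    local disks_per_mirror=3
    local failed
    for failed in "$@"; do
        if [ "$failed" -ge "$disks_per_mirror" ]; then
            echo "lost"
            return 0
        fi
    done
    echo "ok"
}

# Four mirrors (12 disks), 8 failures spread as 2 per mirror: prints "ok".
pool_survives 2 2 2 2
# Only 3 failures, but all in the same mirror: prints "lost".
pool_survives 0 3 0 0
```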

ZFS allows customers to take snapshots, which are images of their disk, at any given moment. It allows customers to roll back changes in case of a mistake, like deleting a file for example.
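
At the ZFS level, the operations behind this feature look roughly like the sketch below. The dataset name tank/customer-disk and the snapshot label are hypothetical, customers normally go through Gandi's own interface rather than running these commands, and the run wrapper with its DRY_RUN switch exists only so the sketch can be tried without touching a real pool.

```shell
#!/bin/sh
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# take_snapshot DATASET LABEL: point-in-time image of a dataset.
take_snapshot() {
    run zfs snapshot "$1@$2"
}

# roll_back DATASET LABEL: revert the dataset to that snapshot,
# discarding changes made since. Without -r, zfs rollback only
# accepts the most recent snapshot of the dataset.
roll_back() {
    run zfs rollback "$1@$2"
}

# Example (hypothetical names):
#   take_snapshot tank/customer-disk before-cleanup
#   roll_back     tank/customer-disk before-cleanup
```

A snapshot lives in the same pool as the data it protects, which is why it helps against mistakes but is not a substitute for a backup stored elsewhere.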

But contractually, we don’t provide a backup product for customers. That may not have been explained clearly enough in our V5 documentation.

Throughout the incident, Gandi made every effort to communicate as events unfolded, in order to keep its customers informed.

The status page was updated at each stage and shared via our gandinoc, gandi_net, and gandibar Twitter accounts.

At the same time, a situation update was published on our blog news.gandi.net on January 9, then updated on January 10 and 13.

We have identified areas for improvement internally in order to be even more fluid and responsive in near-real time.

2. Technical timeline

The storage unit affected uses ZFS on Nexenta (Solaris/Illumos kernel).

– January 8 14:51 UTC: One of our storage units hosted at LU-BI1 goes down.

– January 8 14:52 UTC: We engage our failover procedure.

– January 8 15:00 UTC: The usual procedure does not restore service. The pool is FAULTED, indicating metadata corruption. We investigate.

– January 8 16:00 UTC: The problem may be related to a hardware problem. A team is sent to the facility.

– January 8 17:00 UTC: We change the hardware: no improvement.

– January 8 17:15 UTC:

zpool import -f <pool> doesn’t work.
We try zpool import -fFX <pool> to let ZFS search successive transaction groups (txg) for a good one, meaning a coherent state of the pool.

– January 8 17:30 UTC: zpool import -fFX <pool> is running, but slowly. At this rate, it will take days.

– January 8 17:45 UTC: The US team is online and takes a fresh look at the problem.

– January 8 19:00 UTC: We decide to stop the import to see if there is a way to speed it up.

– January 8 20:00 UTC: We don’t find a solution. We re-run the import.
As disks are read at 3M/s, we estimate the duration of the operation to be up to 370 hours.
We continue to try and find solutions.

– January 8 22:35 UTC: We have no guarantee that we will be able to restore the data, nor any reliable estimate of how long the process will take. We choose to warn customers that they should use backups if they have them.
The import with rewind option is still running.

At the same time, we dig through the available documentation and source code repositories.
We find that we can change some parameters related to zpool import -fFX <pool>, so we decide to use mdb to change some values related to spa_load_verify*.
But our version of ZFS is too old and its code does not implement those capabilities.
We try to find the right txg manually, but that does not solve the long scan time of the pool.

– January 9 15:00 UTC: We decide to use a recent version of ZFS on Linux (ZOL). A team goes to the facility, where we already have a server configured to use ZOL. We prepare to swap the server handling the JBOD with the one running ZOL.

– January 9 16:00 UTC: We start the import using zpool import -fFX <pool>, now with the option of skipping the full scan of the pool:

echo 0 > /sys/module/zfs/parameters/spa_load_verify_metadata

To speed up the scrub, we modify the following variable:

echo 2000 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active

and modify other variables:

/sys/module/zfs/parameters/zfs_vdev_async_read_max_active

/sys/module/zfs/parameters/zfs_vdev_queue_depth_pct

/sys/module/zfs/parameters/zfs_scan_mem_lim_soft_fact

/sys/module/zfs/parameters/zfs_scan_vdev_limit

– January 9 17:30 UTC: The import is done in “read-only” mode so as not to alter the data retrieved. But in that mode we can’t take any snapshots, so we redo the import without the “read-only” option.

– January 9 20:00 UTC: The second import is done. We take a global snapshot of the pool, which we copy to another storage unit for safekeeping.

– January 9 20:30 UTC: We encounter some errors during the copying of the data, so we have to proceed manually.

– January 9 21:30 UTC: We transfer the snapshot with a script. We estimate it will take 33 hours.
During the night, the US team monitors the transfer and restarts it when needed.

– January 10 11:15 UTC: We have transferred half of the data.

– January 10 16:00 UTC: Still transferring but it is slower than expected.

– January 10 23:00 UTC: The transfer is done, but it missed a lot of snapshots.
The dependencies between snapshots and their origins prevented many of the transfers. We need to delete a lot of destination targets in order to retransfer them.

– From January 11 until January 12 13:00 UTC: The data transfer is ongoing.

– January 12 13:00 UTC: Manual transfer is done.

– January 12 13:15 UTC: We launch an integrity check on the pool.

– January 12 20:25 UTC: Integrity check is ok.

– January 12 23:00 UTC: Everything is almost ready to bring the data online. We page the infrastructure, hosting, and customer care teams to be ready for 07:00 UTC on January 13.

– January 13 07:00 UTC: We begin the procedure of bringing the data back online.

– January 13 08:30 UTC: Data are online.

– January 13 09:30 UTC: We start PaaS instances.
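
The dependency problem met during the transfer (January 10 23:00 UTC) can be illustrated with a short sketch. It is not the script used during the incident. It assumes dataset names without whitespace, and the helper name order_for_transfer is hypothetical. It reads the output of zfs list -H -o name,origin and orders the datasets so that every clone comes after the dataset holding its origin snapshot, using the standard tsort utility.

```shell
#!/bin/sh
# order_for_transfer: read "dataset<TAB>origin" lines, as printed by
#   zfs list -H -o name,origin -r <pool>
# and print dataset names so that each clone is listed after the
# dataset that holds its origin snapshot. Non-clones have "-" as origin.
order_for_transfer() {
    local tab name origin
    tab="$(printf '\t')"
    while IFS="$tab" read -r name origin; do
        if [ "$origin" = "-" ]; then
            # No dependency: a self-pair only registers the dataset.
            printf '%s %s\n' "$name" "$name"
        else
            # ${origin%@*} strips "@snapshot", leaving the parent dataset.
            printf '%s %s\n' "${origin%@*}" "$name"
        fi
    done | tsort
}

# Example (hypothetical pool name, run on the source storage unit):
#   zfs list -H -o name,origin -r tank | order_for_transfer
```

Sending datasets in this order means the origin of a clone already exists on the destination when the clone itself is received, which is the kind of ordering constraint that made the first transfer incomplete.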

3. Postmortem: what happened?

Based on the logs and the commands run before the crash, we have ruled out a human cause.

We have no clear explanation of the problem, only theories.

We think it may be due to a hardware problem linked to the server RAM.

We acknowledge the main problem was the duration.

1) Why was a storage unit down?

Due to a software or hardware crash leading to metadata corruption and a prolonged interruption of services.

2) Why was a storage unit down for so long?

We were unable to import the pool.

3) Why was the import of the pool not possible?

Due to metadata corruption. The procedure to recover the data could not be completed in a short period of time.

4) Why was the import not possible in a short period of time?

The version of ZFS we are using on this filer does not implement the option to skip a full scan of the pool. And for safety reasons, we chose to duplicate the data.

5) Why was this option not implemented?

This option is not available in this version of ZFS.

6) Why is the version of ZFS on this filer too old?

Because the unit is part of the last batch of filers to be migrated to a newer version.
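
To put the duration in perspective, the two figures quoted in the timeline (a read rate of 3M/s and an estimate of up to 370 hours) can be cross-checked with simple arithmetic. The sketch below assumes that "3M/s" means about 3 megabytes per second for the scan as a whole. The helper names and the 100 MB/s comparison rate are hypothetical, and the result is only the volume implied by those two numbers, not a measured pool size.

```shell
#!/bin/sh
# implied_megabytes RATE_MB_PER_S HOURS: data read at that rate.
implied_megabytes() {
    echo $(( $1 * 3600 * $2 ))
}

# hours_to_read MEGABYTES RATE_MB_PER_S: integer hours, rounded down.
hours_to_read() {
    echo $(( $1 / ($2 * 3600) ))
}

# 3 MB/s for 370 hours: 3996000 MB, i.e. roughly 4 TB.
implied_megabytes 3 370
# The same volume at a hypothetical 100 MB/s: 11 hours.
hours_to_read 3996000 100
```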

Mid-term corrective action plan:

Upgrade the ZFS version of the remaining storage units as planned during the year 2020.
Accurately document the data recovery procedure in case of metadata corruption.
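
As a possible starting point for that documentation, the ZFS on Linux import steps from the timeline might be sketched as follows. This is an outline under stated assumptions, not a validated procedure. The function names, the ZFS_PARAMS override, and the ZPOOL override are hypothetical conveniences added so the sketch can be tried against a scratch directory. The parameter names and values are the ones quoted in the timeline.

```shell
#!/bin/sh
# Directory holding the ZFS on Linux module parameters.
# Overridable (assumption) so the sketch can be tried safely.
ZFS_PARAMS="${ZFS_PARAMS:-/sys/module/zfs/parameters}"

# set_param NAME VALUE: write a value to one module parameter.
set_param() {
    echo "$2" > "$ZFS_PARAMS/$1"
}

# rewind_import POOL: skip the metadata verification pass, raise the
# scrub I/O limit, then attempt a forced import with rewind (-F) and
# extreme rewind (-X). Rewinding discards the most recent
# transactions, so this is a last-resort recovery step.
rewind_import() {
    set_param spa_load_verify_metadata 0
    set_param zfs_vdev_scrub_max_active 2000
    ${ZPOOL:-zpool} import -fFX "$1"
}

# Example (hypothetical pool name, run as root on the ZOL server):
#   rewind_import tank
```

The timeline's first attempt was a read-only import. On current OpenZFS releases that is usually requested with -o readonly=on, but the exact option should be checked against the installed version before it is written into the procedure.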