Postmortem regarding the Network Incident from September 15, 2020 on IAAS and PAAS FR-SD3, FR-SD5, and FR-SD6

During scheduled maintenance aimed at replacing network equipment in the FR-SD5 datacenter, an error in the architecture and configuration caused two distinct service interruptions on PaaS and IaaS in the FR-SD3, FR-SD5, and FR-SD6 datacenters during the day.

First incident

09:22 UTC: Layer 2 network loop introduced
09:30 UTC: Network loop removed, network infrastructure stabilized

Until 13:26 UTC, some PaaS and IaaS instances were not functioning correctly.

Second incident

14:15 UTC: Second layer 2 network loop introduced
14:26 UTC: Second network loop removed

Consequences

This incident had two consequences:

During the network loops, access to services was disrupted
Some VMs saw their disk(s) go read-only to avoid data corruption, following the network loops. Because of this, the VMs won’t work correctly until a filesystem check has been performed.

Postmortem: what happened?

Why were hosting services unavailable, and side effects introduced, over the course of September 15?

A layer 2 network loop was introduced during a network maintenance.

Why was a network loop introduced?

Because of an incorrect assumption regarding the architecture that we have on our datacenters: FR-SD3, FR-SD5, and FR-SD6. The architecture design for these datacenters was conceived so that they would be independent.

They are not supposed to have any links between them. In fact, having an extended layer 2 network between multiple datacenters is risky: problems with the layer 2 can be amplified, with larger impacts, and it’s also much easier to introduce network loops. Thus, during our scheduled maintenance, the replacement of the aforementioned equipment by technical teams completed the extended layer 2 and created a loop.

Before maintenance: ok

During maintenance: nok

After temporary fix: ok

After customer migration (target architecture)

Why, then, was there an extended layer 2 network between these datacenters?

During the migration of our old datacenter (FR-SD2), some design constraints forced us to establish this extended layer 2 to allow for IaaS and PaaS instances to contact their network gateways.

Why is this extended layer 2 network still there?

Because we haven’t finished centralizing customers’ instances on a single datacenter.

How did you neglect the importance of the extended layer 2 network between these datacenters?

Preparation for the maintenance was done mostly by looking at the architecture documentation, and we failed to pay as much attention as we should have to this temporary exception.

Remediation

Short term, to finish the network maintenance

Completion of the migration while taking the restrictions of the extended L2 network into account.

Mid term

Migration of the instances still using this extended L2 network so it can be removed cleanly.

Postmortem regarding the Network Incident from September 15, 2020 on IAAS and PAAS FR-SD3, FR-SD5, and FR-SD6

First incident

Second incident

Consequences

Postmortem: what happened?

Why were hosting services unavailable, and side effects introduced, over the course of September 15?

Why was a network loop introduced?

Why, then, was there an extended layer 2 network between these datacenters?

Why is this extended layer 2 network still there?

How did you neglect the importance of the extended layer 2 network between these datacenters?

Remediation

Short term, to finish the network maintenance

Mid term

You may be interested in the following articles

Postmortem: October 10, 2021 incident

Gandi’s email service impacted by “UCEPROTECT” blacklisting

Price increase on .io on January 21, 2021

Should I put a dash in my domain name?

Subscribe to our newsletter