Today (January 4th 2011), one of our routers went offline. This led to the partial and temporary loss of our network, impacting some of our services such as our website, SiteMaker, GandiBlogs, some email accounts, and all operations towards servers. Domain names themselves remained available, though some network paths to certain servers were not.
The incident is currently being resolved, and services will progressively return to normal.
Please accept our apologies for the inconvenience.
UPDATE: Here is the technical explanation for yesterday’s network incident:
Part of the Gandi France network is based on legacy topologies built over the past ten years, including multi-site spans for various VLANs and in some cases a relatively flat architecture. This part of the architecture relies, perhaps unwisely, on the spanning-tree protocol to ensure a loop-free layer-2 topology in a bridged or switched network. Whilst we have been performing various engineering works over the past 18 months to simplify the architecture, it takes a considerable amount of time to dismantle, without causing significant outages of the Gandi services, what has been built piece by piece over a period of ten years.
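To illustrate what "loop-free" means here, the following is a minimal sketch of the idea behind a spanning tree: starting from a root switch, keep exactly one path to every other switch and block the redundant links. It is a simplification and not the real protocol, which elects the root from bridge IDs and compares path costs using BPDUs. The switch names and the `spanning_tree` function are hypothetical.

```python
from collections import deque


def spanning_tree(links, root):
    """Return (forwarding, blocked) link sets for a simplified loop-free topology.

    Simplification: a plain breadth-first search from a given root, with no
    bridge IDs, port costs or BPDU exchange as in the real protocol.
    """
    # Build an adjacency map from the undirected link list.
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    forwarding = set()
    visited = {root}
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for peer in sorted(neighbours.get(switch, [])):
            if peer not in visited:
                # First time we reach this switch: keep the link we used.
                visited.add(peer)
                forwarding.add(frozenset((switch, peer)))
                queue.append(peer)

    # Every link that is not part of the tree is redundant and gets blocked.
    all_links = {frozenset(link) for link in links}
    return forwarding, all_links - forwarding


# Hypothetical three-switch triangle: one link must be blocked to avoid a loop.
links = [("core1", "core2"), ("core1", "access1"), ("core2", "access1")]
forwarding, blocked = spanning_tree(links, root="core1")
print(len(forwarding), "links forwarding;", len(blocked), "link blocked")
```

In this example the two links attached to `core1` stay active and the `core2`/`access1` link is blocked, so every switch remains reachable without any loop.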
Yesterday's incident was caused by a fault in a downstream access switch cluster, which created a layer-2 loop in the architecture, and was exacerbated by the legacy elements of the Gandi France network infrastructure. As a result, the layer-2 topology of the legacy network was constantly being recalculated: the spanning-tree protocol failed to converge, consumed 100% of the resources on the affected switches, and thus prevented traffic from flowing. The offending switch cluster was isolated from the network, but we also had to reload another switch in another datacentre to stop the “snowball” effect caused by the fault.
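As a general illustration of why a layer-2 loop is so harmful, the sketch below floods a single broadcast frame through a small switched topology, with and without a loop. It is a toy model with hypothetical switch names and a hypothetical `flood` function, not a reconstruction of the actual fault: it only shows that, when no port is blocked, copies of a frame keep circulating around a loop instead of dying out.

```python
def flood(links, source, max_hops):
    """Count the frame copies in flight at each hop when switches flood a broadcast."""
    # Build an adjacency map from the undirected link list.
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    # Each in-flight copy is (current switch, switch it arrived from).
    in_flight = [(source, None)]
    history = []
    for _ in range(max_hops):
        next_hop = []
        for switch, came_from in in_flight:
            for peer in neighbours.get(switch, []):
                if peer != came_from:
                    # Flood out of every port except the one the frame came in on.
                    next_hop.append((peer, switch))
        history.append(len(next_hop))
        in_flight = next_hop
        if not in_flight:
            break  # no copies left: the broadcast has died out
    return history


looped = [("sw-a", "sw-b"), ("sw-b", "sw-c"), ("sw-a", "sw-c")]
loop_free = [("sw-a", "sw-b"), ("sw-a", "sw-c")]

print(flood(looped, "sw-a", max_hops=5))     # [2, 2, 2, 2, 2]
print(flood(loop_free, "sw-a", max_hops=5))  # [2, 0]
```

With the loop in place the copies never disappear, however many hops are simulated; once the redundant link is removed, the frame reaches every switch and then stops.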
We have already scheduled significant network engineering activities for this quarter to finally unpick the remainder of the legacy topology and migrate to a fully hierarchical model, limiting the layer-2 domains to locally contained subnets and minimising the reliance upon protocols such as spanning-tree, which was never designed to be used in such large-scale designs in the first place. We will be communicating the dates and times of the maintenance windows over the coming weeks.
We apologise again for any inconvenience caused by yesterday's network incident.
( * spanning-tree protocol: http://en.wikipedia.org/wiki/Spanning_tree_protocol )