The Gandi Community

Simple Hosting incident

We experienced a hardware fault on routing equipment on the Simple Hosting platform.
Below is a chronology of the various events:
– 20:06 UTC : CPU load on the equipment shows a significant increase.
– 20:06 UTC : Equipment is running at 100% CPU for no apparent reason and no longer responds to commands.
– 20:08 UTC : We made the decision to migrate to secondary equipment.
– 20:08 UTC : The secondary equipment exhibits the same symptoms as the primary, so traffic is not transferred.
– 20:09 UTC : Debugging underway to ascertain the cause of the problem.
– 20:26 UTC : Migration to the now-stabilised secondary equipment.
– 20:27 UTC : Service returned to nominal operation.
– 22:42 UTC : Following this incident, there was a secondary effect on DNS resolution: Simple Hosting instances had been failing to resolve DNS since 20:06 UTC. The problem is now resolved.
– The network equipment used as gateways for this service is visibly showing signs of weakness. An in-depth analysis of the anomaly and of the primary unit's behaviour is underway; a memory fault is the likely cause. For the moment, we are running on the secondary gateway.
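
For readers who want the impact windows as durations, the timestamps in the chronology above can be turned into minutes with a short script. This is only an illustrative sketch: the function and variable names are hypothetical, all times are assumed to fall on the same UTC day, and 22:42 is assumed to be the moment DNS resolution was restored (the chronology only says the problem was resolved by then).

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Timestamps taken from the chronology (UTC).
INCIDENT_START = "20:06"    # CPU load increase first observed
SERVICE_RESTORED = "20:27"  # service returned to nominal operation
DNS_RESTORED = "22:42"      # assumption: DNS fixed at the time of this entry

routing_minutes = minutes_between(INCIDENT_START, SERVICE_RESTORED)
dns_minutes = minutes_between(INCIDENT_START, DNS_RESTORED)

print(f"Routing impact: {routing_minutes} min")
print(f"DNS resolution impact: up to {dns_minutes} min")
```

Under these assumptions, the routing impact comes to 21 minutes and the DNS resolution impact to at most 156 minutes (2 h 36 min).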