Incident report: abc DNS cluster

Sep 9, 2019 - in Cloud

On September 7 and 8, 2019, we faced two incidents with our DNS platform {a,b,c}.dns.gandi.net.

To sum up

  • A DDoS caused network instability in our Paris facilities.
  • The isolation of the Paris DNS clusters created a blackhole for DNS queries to {a,b,c}.dns.gandi.net coming from our hosting network.
  • A monitoring probe was misinterpreted and then ignored, which delayed detection of the problem.

Actions to be taken:

  • Modify our network in Paris to avoid the instability problem
  • Automate network isolation
  • Review our incident management procedure/training

Timeline

  • Sept 7 20:03 UTC: We began receiving alerts from our external monitoring about a problem with c.dns.gandi.net: the cluster had not answered any queries for a few seconds. a.dns.gandi.net and b.dns.gandi.net were still operational.
  • Sept 7 20:09 UTC: The on-duty Gandi Ops engineer was on deck and called for reinforcements, following the procedure for incidents with potential customer impact.
  • Sept 7 20:13 UTC: Reinforcements were connected.
  • Sept 7 20:18 UTC: After analysis, we determined that we were facing a DDoS. No alarm about unusual traffic had been raised, and our DDoS protection tools had not detected anything in particular.
  • Sept 7 20:19 UTC: We were facing a noisy monitoring environment due to network congestion in the Paris datacenter, since the DDoS was having side effects on other internal components of our infrastructure. We started looking for a potential target in our network in order to mitigate the attack. Since we were not able to find the root cause and the network was not stable, we decided to stop the BGP announcements for {a,b,c} from our Paris datacenters.
  • Sept 7 20:42 UTC: Network Ops were called to monitor the isolation.
  • Sept 7 20:45 UTC: Network Ops were connected and we began to stop BGP announcements for our Paris datacenters.
  • Sept 7 22:00 UTC: The situation was stable; the isolation had been more complicated and had taken more time than expected. At that point we had secured the {a,b,c} DNS cluster, but in the process we had triggered another incident on our Paris hosting network, where our IaaS and PaaS services are hosted (we were not aware of this at the time). All monitoring was green and DNS latencies from our external monitoring were good. However, there was a problem in the way we had isolated the DNS clusters: the Paris hosting network was still seeing internal BGP announcements from Paris but was not able to reach the DNS servers. We had created a blackhole for all DNS queries to {a,b,c}.dns.gandi.net coming from our Paris hosting network.
  • Sept 8 10:20 UTC: Thanks to feedback from customers on Twitter, Ops started investigating the DNS issue they reported.
  • Sept 8 10:28 UTC: The problem was confirmed and Network Ops were paged.
  • Sept 8 10:33 UTC: Network Ops were connected and started analyzing the issue.
  • Sept 8 10:37 UTC: We decided to revert the previous day’s mitigation.
  • Sept 8 10:53 UTC: Reverting did not work as expected: the attack resumed a couple of minutes later.
  • Sept 8 11:01 UTC: We called in more reinforcements on the Network Ops side.
  • Sept 8 11:32 UTC: We fixed the DNS resolution problem from our hosting network.
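
The blackhole in the Sept 7 22:00 entry stayed invisible because external monitoring was green while queries from the Paris hosting network went unanswered. One way to catch this class of problem is to run the same DNS probe from every network that depends on the servers and to report each vantage point separately. The following is a minimal sketch of that idea, not a description of Gandi's actual monitoring: the function names, the vantage point labels ("external", "paris-hosting") and the probed zone are all hypothetical. It uses only the Python standard library and counts any DNS response as "answered", since a blackhole shows up as no response at all.

```python
import socket
import struct


def build_query(zone: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query (QTYPE=SOA, QCLASS=IN) for `zone`."""
    # Header: ID, flags (all zero: standard query, no recursion), 1 question.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # Question name: length-prefixed labels terminated by a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in zone.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", 6, 1)


def probe(server: str, zone: str, timeout: float = 2.0) -> bool:
    """Return True if `server` sends back a DNS response over UDP."""
    query = build_query(zone)
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
    except OSError:  # timeout, unreachable network, name resolution failure
        return False
    # A valid reply echoes the query ID and has the QR (response) bit set.
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)


def summarize(results: dict) -> list:
    """Turn {vantage_point: {server: answered}} into one alert per failing vantage point."""
    alerts = []
    for vantage, by_server in sorted(results.items()):
        failed = sorted(s for s, ok in by_server.items() if not ok)
        if failed:
            alerts.append(f"{vantage}: no answer from {', '.join(failed)}")
    return alerts
```

In use, each vantage point would call `probe("a.dns.gandi.net", zone)` (and likewise for b and c) with a zone those servers are expected to serve, and a collector would pass the combined results to `summarize`. With the situation described above, the "external" vantage point would report nothing while the hypothetical "paris-hosting" one would raise an alert for all three servers.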