Why is DNS failover not recommended?

By ‘DNS failover’ I take it you mean DNS round robin combined with some monitoring, i.e. publishing multiple IP addresses for a DNS hostname and removing a dead address when monitoring detects that a server is down. This can be workable for small, less-trafficked websites.
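To make the mechanism concrete, here is a minimal sketch of that scheme in Python. The pool, the addresses (drawn from the TEST-NET range) and the monitor hook are all invented for illustration; a real setup would publish the records in a zone and withdraw them via a DNS API.

```python
import random

# Hypothetical round-robin pool: the A records published for one hostname.
pool = {"192.0.2.10", "192.0.2.11", "192.0.2.12"}

def resolve():
    """A resolver hands the records back in varying order; most clients
    try the first one, which spreads traffic across the pool."""
    addrs = list(pool)
    random.shuffle(addrs)
    return addrs

def monitor_detects_down(addr):
    """When monitoring flags a server as dead, withdraw its record."""
    pool.discard(addr)

monitor_detects_down("192.0.2.11")
assert "192.0.2.11" not in resolve()  # new lookups no longer see it
```

The catch, covered below, is that withdrawing the record only affects *new* lookups; anyone holding a cached answer is untouched.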

By design, when you answer a DNS request you also provide a Time To Live (TTL) for the response you hand out. In other words, you’re telling other DNS servers and caches “you may store this answer and use it for x minutes before checking back with me”. The drawbacks come from this:

  • With DNS failover, an unknown percentage of your users will have your DNS data cached, with varying amounts of TTL left. Until that TTL expires, those users may keep connecting to the dead server. There are much faster ways of completing a failover.
  • Because of the above, you’re inclined to set the TTL quite low, say 5–10 minutes. But a higher TTL gives a (very small) performance benefit, and helps your DNS resolution keep working through short network glitches. So DNS-based failover pushes you toward low TTLs, even though high TTLs are a normal part of DNS and can be genuinely useful.
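A back-of-the-envelope model (my own illustration, not something from a standard) shows why the cached-TTL problem bites: if lookups are spread evenly in time, then t minutes after you withdraw a dead server’s record, roughly max(0, 1 − t/TTL) of the cached answers are still unexpired and may still point at the dead box.

```python
def stale_fraction(ttl_minutes, minutes_since_failover):
    """Rough fraction of caches still holding the old record, assuming
    lookups (and hence cache expiry times) are uniformly spread."""
    remaining = ttl_minutes - minutes_since_failover
    return max(0.0, remaining / ttl_minutes)

# With a 5-minute TTL, two minutes after failover about 60% of caches
# may still hold the dead address; with a 60-minute TTL, about 97%.
assert stale_fraction(5, 2) == 0.6
assert round(stale_fraction(60, 2), 2) == 0.97
```

This is the tension in the second bullet: only a very low TTL keeps the stale window short, and even then the failover is minutes, not seconds.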

The more common methods of getting good uptime involve:

  • Place servers together on the same LAN.
  • Place the LAN in a datacenter with highly available power and network planes.
  • Use an HTTP load balancer to spread load and fail over when an individual server fails.
  • Get the level of redundancy / expected uptime you require for your firewalls, load balancers and switches.
  • Have a communication strategy in place for full-datacenter failures, and for the occasional failure of a switch / database server / other resource that cannot easily be mirrored.
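To contrast with the DNS approach above, here is a sketch of the failover logic an HTTP load balancer applies: requests only ever go to backends whose last health probe passed, so a dead server drops out within one probe interval and no client-side cache is involved. Backend names and states are invented for illustration.

```python
# Result of the balancer's most recent health probe per backend
# (hypothetical names; a real balancer would issue HTTP checks).
last_probe_ok = {"app-1": True, "app-2": True, "app-3": True}

def pick_backend(request_id):
    """Route a request round-robin over the currently healthy backends."""
    healthy = sorted(name for name, ok in last_probe_ok.items() if ok)
    if not healthy:
        raise RuntimeError("no healthy backends")
    return healthy[request_id % len(healthy)]

last_probe_ok["app-2"] = False  # a probe fails; app-2 is taken out
assert pick_backend(0) == "app-1"
assert pick_backend(1) == "app-3"  # traffic fails over in seconds
```

Because the balancer sits in the request path, failover takes effect immediately for every client, with no TTL to wait out.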

A very small minority of web sites use multi-datacenter setups, with ‘geo-balancing’ between datacenters.
