DNS Failover Design Patterns - Health Checks, TTLs, and Multi-Provider Resilience
Learn practical DNS failover patterns for active-passive, active-active, regional failover, multi-provider DNS, and disaster recovery.
TL;DR
DNS failover changes answers or delegation when an endpoint, region, or provider fails. The main patterns are active-passive records, active-active regional pools, GeoDNS fallback, secondary DNS/multi-provider delegation, and manual disaster-recovery cutovers. DNS failover is constrained by TTLs and resolver caching, so it is not instant. Design health checks carefully, keep TTLs realistic, avoid flapping, and pair DNS failover with application retries.
What you'll learn
- Choose the right DNS failover pattern for an application
- Understand TTL and resolver-cache constraints
- Design health checks that avoid false failovers
- Combine DNS failover with secondary DNS and application resilience
DNS failover is the practice of changing DNS behaviour when something breaks: an endpoint, a region, a network, or an entire DNS provider.
It is powerful because every internet client already uses DNS. It is limited because DNS is cached. A correct design respects both facts.
What DNS Failover Can and Cannot Do
DNS failover can:
- Move new resolver lookups away from unhealthy endpoints
- Route regions to fallback regions
- Support disaster-recovery cutovers
- Keep DNS available across providers with secondary DNS
- Reduce manual incident work when health checks are reliable
DNS failover cannot:
- Instantly move every active user
- Override cached answers before TTL expiry
- Make unsafe application failover safe
- Replace database replication or session strategy
- Replace per-request load balancing
Pattern 1: Active-Passive Endpoint Failover
One endpoint serves traffic. Another is standby.
api.example.com. 60 IN CNAME api-primary.example.net.
On failure:
api.example.com. 60 IN CNAME api-standby.example.net.
Use when:
- The standby can serve the same traffic
- Recovery time of minutes is acceptable
- Writes are replicated or paused safely
Risks:
- Cached primary answers continue until TTL expiry
- Standby may be cold or under-tested
- Split-brain writes if the primary recovers while some clients still hold cached primary answers and others already use the standby
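As a minimal sketch of the active-passive decision, the function below picks which CNAME should be published for a given health state. The record names come from the example above; the `CnameRecord` type and `desired_record` helper are hypothetical, and pushing the result to a DNS provider is deliberately left out because that step depends on the provider's API.

```python
from dataclasses import dataclass

# Targets taken from the example records; adjust for a real zone.
PRIMARY = "api-primary.example.net."
STANDBY = "api-standby.example.net."


@dataclass(frozen=True)
class CnameRecord:
    name: str
    ttl: int
    target: str

    def to_zone_line(self) -> str:
        """Render the record in zone-file presentation format."""
        return f"{self.name} {self.ttl} IN CNAME {self.target}"


def desired_record(primary_healthy: bool, ttl: int = 60) -> CnameRecord:
    """Return the record that should be published for the current health state."""
    target = PRIMARY if primary_healthy else STANDBY
    return CnameRecord("api.example.com.", ttl, target)


print(desired_record(primary_healthy=False).to_zone_line())
```

Keeping the decision separate from the publish step makes it easy to dry-run the failover logic in a test or runbook rehearsal.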
Pattern 2: Active-Active Regional Pools
Multiple regions serve traffic at the same time.
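api.example.com. 300 IN A 192.0.2.10
api.example.com. 300 IN A 198.51.100.10
A name can hold only one CNAME record, so a plain multi-answer pool uses address records (the addresses shown are documentation placeholders for the EU and US regions). Better implementations use GeoDNS or latency-based routing to return region-appropriate answers.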
Use when:
- Regions are independently healthy
- Data and sessions are region-safe
- You want lower latency and resilience
Risks:
- More application complexity
- Harder data consistency model
- Region-specific incidents affect only some users, which makes them harder to detect and diagnose
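One way to picture an active-active pool is as a filter over regional answers. The sketch below is illustrative only: the region keys and addresses are placeholders from the documentation ranges, and returning the full pool when every region looks unhealthy ("fail open") is an assumed policy, not something every provider does.

```python
# Hypothetical pool: region label -> address (documentation-range placeholders).
POOL = {
    "eu": "192.0.2.10",
    "us": "198.51.100.10",
}


def pool_answers(pool: dict[str, str], healthy: set[str]) -> list[str]:
    """Return addresses for healthy regions.

    If no region is reported healthy, return the whole pool. This assumes a
    total failure report is more likely a monitoring problem than a real
    simultaneous outage.
    """
    answers = [addr for region, addr in pool.items() if region in healthy]
    return answers if answers else list(pool.values())


print(pool_answers(POOL, {"eu"}))
print(pool_answers(POOL, set()))
```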
Pattern 3: GeoDNS Fallback
Regional users normally get regional endpoints, with fallback rules.
| User region | Normal answer | Fallback |
|---|---|---|
| EU | api-eu.example.com | api-us.example.com |
| US | api-us.example.com | api-eu.example.com |
| APAC | api-apac.example.com | api-us.example.com |
Use when:
- Regional latency matters
- Regional outages should drain to another region
- Compliance allows the fallback path
Compliance-sensitive systems need explicit fallback rules. "Fail EU to US" may be unacceptable for some data classes.
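The fallback table and the compliance caveat can be combined into one lookup. This is a sketch under assumptions: the `NO_CROSS_REGION_FALLBACK` policy set is hypothetical, and returning `None` simply hands the decision back to the caller (for example, keep the normal answer and page a human).

```python
from typing import Optional

# (normal answer, fallback answer) per user region, as in the table above.
ROUTES = {
    "EU": ("api-eu.example.com", "api-us.example.com"),
    "US": ("api-us.example.com", "api-eu.example.com"),
    "APAC": ("api-apac.example.com", "api-us.example.com"),
}

# Hypothetical policy: regions whose traffic must not be sent elsewhere.
NO_CROSS_REGION_FALLBACK = {"EU"}


def answer_for(region: str, healthy_endpoints: set[str]) -> Optional[str]:
    """Pick the endpoint for a user region, honouring the fallback policy."""
    normal, fallback = ROUTES[region]
    if normal in healthy_endpoints:
        return normal
    if region in NO_CROSS_REGION_FALLBACK:
        return None  # no compliant fallback exists; caller must decide
    if fallback in healthy_endpoints:
        return fallback
    return None  # both normal and fallback endpoints are unhealthy


print(answer_for("US", {"api-eu.example.com"}))
print(answer_for("EU", {"api-us.example.com"}))
```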
Pattern 4: Multi-Provider DNS
List nameservers from more than one DNS provider at the registrar. One provider is primary; another acts as secondary and pulls the zone from it via AXFR/IXFR zone transfers.
example.com. NS ns1.dnscale.eu.
example.com. NS ns2.other-provider.net.
Use when:
- DNS-provider outage is a business risk
- You need independent authoritative networks
- You can keep zone data synchronized
This pattern is covered in Primary DNS vs Secondary DNS and Multi-provider DNS deployment.
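Keeping zone data synchronized is usually verified by comparing the SOA serial each nameserver reports. The sketch below assumes the serials have already been collected (for example with a SOA query against each nameserver); the serial values are made up, and the comparison ignores serial-number wraparound for simplicity.

```python
def out_of_sync(serials: dict[str, int]) -> list[str]:
    """Return nameservers whose SOA serial differs from the highest serial seen.

    Simplification: treats serials as plain integers and does not handle
    32-bit serial wraparound.
    """
    newest = max(serials.values())
    return sorted(ns for ns, serial in serials.items() if serial != newest)


# Hypothetical observations, one per listed nameserver.
observed = {
    "ns1.dnscale.eu.": 2024051502,
    "ns2.other-provider.net.": 2024051501,
}
print(out_of_sync(observed))
```

A secondary that stays behind for longer than the zone's refresh interval is worth an alert, since it may be serving stale failover records.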
Pattern 5: Manual Disaster-Recovery Cutover
Some systems should not auto-fail over. A manual DNS cutover may be safer.
Use manual cutover when:
- Data recovery point matters more than speed
- Failover can cause split-brain writes
- Human validation is required
- Legal/compliance review is needed before moving regions
Prepare the DNS pieces before the incident:
- Low-enough TTLs on DR names
- Standby records pre-created
- Runbook with exact commands
- Access to registrar and DNS provider
- Rollback plan
Health Check Design
Bad health checks cause bad failovers.
Check:
- DNS target resolves
- TCP/TLS connection works
- Certificate is valid and not expired
- HTTP status is expected
- A lightweight dependency path works
- Response latency is below a threshold
Avoid:
- ICMP-only checks for web services
- Deep checks that fail during harmless dependency noise
- Probing from a single location
- Failing over or recovering without hysteresis
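The checklist above can be sketched as a per-probe pass/fail rule plus a quorum across probe locations. Everything here is an assumption for illustration: the `ProbeResult` fields, the expected status of 200, and the latency and certificate thresholds. Collecting the probe data is out of scope for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    location: str
    resolved: bool        # DNS target resolved
    tls_ok: bool          # TCP/TLS handshake succeeded
    cert_days_left: int   # days until certificate expiry
    http_status: int
    latency_ms: float


def probe_passes(p: ProbeResult, max_latency_ms: float = 800.0,
                 min_cert_days: int = 1) -> bool:
    """A probe passes only if every layer of the check succeeded."""
    return (p.resolved and p.tls_ok
            and p.cert_days_left >= min_cert_days
            and p.http_status == 200          # assumed expected status
            and p.latency_ms <= max_latency_ms)


def endpoint_unhealthy(results: list[ProbeResult], quorum: int = 2) -> bool:
    """Declare the endpoint unhealthy only if `quorum` or more locations fail."""
    failures = sum(1 for p in results if not probe_passes(p))
    return failures >= quorum


results = [
    ProbeResult("fra", True, True, 40, 200, 120.0),
    ProbeResult("nyc", True, True, 40, 503, 95.0),
    ProbeResult("sin", True, True, 40, 200, 310.0),
]
print(endpoint_unhealthy(results))  # one failing location is below the quorum
```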
TTL Strategy
| Record type | Suggested TTL |
|---|---|
| Active failover alias | 60-300 seconds |
| Regional routing records | 300 seconds |
| Stable MX records | 1800-3600 seconds |
| NS delegation at registrar | Set by the parent zone, not by you; plan for hours to days |
Remember: NS delegation changes are slower than record changes because parent-zone and resolver caches are involved. For fast application failover, change records inside the already-delegated zone, not registrar delegation.
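A rough way to reason about TTL choices is to add up the stages a failover passes through. The sketch below is an upper-bound estimate under simple assumptions: detection takes at most one check interval per required failure, `publish_delay` is a hypothetical figure for how long the provider takes to serve the new answer, and resolvers honour the TTL (resolvers that clamp TTLs upward would extend the window).

```python
def worst_case_failover_seconds(check_interval: int, failures_required: int,
                                publish_delay: int, ttl: int) -> int:
    """Estimate the longest time a client may keep hitting the failed endpoint."""
    detection = check_interval * failures_required
    return detection + publish_delay + ttl


# 30 s checks, 3 consecutive failures, 5 s to publish (assumed values).
print(worst_case_failover_seconds(30, 3, 5, ttl=60))   # 155
print(worst_case_failover_seconds(30, 3, 5, ttl=300))  # 395
```

With these numbers, detection time rather than TTL dominates at a 60-second TTL, which is one reason very low TTLs often buy less than expected.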
Anti-Flapping Controls
Failover systems need dampening:
- Require multiple failed checks before removal
- Require sustained recovery before re-adding
- Set minimum time between state changes
- Use weighted ramp-up after recovery
- Alert humans on every automatic failover
Flapping is worse than a clean outage because it creates inconsistent client behaviour.
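The first three controls can be sketched as a small state machine. This is an illustrative sketch, not a provider feature: the `Dampener` class, its thresholds, and the hold time are all assumed values, and weighted ramp-up and alerting are left out.

```python
from dataclasses import dataclass


@dataclass
class Dampener:
    fail_threshold: int = 3        # consecutive failures before removal
    recover_threshold: int = 5     # consecutive successes before re-adding
    min_hold_seconds: float = 120.0
    in_service: bool = True
    _streak: int = 0
    _last_change: float = float("-inf")

    def observe(self, check_ok: bool, now: float) -> bool:
        """Feed one check result; return True if the in-service state changed."""
        if check_ok == self.in_service:
            self._streak = 0       # result agrees with current state
            return False
        self._streak += 1          # consecutive results that disagree
        needed = self.recover_threshold if check_ok else self.fail_threshold
        if self._streak < needed:
            return False
        if now - self._last_change < self.min_hold_seconds:
            return False           # too soon after the previous change
        self.in_service = check_ok
        self._streak = 0
        self._last_change = now
        return True


d = Dampener()
# Three consecutive failures at 30 s intervals remove the endpoint.
print([d.observe(False, t) for t in (0, 30, 60)])  # [False, False, True]
```

A single successful check after removal does not bring the endpoint back; it takes `recover_threshold` consecutive successes and an elapsed hold time.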
Frequently asked questions
- Is DNS failover instant?
  No. Authoritative DNS can change an answer quickly, but recursive resolvers cache old answers until TTL expiry, and some clients cache too. Low TTLs reduce the window but do not eliminate it.
- What TTL should I use for DNS failover?
  For active failover records, 60-300 seconds is common. Lower TTLs increase authoritative query volume and may be clamped by resolvers. Use 300 seconds unless you have a clear need for faster failover and monitoring for the extra query load.
- What should health checks test?
  Test the user-visible service, not just ping. For a web API, check HTTPS, certificate validity, response code, and a lightweight dependency path. For mail, check SMTP readiness and MX target reachability. Avoid checks that are so deep they fail during minor dependency blips.
- Can DNS failover replace a load balancer?
  No. DNS failover is coarse and cached. Load balancers make per-request decisions and can remove backends immediately. DNS failover is useful for region, provider, and disaster-recovery switching, not single-request balancing.
- How does secondary DNS fit into failover?
  Secondary DNS keeps authoritative service available if one DNS provider fails. It does not automatically fail over your application endpoints unless the zone data or policies also change. Use secondary DNS for DNS-provider resilience and record failover for application resilience.
Related guides
What is an Anycast DNS Network?
Learn how anycast networking works, why it matters for DNS, and how it delivers low-latency, resilient name resolution worldwide.
Anycast DNS vs Unicast DNS - Which Is Better for Your Domain?
Compare anycast and unicast DNS routing to understand which approach delivers better performance, resilience, and DDoS protection for your domain.
What is Anycast DNS? A Plain-Language Guide
Anycast DNS explained from the ground up: what it is, why it matters, how BGP routing makes one IP reachable from many places, and why every modern DNS provider runs it.
DNS Network Performance Monitoring
How DNScale measures real-time DNS response times from independent RIPE Atlas probes across backbone and last-mile networks worldwide.