    DNS Failover Design Patterns - Health Checks, TTLs, and Multi-Provider Resilience

    Learn practical DNS failover patterns for active-passive, active-active, regional failover, multi-provider DNS, and disaster recovery.

    TL;DR

    DNS failover changes answers or delegation when an endpoint, region, or provider fails. The main patterns are active-passive records, active-active regional pools, GeoDNS fallback, secondary DNS/multi-provider delegation, and manual disaster-recovery cutovers. DNS failover is constrained by TTLs and resolver caching, so it is not instant. Design health checks carefully, keep TTLs realistic, avoid flapping, and pair DNS failover with application retries.

    What you'll learn

    • Choose the right DNS failover pattern for an application
    • Understand TTL and resolver-cache constraints
    • Design health checks that avoid false failovers
    • Combine DNS failover with secondary DNS and application resilience

    DNS failover is the practice of changing DNS behaviour when something breaks: an endpoint, a region, a network, or an entire DNS provider.

    It is powerful because every internet client already uses DNS. It is limited because DNS is cached. A correct design respects both facts.

    What DNS Failover Can and Cannot Do

    DNS failover can:

    • Move new resolver lookups away from unhealthy endpoints
    • Route regions to fallback regions
    • Support disaster-recovery cutovers
    • Keep DNS available across providers with secondary DNS
    • Reduce manual incident work when health checks are reliable

    DNS failover cannot:

    • Instantly move every active user
    • Override cached answers before TTL expiry
    • Make unsafe application failover safe
    • Replace database replication or session strategy
    • Replace per-request load balancing

    Pattern 1: Active-Passive Endpoint Failover

    One endpoint serves traffic. Another is standby.

    api.example.com.  60  IN  CNAME  api-primary.example.net.

    On failure:

    api.example.com.  60  IN  CNAME  api-standby.example.net.

    Use when:

    • The standby can serve the same traffic
    • Recovery time of minutes is acceptable
    • Writes are replicated or paused safely

    Risks:

    • Cached primary answers continue until TTL expiry
    • Standby may be cold or under-tested
    • Split-brain writes if the primary recovers while some resolvers still return the standby and others already return the primary
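
The switch between the two records above can be sketched as a small decision function. This is a minimal illustration, not a provider API: `next_target`, `zone_line`, and the `failback_approved` flag are hypothetical names, and holding traffic on the standby until an operator approves failback is an assumed safeguard against the mixed-client risk listed above.

```python
PRIMARY = "api-primary.example.net."
STANDBY = "api-standby.example.net."


def next_target(current: str, primary_healthy: bool,
                failback_approved: bool = False) -> str:
    """Return the CNAME target to publish for api.example.com."""
    if current == PRIMARY:
        # Fail over as soon as the (already dampened) health signal goes bad.
        return PRIMARY if primary_healthy else STANDBY
    # Already on standby: return only when the primary is healthy AND an
    # operator has approved failback (hypothetical safeguard, see lead-in).
    if primary_healthy and failback_approved:
        return PRIMARY
    return STANDBY


def zone_line(target: str, ttl: int = 60) -> str:
    """Render the record in the same zone-file form used above."""
    return f"api.example.com.  {ttl}  IN  CNAME  {target}"
```

Calling `zone_line(next_target(PRIMARY, primary_healthy=False))` yields the standby record shown under "On failure". Publishing that record to a DNS provider is deliberately left out, because it depends on the provider's own API.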

    Pattern 2: Active-Active Regional Pools

    Multiple regions serve traffic at the same time.

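    api.example.com.  300  IN  A  192.0.2.10     ; EU region
    api.example.com.  300  IN  A  198.51.100.10  ; US region

    A name can hold only one CNAME record, so a static pool has to publish address records (A/AAAA) for every region; the addresses above are documentation placeholders. Better implementations use GeoDNS or latency-based routing to return one region-appropriate answer per query, such as a CNAME to api-eu.example.com or api-us.example.com.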

    Use when:

    • Regions are independently healthy
    • Data and sessions are region-safe
    • You want lower latency and resilience

    Risks:

    • More application complexity
    • Harder data consistency model
    • Region-specific incidents affect only some users, which makes them easier to miss
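
One way to picture an active-active pool is as a filter over per-region health. The sketch below is illustrative only: `pool_answers` is a hypothetical helper, and returning the full pool when every region looks unhealthy ("fail open") is an assumed design choice, on the reasoning that a total failure is more likely a monitoring fault than a real outage of every region.

```python
from typing import Dict, List


def pool_answers(health: Dict[str, bool]) -> List[str]:
    """Return the regional endpoints that should currently be served.

    `health` maps an endpoint name to its latest health verdict.
    """
    healthy = sorted(name for name, ok in health.items() if ok)
    if healthy:
        return healthy
    # Assumption: fail open rather than publish an empty answer set.
    return sorted(health)
```

For example, `pool_answers({"api-eu.example.com.": True, "api-us.example.com.": False})` keeps only the EU endpoint in rotation.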

    Pattern 3: GeoDNS Fallback

    Regional users normally get regional endpoints, with fallback rules.

    User region    Normal answer           Fallback
    EU             api-eu.example.com      api-us.example.com
    US             api-us.example.com      api-eu.example.com
    APAC           api-apac.example.com    api-us.example.com

    Use when:

    • Regional latency matters
    • Regional outages should drain to another region
    • Compliance allows the fallback path

    Compliance-sensitive systems need explicit fallback rules. "Fail EU to US" may be unacceptable for some data classes.
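
The fallback table and the compliance caveat can be combined in one lookup. This is a sketch under stated assumptions: the `answer_for` function and the `fallback_allowed` set are hypothetical, and returning `None` stands for "no permitted healthy answer", which a real system would have to map to its own behaviour (for example a maintenance page).

```python
from typing import Optional, Set

# Normal and fallback answers, copied from the table above.
NORMAL = {
    "EU": "api-eu.example.com",
    "US": "api-us.example.com",
    "APAC": "api-apac.example.com",
}
FALLBACK = {
    "EU": "api-us.example.com",
    "US": "api-eu.example.com",
    "APAC": "api-us.example.com",
}


def answer_for(region: str, healthy: Set[str],
               fallback_allowed: Set[str]) -> Optional[str]:
    """Pick the answer for a user region, honouring explicit fallback rules."""
    normal = NORMAL[region]
    if normal in healthy:
        return normal
    fallback = FALLBACK[region]
    # Cross-region fallback must be explicitly permitted for this region.
    if region in fallback_allowed and fallback in healthy:
        return fallback
    return None
```

With `fallback_allowed={"US", "APAC"}`, an EU outage returns `None` for EU users instead of silently sending them to the US endpoint.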

    Pattern 4: Multi-Provider DNS

    List nameservers from more than one DNS provider in the delegation at the registrar. One provider hosts the primary zone; the other acts as a secondary and pulls zone data from it via AXFR/IXFR zone transfers.

    example.com.  NS  ns1.dnscale.eu.
    example.com.  NS  ns2.other-provider.net.

    Use when:

    • DNS-provider outage is a business risk
    • You need independent authoritative networks
    • You can keep zone data synchronized

    This pattern is covered in Primary DNS vs Secondary DNS and Multi-provider DNS deployment.
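
Keeping zone data synchronized is commonly verified by comparing the SOA serial that each provider serves. The helper below is a minimal sketch: `lagging_nameservers` is a hypothetical name, the serials are assumed to have been collected separately (for example with `dig @ns1.dnscale.eu example.com SOA +short`), and the plain numeric comparison ignores serial-number wraparound.

```python
from typing import Dict, List


def lagging_nameservers(serials: Dict[str, int]) -> List[str]:
    """Return nameservers whose SOA serial differs from the newest one seen."""
    if not serials:
        return []
    # Assumption: the highest serial is the current zone version
    # (no RFC 1982 wraparound handling in this sketch).
    newest = max(serials.values())
    return sorted(ns for ns, serial in serials.items() if serial != newest)
```

A non-empty result that persists beyond the expected transfer delay suggests that the secondary is serving stale data.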

    Pattern 5: Manual Disaster-Recovery Cutover

    Some systems should not auto-fail over. A manual DNS cutover may be safer.

    Use manual cutover when:

    • Data recovery point matters more than speed
    • Failover can cause split-brain writes
    • Human validation is required
    • Legal/compliance review is needed before moving regions

    Prepare the DNS pieces before the incident:

    • Low-enough TTLs on DR names
    • Standby records pre-created
    • Runbook with exact commands
    • Access to registrar and DNS provider
    • Rollback plan

    Health Check Design

    Bad health checks cause bad failovers.

    Check:

    • DNS target resolves
    • TCP/TLS connection works
    • Certificate is valid and not expired
    • HTTP status is expected
    • A lightweight dependency path works
    • Response latency is below a threshold

    Avoid:

    • ICMP-only checks for web services
    • Deep checks that fail during harmless dependency noise
    • Probing from a single location
    • Failing over or recovering without hysteresis
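
The checklist above can be expressed as a pass/fail rule per probe plus a quorum across probe locations. This sketch only evaluates results that some prober has already gathered. `ProbeResult`, `probe_passes`, `endpoint_unhealthy`, and all default thresholds are illustrative assumptions rather than recommended values.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeResult:
    location: str          # where the probe ran
    resolved: bool         # DNS target resolved
    tls_ok: bool           # TCP connect and TLS handshake succeeded
    cert_days_left: int    # days until certificate expiry
    http_status: int       # status code of the health endpoint
    latency_ms: float      # end-to-end response time


def probe_passes(r: ProbeResult, expected_status: int = 200,
                 max_latency_ms: float = 1000.0) -> bool:
    """A probe passes only if every layer of the check succeeded."""
    return (r.resolved
            and r.tls_ok
            and r.cert_days_left > 0
            and r.http_status == expected_status
            and r.latency_ms <= max_latency_ms)


def endpoint_unhealthy(results: Iterable[ProbeResult], quorum: int = 2) -> bool:
    """Declare failure only when at least `quorum` locations agree."""
    failures = sum(1 for r in results if not probe_passes(r))
    return failures >= quorum
```

Requiring agreement from more than one location keeps a single broken probe network from triggering a failover.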

    TTL Strategy

    Record type                   Suggested TTL
    Active failover alias         60-300 seconds
    Regional routing records      300 seconds
    Stable MX records             1800-3600 seconds
    NS delegation at registrar    Often controlled by parent zone; plan for hours

    Remember: NS delegation changes are slower than record changes because parent-zone and resolver caches are involved. For fast application failover, change records inside the already-delegated zone, not registrar delegation.
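
To see how TTL choice feeds into recovery time, it helps to add detection time to cache lifetime. The function below is a rough estimate under stated assumptions: the parameter names are hypothetical, resolvers are assumed to honour the published TTL, and client-side caching or TTL clamping would lengthen the real figure.

```python
def worst_case_failover_seconds(check_interval_s: float,
                                failures_required: int,
                                ttl_s: float,
                                publish_delay_s: float = 0.0) -> float:
    """Estimate the time until the last cached answer has expired.

    detection: consecutive failed checks needed before the record changes
    publish:   delay until authoritative servers serve the new record
    ttl:       a resolver may have cached the old answer just before that
    """
    detection = check_interval_s * failures_required
    return detection + publish_delay_s + ttl_s


if __name__ == "__main__":
    print(worst_case_failover_seconds(30, 3, 60))    # 150
    print(worst_case_failover_seconds(30, 3, 300))   # 390
```

With a 30-second check interval and three required failures, a 60-second TTL gives roughly two and a half minutes, while a 300-second TTL gives about six and a half.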

    Anti-Flapping Controls

    Failover systems need dampening:

    • Require multiple failed checks before removal
    • Require sustained recovery before re-adding
    • Set minimum time between state changes
    • Use weighted ramp-up after recovery
    • Alert humans on every automatic failover

    Flapping is worse than a clean outage because it creates inconsistent client behaviour.
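
The first three dampening controls can be sketched as a small state machine. This is an illustration under assumptions: the `Dampener` class, its thresholds, and the caller-supplied `now` timestamp are hypothetical, and weighted ramp-up and alerting are left out.

```python
from dataclasses import dataclass


@dataclass
class Dampener:
    fail_threshold: int = 3       # consecutive failures before removal
    recover_threshold: int = 5    # consecutive successes before re-adding
    min_hold_s: float = 120.0     # minimum time between state changes
    healthy: bool = True          # currently published state
    _streak: int = 0              # consecutive results contradicting `healthy`
    _last_change: float = float("-inf")

    def observe(self, check_ok: bool, now: float) -> bool:
        """Feed one check result; return the state that should be published."""
        if check_ok == self.healthy:
            # Result agrees with the published state: reset the streak.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.recover_threshold if check_ok else self.fail_threshold
        if self._streak >= needed and now - self._last_change >= self.min_hold_s:
            self.healthy = check_ok
            self._streak = 0
            self._last_change = now
        return self.healthy
```

A single failed check followed by a success leaves the published state unchanged, and a recovery that arrives inside the hold window is delayed until the window has passed.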

    Frequently asked questions

    Is DNS failover instant?
    No. Authoritative DNS can change an answer quickly, but recursive resolvers cache old answers until TTL expiry, and some clients cache too. Low TTLs reduce the window but do not eliminate it.

    What TTL should I use for DNS failover?
    For active failover records, 60-300 seconds is common. Lower TTLs increase authoritative query volume and may be clamped by resolvers. Use 300 seconds unless you have a clear need and monitoring for the extra query load.

    What should health checks test?
    Test the user-visible service, not just ping. For a web API, check HTTPS, certificate validity, response code, and a lightweight dependency path. For mail, check SMTP readiness and MX target reachability. Avoid checks that are so deep they fail during minor dependency blips.

    Can DNS failover replace a load balancer?
    No. DNS failover is coarse and cached. Load balancers make per-request decisions and can remove backends immediately. DNS failover is useful for region, provider, and disaster-recovery switching, not single-request balancing.

    How does secondary DNS fit into failover?
    Secondary DNS keeps authoritative service available if one DNS provider fails. It does not automatically fail over your application endpoints unless the zone data or policies also change. Use secondary DNS for DNS-provider resilience and record failover for application resilience.
