High TTL Causing Delayed DNS Failover
Symptoms
- After a server migration, some users still reach the old IP address for hours or days
- `dig example.com A +short` returns the new IP, but production traffic still hits the old server
- Old server logs show continued requests long after DNS was updated
- Connection refused or timeout errors for users whose local resolver cached the old IP
- Application metrics show split traffic between old and new servers during the 'propagation window'
Root Cause
- DNS record TTL was set to 86400 (24 hours) or higher, causing clients to cache the old IP
- ISP recursive resolvers caching beyond the stated TTL (non-compliant behavior)
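- Application-level caching that ignores the record TTL: the JVM caches successful lookups forever when a security manager is installed (otherwise for an implementation-specific period, 30 seconds in OpenJDK), and long-lived keep-alive or pooled connections in any runtime, Node.js included, keep using the IP they first connected to
- Local machine DNS cache (nscd, systemd-resolved) holding a stale entry, or a static /etc/hosts override that never expires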
- TTL was not lowered in advance of the planned migration to shorten the propagation window
Diagnosis
**Step 1 — Check the current TTL on the record**
```bash
# Query authoritative NS for current TTL:
NS=$(dig example.com NS +short | head -1)
dig @$NS example.com A
# 'example.com. 86400 IN A 1.2.3.4' → TTL is 86400 (24h)
# Query a public resolver for its cached TTL (decrements as cache ages):
dig @8.8.8.8 example.com A
# 'example.com. 43200 IN A 1.2.3.4' → 12 hours left in cache
```
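Raw TTL values in seconds are easy to misread at a glance. The following is a minimal sketch of two helpers for reading the output above; the function names `answer_ttls` and `humanize_ttl` are hypothetical, and the parser assumes the standard `name TTL IN A address` layout of dig's answer section.

```shell
#!/usr/bin/env bash
# Print the TTL (second column) of every A record found in dig
# answer-section text read from stdin.
answer_ttls() {
  awk '$3 == "IN" && $4 == "A" { print $2 }'
}

# Convert a TTL in seconds into an hours/minutes/seconds string.
humanize_ttl() {
  local s=$1
  printf '%dh%02dm%02ds\n' $(( s / 3600 )) $(( s % 3600 / 60 )) $(( s % 60 ))
}

# Offline example using the sample answer line from Step 1:
printf 'example.com. 86400 IN A 1.2.3.4\n' | answer_ttls   # prints 86400
humanize_ttl 43200                                         # prints 12h00m00s

# Against a live server (requires network):
#   dig +noall +answer @"$NS" example.com A | answer_ttls
```

Feeding the authoritative answer through `answer_ttls` gives the configured TTL; feeding a public resolver's answer gives the time left in that resolver's cache.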
**Step 2 — Identify which clients still have the old IP**
```bash
# From your application server, check what IP the DNS resolves to:
python3 -c "import socket; print(socket.gethostbyname('example.com'))"
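# Check the JVM's DNS cache policy (Java application). networkaddress.cache.ttl is a
# security property, not a system property, so read it from the java.security file.
# If unset, lookups are cached forever only when a security manager is installed.
# JDK 9+ path shown; JDK 8 uses $JAVA_HOME/jre/lib/security/java.security
grep -E '^#?networkaddress\.cache' "$JAVA_HOME/conf/security/java.security"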
```
**Step 3 — Check ISP resolver compliance**
```bash
# Probe multiple resolvers for the cached TTL:
for RESOLVER in 8.8.8.8 1.1.1.1 9.9.9.9 208.67.222.222; do
  printf '%s: ' "$RESOLVER"
  dig @"$RESOLVER" example.com A | grep -E '^example'
done
# Resolvers still returning old IP = their cache has not expired
```
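Reading the loop output by eye gets tedious with many resolvers. As a rough sketch, the helper below labels each resolver as updated or stale by comparing its answer to the expected new address. The name `report_resolver` and the address `203.0.113.10` are placeholders, not values from this runbook.

```shell
#!/usr/bin/env bash
# Report whether a resolver's answer contains the expected new IP.
# $1 = resolver label
# $2 = expected new IP
# $3 = whitespace-separated answers returned by that resolver
# Returns 0 when updated, 1 when stale or empty.
report_resolver() {
  local resolver=$1 expected=$2 answers=$3 ip
  for ip in $answers; do
    if [ "$ip" = "$expected" ]; then
      echo "$resolver: updated"
      return 0
    fi
  done
  echo "$resolver: stale (${answers:-no answer})"
  return 1
}

# Against live resolvers (requires network; NEW_IP is a placeholder):
#   NEW_IP=203.0.113.10
#   for r in 8.8.8.8 1.1.1.1 9.9.9.9 208.67.222.222; do
#     report_resolver "$r" "$NEW_IP" "$(dig @"$r" example.com A +short)"
#   done
```

The non-zero return status for a stale resolver makes it possible to count stragglers or retry in a loop until every resolver reports the new address.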
**Step 4 — Check local machine and application caches**
```bash
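# macOS: dump cached host entries (recent releases may print nothing), then resolve via the system resolver:
sudo dscacheutil -cachedump -entries Host
dscacheutil -q host -a name example.com
# Linux (systemd-resolved; older systems use systemd-resolve instead of resolvectl):
resolvectl statistics
resolvectl query example.com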
# Check /etc/hosts for a hardcoded old entry:
grep example.com /etc/hosts
```
**Step 5 — Estimate total propagation time**
```bash
# Maximum propagation = TTL the old record had before the change
# Remaining propagation = old TTL - time elapsed since the update
# Estimate when the record was changed from the SOA serial:
dig example.com SOA +short
# Third field is the serial (it encodes the date only if the zone follows the YYYYMMDDNN convention)
```
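The arithmetic in the comments above can be worked through with a small helper. This is a sketch only: `remaining_window` is a hypothetical name, and it assumes resolvers honor the TTL, so non-compliant ISP caches can exceed the result.

```shell
#!/usr/bin/env bash
# Seconds until TTL-compliant caches must have dropped the old record.
# $1 = TTL (seconds) the record had BEFORE the change
# $2 = epoch time the change went live on the authoritative servers
# $3 = current epoch time (optional, defaults to now)
remaining_window() {
  local old_ttl=$1 changed_at=$2 now=${3:-$(date +%s)}
  local left=$(( changed_at + old_ttl - now ))
  if [ "$left" -lt 0 ]; then
    left=0
  fi
  echo "$left"
}

# Worked example with made-up timestamps: old TTL 86400, 12 hours elapsed.
remaining_window 86400 1000000 1043200   # prints 43200
```

A result of 0 means every compliant cache has expired, so any client still reaching the old server is affected by one of the non-TTL causes listed under Root Cause.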
Resolution
**Fix 1 — Lower TTL before the next migration (proactive)**
```bash
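# 24–48 hours before migration (at least one full old TTL), reduce TTL to 60 seconds:
# In your DNS provider dashboard, change TTL from 86400 to 60
# Verify the new TTL is in effect once the resolver's old cache entry has expired:
dig @8.8.8.8 example.com A
# Should show 60 or less (cached TTLs count down), e.g. 'example.com. 60 IN A 1.2.3.4'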
```
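The 24–48 hour lead time follows from the old TTL: a resolver that cached the record just before the TTL was lowered keeps it for up to the full old TTL. A minimal sketch of that calculation follows; `lower_ttl_deadline` is a hypothetical helper name and the timestamps are made up.

```shell
#!/usr/bin/env bash
# Latest moment to publish the lowered TTL so that every TTL-compliant
# cache has picked it up by the planned cutover.
# $1 = planned cutover time (epoch seconds)
# $2 = current (old) TTL in seconds
lower_ttl_deadline() {
  local cutover=$1 old_ttl=$2
  echo $(( cutover - old_ttl ))
}

# Worked example: cutover at epoch 1700000000 with an old TTL of 86400.
lower_ttl_deadline 1700000000 86400   # prints 1699913600
```

Lowering the TTL later than this deadline leaves some caches holding the old record, with its long TTL, past the cutover.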
**Fix 2 — Force flush local DNS caches on client machines**
```bash
# macOS:
sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder
# Linux (systemd-resolved; older systems: sudo systemd-resolve --flush-caches):
sudo resolvectl flush-caches
# or:
sudo service nscd restart
# Windows:
ipconfig /flushdns
```
**Fix 3 — Fix JVM DNS caching (Java applications)**
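```java
// Option A, JVM startup flag (system property, used only if the security property is unset):
//   -Dsun.net.inetaddr.ttl=60
// Option B, set networkaddress.cache.ttl=60 in the JDK's java.security file.
// Option C, in code, before the first hostname lookup:
java.security.Security.setProperty("networkaddress.cache.ttl", "60");
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "10");
```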
**Fix 4 — Fix Node.js DNS caching**
```javascript
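// Node.js core does not cache DNS lookups itself; stale IPs usually come from
// keep-alive connections or an added cache layer. If you use an in-process
// cache such as the dnscache package, keep its TTL short:
const dnscache = require('dnscache')({
  enable: true,
  ttl: 60,
  cachesize: 1000
});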
```
Prevention
- Set DNS record TTL to 300 seconds (5 min) by default; reserve 86400 only for truly static IPs
- Lower TTL to 60 seconds 24–48 hours before any planned IP migration or failover
- Restore normal TTL (300–3600) after the migration is stable to reduce resolver load
- Configure the JVM DNS cache and any in-process Node.js DNS cache with a short TTL, and cap keep-alive connection lifetimes so clients re-resolve the hostname
- Use health-check-based DNS failover (Route 53 health checks, Cloudflare Load Balancing) for automatic IP switching without manual TTL management
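The default-TTL guideline above can be checked routinely rather than discovered during a migration. The sketch below flags records whose TTL exceeds a limit. `flag_high_ttl` is a hypothetical name, and the input is assumed to be dig answer-section lines from an authoritative server, since recursive resolvers report a TTL that is already counting down.

```shell
#!/usr/bin/env bash
# Read dig answer-section lines on stdin and print every record whose TTL
# exceeds the limit in $1 (default 300 seconds).
# Exit status is 1 if any record is over the limit, 0 otherwise.
flag_high_ttl() {
  awk -v max="${1:-300}" '
    BEGIN { bad = 0 }
    $3 == "IN" && ($2 + 0) > (max + 0) {
      print $1, $4, "TTL", $2, "exceeds", max
      bad = 1
    }
    END { exit bad }
  '
}

# Against a live zone (requires network; NS as found in Diagnosis Step 1):
#   dig +noall +answer @"$NS" example.com A | flag_high_ttl 300
```

Because the exit status reflects the result, the function can gate a scheduled job or a pre-migration checklist script.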