Database Failover Strategies: Passive vs Active
For a financial system, data must be available 99.99% of the time. When the primary database fails, the system must automatically switch to a standby replica, ideally without losing any committed data. This transition is known as failover.
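To make the 99.99% target concrete, the arithmetic below converts an availability percentage into a downtime budget. It is a minimal sketch; the function name `downtime_budget_minutes` is illustrative and not part of any tool mentioned here.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the maximum downtime, in minutes, allowed over the period."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_days * 24 * 60


if __name__ == "__main__":
    # Compare a few common availability targets over one year.
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/year")
```

At 99.99%, the budget works out to roughly 52.6 minutes per year, and every failover, planned or not, spends part of it.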
Replication Modes: The Consistency Trade-off
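Choosing between synchronous and asynchronous replication depends primarily on your RPO (Recovery Point Objective), the maximum amount of data you can afford to lose, weighed against the write latency you can tolerate.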
| Mode | RPO | Latency | Use Case |
|---|---|---|---|
| Synchronous | 0 (No Data Loss) | High | Internal Ledger, Banking Core |
| Asynchronous | > 0 (Small Loss) | Low | Logging, User Profiles, Analytics |
[!CAUTION] Synchronous replication can stall writes: every commit waits for the replica's acknowledgment, so the primary can appear unresponsive when the network between nodes is slow or unstable.
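The trade-off in the table can be illustrated with a small in-memory model. This is a sketch under simplifying assumptions, not a real replication protocol: the `Primary` and `Replica` classes are hypothetical, and an unreachable replica makes a synchronous commit fail immediately where a real database would block or time out.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    reachable: bool = True
    log: list = field(default_factory=list)

    def apply(self, record: str) -> bool:
        """Store the record and acknowledge it, or fail if unreachable."""
        if not self.reachable:
            return False
        self.log.append(record)
        return True


@dataclass
class Primary:
    replica: Replica
    synchronous: bool
    log: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # async records not yet shipped

    def commit(self, record: str) -> bool:
        if self.synchronous:
            # Sync: the write succeeds only after the replica acknowledges it.
            if not self.replica.apply(record):
                return False  # a real system would block or time out here
            self.log.append(record)
            return True
        # Async: commit locally now, ship to the replica later.
        self.log.append(record)
        self.pending.append(record)
        return True

    def ship_pending(self) -> None:
        """Send queued records in order until the replica stops acknowledging."""
        while self.pending and self.replica.apply(self.pending[0]):
            self.pending.pop(0)

    def records_at_risk(self) -> int:
        """Records that would be lost if the primary died right now."""
        return len(self.log) - len(self.replica.log)
```

In synchronous mode `records_at_risk()` is always zero, but a commit cannot complete while the replica is unreachable. In asynchronous mode commits always succeed locally, and the count of unshipped records is the effective RPO exposure.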
High-Availability Cluster Design
A modern HA setup relies on distributed consensus to avoid the dreaded split-brain.
The Split Brain Problem
When two database nodes both think they are the "Primary" due to a network partition, they can both accept writes, leading to irreversible data corruption.
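The Solution: A quorum-based mechanism, in which a leader must be backed by a majority of nodes (⌊N/2⌋ + 1), or a dedicated cluster manager (like Patroni for PostgreSQL) that ensures only one node holds the leader role at any time. Such managers typically delegate leader election to a consensus store (for example etcd or Consul, both built on Raft) instead of implementing Raft or Paxos themselves.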
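The majority rule and the single-leader guarantee can be sketched as follows. This is an illustration only: `LeaderLease` is a hypothetical single-process stand-in for a leader key with a time-to-live in a consensus store, not Patroni's or any store's actual API, and time is passed in explicitly to keep the example deterministic.

```python
from typing import Optional


def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if this side of a partition may elect or keep a leader."""
    return reachable_nodes >= quorum_size(cluster_size)


class LeaderLease:
    """A leader lock that expires unless its holder keeps renewing it."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, node: str, now: float) -> bool:
        """Acquire a free or expired lease, or renew the lease if already held."""
        expired = now >= self.expires_at
        if self.holder is None or expired or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl_seconds
            return True
        return False
```

In a five-node cluster split 3/2, `has_quorum(3, 5)` is true and `has_quorum(2, 5)` is false, so only one side can elect a leader. A two-node cluster needs both nodes for quorum, which is why an odd number of voters is the usual choice.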
Failure Scenario Analysis
| Incident | Detection | Resulting Action |
|---|---|---|
| Primary Process Crash | Immediate (PID lost) | Standby promoted in under 5 s. |
| Network Partition | Timeout (Keep-alive) | Leader lock expires in the cluster manager; new election begins. |
| Storage Failure | I/O Error | Node enters "Failed" state; manual intervention required. |
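The table above can be read as a small decision procedure. The sketch below is one hypothetical encoding of it; the `Incident` enum, the order of the checks, and the 10-second keep-alive timeout are assumptions for illustration, not values taken from any particular cluster manager.

```python
from enum import Enum
from typing import Optional


class Incident(Enum):
    PROCESS_CRASH = "primary process crash"
    NETWORK_PARTITION = "network partition"
    STORAGE_FAILURE = "storage failure"


# Resulting action per incident, mirroring the table.
ACTIONS = {
    Incident.PROCESS_CRASH: "promote standby",
    Incident.NETWORK_PARTITION: "let leader lock expire and start election",
    Incident.STORAGE_FAILURE: "mark node failed and require manual intervention",
}


def detect(
    process_alive: bool,
    seconds_since_keepalive: float,
    io_error: bool,
    keepalive_timeout: float = 10.0,  # assumed value; tune per deployment
) -> Optional[Incident]:
    """Classify the primary's health signals into at most one incident."""
    if io_error:
        # Checked first: promoting around bad storage needs a human decision.
        return Incident.STORAGE_FAILURE
    if not process_alive:
        return Incident.PROCESS_CRASH
    if seconds_since_keepalive > keepalive_timeout:
        return Incident.NETWORK_PARTITION
    return None
```

A monitoring loop could call `detect(...)` on each tick and look up `ACTIONS[incident]` whenever the result is not `None`.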
Critical Monitoring Metrics
To ensure a healthy failover, monitor these KPIs:
- Replication Lag (bytes or seconds): How far behind the primary is the replica?
- Failover Attempt Count: How often do automatic node switches occur?
- Disk I/O Latency: Ensure the replica can keep up with the primary's write throughput.
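As a sketch of how replication lag might gate a promotion decision, the snippet below computes lag in both units from one sample. The `LagSample` fields and the thresholds (16 MiB, 5 seconds) are assumptions for illustration; real values come from your database's replication statistics and your RPO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LagSample:
    primary_log_position: int  # bytes of log written by the primary
    replica_log_position: int  # bytes of log replayed by the replica
    last_replay_time: float    # epoch seconds of the last replayed transaction


def lag_bytes(sample: LagSample) -> int:
    """Bytes of log the replica has not yet replayed."""
    return max(0, sample.primary_log_position - sample.replica_log_position)


def lag_seconds(sample: LagSample, now: float) -> float:
    """Time-based lag; zero when caught up, so an idle primary is not flagged."""
    if lag_bytes(sample) == 0:
        return 0.0
    return max(0.0, now - sample.last_replay_time)


def safe_to_promote(
    sample: LagSample,
    now: float,
    max_lag_bytes: int = 16 * 1024 * 1024,  # assumed threshold
    max_lag_seconds: float = 5.0,           # assumed threshold
) -> bool:
    """True if promoting this replica stays within both lag limits."""
    return (
        lag_bytes(sample) <= max_lag_bytes
        and lag_seconds(sample, now) <= max_lag_seconds
    )
```

Tracking both units is useful because a large byte gap can drain quickly on fast disks, while a small byte gap that has not moved for minutes suggests that replay has stalled.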
[!TIP] Engineering Advice: Always test your failover strategy with a chaos engineering experiment. Unplug a node in staging and observe how your application handles the connection reset.
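On the application side, surviving that connection reset usually means retrying with backoff while the new primary is elected. The helper below is a minimal sketch; `run_with_reconnect` is a hypothetical name, and `operation` stands for whatever callable opens a connection and runs your query.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_reconnect(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # ConnectionResetError and ConnectionRefusedError are subclasses.
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Retries are only safe for idempotent operations. A write that was committed just before the reset may be applied twice unless it carries a unique request identifier or similar guard.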