Database Failover Strategies: Passive vs Active

A financial system typically targets 99.99% availability, which allows roughly 52 minutes of downtime per year. When the primary database fails, the system must automatically switch to a standby replica without losing data. This transition is known as failover.

Replication Modes: The Consistency Trade-off

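Choosing between synchronous and asynchronous replication depends primarily on your RPO (Recovery Point Objective), the maximum amount of data you can afford to lose, weighed against the write latency you can tolerate.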

Mode	RPO	Write Latency	Use Case
Synchronous	0 (No Data Loss)	High	Internal Ledger, Banking Core
Asynchronous	> 0 (Small Loss)	Low	Logging, User Profiles, Analytics

[!CAUTION] Synchronous replication can cause write stalls: the Primary holds each commit until the replica acknowledges it, so the Primary can become unresponsive to writes if the network between nodes is unstable.
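
The trade-off in the table can be illustrated with a small in-memory model. This is only a sketch under simplifying assumptions: `Primary`, `Replica`, and `records_lost_on_failover` are hypothetical names, there is no real network or disk, and a real database would block or time out where this model simply returns `False`.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    log: list = field(default_factory=list)
    reachable: bool = True

    def apply(self, record: str) -> bool:
        """Append the record and acknowledge it, if the replica is reachable."""
        if not self.reachable:
            return False
        self.log.append(record)
        return True


@dataclass
class Primary:
    replica: Replica
    synchronous: bool
    log: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # async records not yet shipped

    def commit(self, record: str) -> bool:
        """Return True once the write is acknowledged to the client."""
        if self.synchronous:
            # Synchronous: acknowledge only after the replica confirms.
            if not self.replica.apply(record):
                return False  # a real system would block or time out here
            self.log.append(record)
            return True
        # Asynchronous: acknowledge immediately and ship the record later.
        self.log.append(record)
        self.pending.append(record)
        return True

    def ship(self) -> None:
        """Send pending records to the replica (asynchronous mode only)."""
        while self.pending and self.replica.reachable:
            self.replica.apply(self.pending.pop(0))


def records_lost_on_failover(primary: Primary) -> int:
    """Acknowledged writes missing from the replica if the primary died now."""
    return len(primary.log) - len(primary.replica.log)
```

In this model a synchronous primary never acknowledges a write the replica lacks (RPO of 0), while an asynchronous primary can lose whatever is still in `pending` when it fails.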

High-Availability Cluster Design

A modern HA setup relies on distributed consensus among its nodes to avoid the dreaded Split-Brain.

The Split Brain Problem

When two database nodes both think they are the "Primary" due to a network partition, they can both accept writes, leading to irreversible data corruption.

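The Solution: A quorum-based mechanism, in which a leader needs the votes of a majority of nodes (⌊N/2⌋ + 1), or a dedicated cluster manager (like Patroni for PostgreSQL) that ensures only one node is elected as the leader at any time. Patroni achieves this by holding a leader lock in a distributed configuration store such as etcd, which in turn relies on a consensus algorithm such as Raft or Paxos.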
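
The majority rule can be sketched in a few lines. The function names `quorum_size` and `can_elect_leader` are illustrative only and do not come from any particular cluster manager.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def can_elect_leader(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition may elect a leader only if it contains a majority."""
    return reachable_nodes >= quorum_size(cluster_size)
```

For a 5-node cluster split 3/2 by a partition, only the 3-node side satisfies `can_elect_leader`, so at most one leader can exist. A 4-node cluster split 2/2 leaves neither side with a quorum, which is one reason odd cluster sizes are usually preferred.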

Failure Scenario Analysis

Incident	Detection	Resulting Action
Primary Process Crash	Immediate (PID lost)	Standby promoted in under 5 s.
Network Partition	Timeout (Keep-alive)	Sentinel drops leader lock; new election begins.
Storage Failure	I/O Error	Node enters "Failed" state; manual intervention required.
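
The Network Partition row depends on a leader lock that expires unless the leader keeps renewing it. The following is a minimal single-process sketch of that idea, assuming a time-to-live (TTL) lease; `LeaderLock` and its methods are hypothetical, and in a real deployment the lock would live in a consensus-backed store rather than in local memory. The current time is passed in explicitly so the behavior is easy to test.

```python
from typing import Optional


class LeaderLock:
    """A lease-style lock: the holder must renew it before the TTL runs out."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, node: str, now: float) -> bool:
        """Take or renew the lock; fails while another node's lease is live."""
        if self.holder not in (None, node) and now < self.expires_at:
            return False
        self.holder = node
        self.expires_at = now + self.ttl
        return True

    def leader(self, now: float) -> Optional[str]:
        """Current leader, or None if the lease has expired."""
        return self.holder if now < self.expires_at else None
```

If a partitioned leader cannot reach the lock to renew it, the lease lapses after `ttl_seconds` and a standby can acquire it. The TTL therefore sets a lower bound on how quickly a partition can be detected.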

Critical Monitoring Metrics

To ensure a healthy failover, monitor these KPIs:

Replication Lag (bytes or seconds): How far behind is the replica?
Failover Attempt Counts: Frequency of automatic node switches.
Disk I/O Latency: Ensure the replica can keep up with the primary's write throughput.
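
As an illustration of the replication lag metric, the sketch below computes lag in bytes from two PostgreSQL-style log sequence numbers (LSNs), which are written as two hexadecimal halves separated by a slash. It assumes the LSN strings have already been read from the primary and the replica; the helper names are hypothetical.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN such as '0/3000060' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of log the replica has not yet received or replayed."""
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn))
```

For example, `replication_lag_bytes("0/3000060", "0/3000000")` gives 96. A threshold on this value is a natural alert condition, although the right threshold depends on the workload.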

[!TIP] Engineering Advice: Always test your failover strategy in a "Chaos Engineering" experiment. Unplug a node in staging and observe how your application handles the connection reset.
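
One thing such an experiment tends to expose is whether the application retries after a connection reset. The sketch below shows one possible approach, retrying with exponential backoff. `run_with_retry` is a hypothetical helper, and it assumes the database driver surfaces resets as Python's built-in `ConnectionError` (many drivers raise their own exception types instead) and that the operation is safe to repeat.

```python
import time


def run_with_retry(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on connection errors with backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            # Wait longer after each failure: 0.1 s, 0.2 s, 0.4 s, ...
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so that tests can replace the real delay. Retrying a write is only safe when the operation is idempotent, since the original attempt may have committed before the connection dropped.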