You have a replication failure in progress or an anomaly you cannot explain. The symptom is one of: data that was written cannot be read back; data that was visible has disappeared; records arrive in causally impossible order; two nodes are writing simultaneously; a quorum read is stale despite the math being correct.
This skill imposes a diagnostic framework: every replication failure traces to one of three classes — leader failover pitfalls, replication lag anomalies, or quorum edge cases. Each class has a bounded set of mechanisms with known mitigations. The skill maps your symptoms to a class, narrows to the mechanism, and produces a remediation plan.
This is the companion to replication-strategy-selector. That skill helps you choose. This skill helps you diagnose. Use this one when something is already wrong, or when you want to audit a configuration for latent failure before it manifests.
Cross-references:
- replication-strategy-selector: for choosing topology, sync mode, quorum values, and conflict resolution strategy from scratch
- distributed-failure-analyzer: for failures whose root cause is a network fault, clock unreliability, or process pause (zombie leaders, LWW data loss via clock skew, cascading timeouts)
- consistency-model-selector: for selecting the right consistency and isolation guarantees to prevent a class of anomalies at the application layer

Before analysis, collect the following. Ask the user for any that are missing.
Required:
Useful:
If no codebase or configuration is available: accept a verbal description and produce an analysis. The report will note which findings are confirmed vs. inferred.
WHY: The three failure classes have different root causes and completely different mitigations. Treating a replication lag anomaly as a failover problem (or vice versa) leads to wasted effort and leaves the actual failure in place. Classification first prevents this.
Class A: Leader failover pitfalls
Applies to single-leader replication. A failover is the process of promoting a follower to be the new leader when the current leader fails. Automatic failover typically follows three steps: (1) detect the leader has failed via timeout, (2) elect a new leader (usually the most up-to-date replica, chosen by election or by a previously elected controller node), (3) reconfigure clients to route writes to the new leader and ensure the old leader becomes a follower if it recovers.
Failover is "fraught with things that can go wrong." The four documented failure modes are:
| Failure mode | Mechanism | Signal |
|---|---|---|
| Async data loss | New leader was an async follower — had not received all writes from old leader before failure. Old leader's unreplicated writes are discarded. | Writes confirmed to client are missing after failover |
| Primary key conflict | New leader's autoincrement counter lagged behind old leader's. New leader reissues keys already assigned by the old leader. Any system keyed on these IDs (Redis cache, secondary DB, audit log) develops cross-system inconsistency. | Duplicate key errors or wrong data returned for existing IDs in external systems |
| Split brain | Old leader recovers and does not recognize the new leader. Both nodes accept writes simultaneously. Without a process to resolve conflicts, data is lost or corrupted. Some systems "shut down one node if two leaders are detected" — but if this mechanism is misconfigured, both nodes may shut down. | Two nodes both reporting as leader; writes going to both; diverging replica state |
| Timeout miscalibration | Timeout too short: unnecessary failovers under load spike or network glitch, making the situation worse. Timeout too long: prolonged unavailability during genuine failures. A temporary load spike can cause response time to exceed the timeout, triggering a failover that increases load further. | Repeated failovers during traffic spikes; or prolonged unavailability before failover triggers |
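
The async data loss and primary key conflict rows can be reproduced with a toy model. The sketch below is illustrative only: `Leader`, `promote`, and the `cache` dictionary are hypothetical stand-ins for a database node, a failover step, and an external system keyed on row IDs.

```python
class Leader:
    """Toy single-leader node with an autoincrement counter (hypothetical model)."""

    def __init__(self, next_id=1):
        self.next_id = next_id
        self.rows = {}

    def insert(self, value):
        row_id = self.next_id
        self.rows[row_id] = value
        self.next_id += 1
        return row_id


def promote(follower_rows):
    """Promote a follower: its counter resumes after the last row it received."""
    new_leader = Leader(next_id=max(follower_rows, default=0) + 1)
    new_leader.rows = dict(follower_rows)
    return new_leader


old = Leader()
cache = {}        # external system keyed on row IDs
replicated = {}   # rows the asynchronous follower has received so far

for user in ["alice", "bob", "carol"]:
    rid = old.insert(user)
    cache[rid] = f"profile:{user}"
    if user != "carol":          # carol's write never reached the follower
        replicated[rid] = user

new = promote(replicated)        # old leader fails; carol's row is discarded
reused_id = new.insert("dave")   # the lagging counter issues ID 3 again

lost_writes = set(old.rows) - set(replicated)
conflict = cache[reused_id] != f"profile:{new.rows[reused_id]}"
```

In this run, `lost_writes` holds the one confirmed write that vanished, and `conflict` is true because the cache still maps the reused ID to the earlier user's data.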
Class B: Replication lag anomalies
Applies to single-leader asynchronous replication with read-scaling (reads routed to followers). The replication lag — the delay between a write being applied on the leader and being reflected on a follower — may be milliseconds under normal conditions, but can grow to seconds or minutes under load or network issues. Three named anomaly patterns arise:
| Anomaly | Description | Mechanism | Named guarantee required |
|---|---|---|---|
| Read-after-write violation | User submits data; immediately reads it back; does not see it. From the user's perspective, the submission was lost. | Read was routed to a follower that had not yet received the write. | Read-after-write consistency (also called read-your-writes) |
| Monotonic reads violation | User reads data (e.g., a comment); reloads the page; the data is gone. Time appears to move backward. | Sequential reads were routed to different replicas with different lag. The second read went to a more-lagged replica that had not yet received the write the first read saw. | Monotonic reads |
| Consistent prefix reads violation | User sees causally related records in an impossible order — an answer appearing before the question it answers. | In a partitioned database where partitions operate independently, partition A (carrying the reply) had low lag and partition B (carrying the question) had high lag. The observer read partition A first. | Consistent prefix reads |
Class C: Quorum edge cases
Applies to leaderless (Dynamo-style) replication, as in Cassandra, Riak, and Voldemort. (Amazon's DynamoDB, despite the name, is a separate system with a different architecture.) The quorum condition w + r > n is designed to ensure that at least one node in every read set has seen every acknowledged write. However, six scenarios break this guarantee in practice even when the condition is mathematically satisfied:
| Edge case | Mechanism |
|---|---|
| Sloppy quorum active | A network interruption isolated the client from the n "home" nodes for a value. Writes were accepted by w nodes outside the home set (sloppy quorum). Even though w + r > n, the r read nodes are the home nodes — they have not seen the writes yet. Hinted handoff has not completed. |
| Concurrent writes, no clear ordering | Two writes to the same key occurred simultaneously. The quorum condition does not determine which write happened first. If last-write-wins is the conflict resolution strategy, the write with the lower timestamp (possibly the causally later write, if clocks are skewed) is silently discarded. |
| Write concurrent with read | A write was in-flight when a read was issued. The write was reflected on some of the r replicas but not others. The read may return the old value, the new value, or — in the worst case — the read returns the old value and the write is subsequently applied, but a future read may still return the old value from a different replica subset. |
| Partial write success, no rollback | A write succeeded on some replicas but failed on others (e.g., disk full) and was reported as failed overall (fewer than w acknowledgements). The replicas that did succeed are not rolled back. Subsequent reads may or may not see the partially-written value. |
| Node restored from stale replica | A node carrying a new value fails and its data is restored from a replica carrying an old value. The number of replicas storing the new value falls below w, breaking the quorum condition retrospectively. |
| Timing edge cases at linearizability boundary | Even with w + r > n fully satisfied, quorum reads are not linearizable — there are race conditions where unlucky timing can produce stale reads. Quorums provide eventual consistency, not linearizability. |
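
The overlap argument behind w + r > n, and the way a sloppy quorum escapes it, can be checked by brute force. This is a small sketch with made-up node names; `quorums_always_overlap` is a hypothetical helper, not part of any database API.

```python
from itertools import combinations


def quorums_always_overlap(nodes, w, r):
    """True if every w-node write set shares a node with every r-node read set."""
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


home = ["a", "b", "c"]  # the n = 3 home nodes for a key

strict_ok = quorums_always_overlap(home, w=2, r=2)  # w + r > n
weak_ok = quorums_always_overlap(home, w=1, r=2)    # w + r == n

# Sloppy quorum: during a partition the w = 2 acknowledgements come from
# fallback nodes outside the home set, so a later read of home nodes misses them.
sloppy_write_set = {"x", "y"}
read_set = {"a", "b"}
sloppy_overlap = bool(sloppy_write_set & read_set)
```

Here `strict_ok` is true and `weak_ok` is false, as the inequality predicts, while `sloppy_overlap` is false even though w + r > n still holds numerically.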
WHY: The class narrows the diagnostic space; the specific mechanism determines which mitigation is effective. "Replication lag anomaly" does not tell you whether you need sticky routing, a timestamp-based threshold, or causal consistency at the partition level — only identifying the exact anomaly pattern does.
For Class A (leader failover):
Confirm which of the four failure modes is active by asking:
For Class B (replication lag anomalies):
Confirm which anomaly pattern is active by asking:
For each anomaly, identify whether the read routing layer can be changed (application-level) or whether the database must be configured to provide the guarantee (database-level).
For Class C (quorum edge cases):
Confirm which edge case applies:
WHY: Each mechanism has a specific mitigation. Applying the wrong mitigation (e.g., increasing quorum for a read-after-write problem in a single-leader system) wastes effort and may introduce new problems. This step matches mechanism to fix precisely.
Async data loss:
synchronous_standby_names, MySQL's semi-sync replication. Accept the latency cost on writes. Alternatively: use a consensus-based replication protocol (Group Replication, Galera) where a write is not confirmed until a majority has applied it.
Primary key conflict:
AUTO_INCREMENT status before decommissioning it; if the old leader is unavailable, query all external systems for the highest key they have seen.
Split brain:
Timeout miscalibration:
Read-after-write violation:
Multiple techniques can implement read-after-write consistency:
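One such technique is to serve a user's reads from the leader for a fixed window after that user's last write, and from followers otherwise. A minimal sketch follows, assuming the application records the time of each user's last write; the function name, the `last_write_at` parameter, and the 60-second window are illustrative choices, and the window should exceed the worst replication lag you expect.

```python
import time

LEADER_READ_WINDOW_S = 60.0  # assumption: a conservative upper bound on lag


def choose_read_target(last_write_at, now=None, window=LEADER_READ_WINDOW_S):
    """Return "leader" if the user wrote recently, otherwise "follower".

    last_write_at: epoch seconds of the user's last write, or None if the
    user has not written in this session.
    """
    if now is None:
        now = time.time()
    if last_write_at is not None and now - last_write_at < window:
        return "leader"
    return "follower"
```

The routing decision is per user, so other users' reads still scale across followers. If lag can exceed the window, the guarantee is lost, so the window is worth alerting on.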
Monotonic reads violation:
Ensure that a given user's sequential reads always go to the same replica. The replica can be chosen based on a hash of the user ID rather than randomly. This ensures the user's observed state only moves forward in time.
Caveat: if the assigned replica fails, the user's reads must be rerouted to another replica. At that moment, monotonic reads may be violated for the duration until the new replica catches up. This is generally acceptable — the guarantee is "best effort" in the face of replica failure.
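The sticky routing described above might look like the following sketch. It uses rendezvous (highest-random-weight) hashing rather than a plain modulo, so that when a replica fails only the users pinned to that replica are moved; this is a swapped-in refinement, not something the text prescribes. `pick_replica` and the replica names are hypothetical.

```python
import hashlib


def pick_replica(user_id, replicas, healthy=None):
    """Pin a user to one replica deterministically, skipping unhealthy ones.

    Each (user, replica) pair gets a stable score; the healthy replica with
    the highest score wins.
    """
    candidates = [r for r in replicas if healthy is None or r in healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available")

    def score(replica):
        key = f"{user_id}:{replica}".encode("utf-8")
        return hashlib.sha256(key).digest()

    return max(candidates, key=score)


replicas = ["replica-1", "replica-2", "replica-3"]
assigned = pick_replica("user-7", replicas)
```

`hashlib` is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would break stickiness across application servers.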
Consistent prefix reads violation:
Ensure that writes with causal dependencies are written to the same partition. This prevents the ordering inversion — if the question and answer go to the same partition, the follower will always apply them in the correct order.
If causally related writes cannot always be co-located on the same partition (because the data model makes this impractical): use a database or middleware layer that tracks causal dependencies explicitly (causal consistency via version vectors) and ensures that a read does not return a causally later write without also returning its causal prerequisites.
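Both approaches can be sketched briefly. The first function routes writes by a key shared by causally related records; the second is a deliberately simplified stand-in for causal dependency tracking, using a single explicit `depends_on` field instead of version vectors. All names and record shapes here are assumptions for illustration.

```python
import zlib


def partition_for(causal_key, num_partitions):
    """Route by a key shared by causally related writes (e.g., a thread ID)."""
    return zlib.crc32(causal_key.encode("utf-8")) % num_partitions


def causally_consistent_view(records):
    """Hide records whose declared prerequisite is not visible yet.

    Repeats until stable so that chains of dependencies are handled.
    """
    visible = list(records)
    while True:
        ids = {r["id"] for r in visible}
        kept = [
            r for r in visible
            if r.get("depends_on") is None or r["depends_on"] in ids
        ]
        if len(kept) == len(visible):
            return kept
        visible = kept


question = {"id": "q1", "thread": "t-42", "depends_on": None}
answer = {"id": "a1", "thread": "t-42", "depends_on": "q1"}

same_partition = (
    partition_for(question["thread"], 8) == partition_for(answer["thread"], 8)
)
early_view = causally_consistent_view([answer])            # question lagging
full_view = causally_consistent_view([question, answer])   # both arrived
```

With the filter in place, a reader who receives the answer before the question sees neither, rather than an answer with no question.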
Sloppy quorum / hinted handoff in progress:
durable_writes = true in Cassandra, allow_offline_hnodes = false in Riak) to get strict quorum behavior at the cost of lower availability during network partitions.
LWW + concurrent writes:
Partial write success:
Stale node restoration:
nodetool repair; Riak: riak-admin repair) before routing reads to it.
Linearizability requirement:
transaction-isolation-selector).
WHY: Latent replication failures exist in configuration and code before they manifest in production. Proactive scanning finds them at low cost. These are the specific patterns to search for.
Anti-pattern 1: Reads always routed to any random follower
# Look for: round-robin read balancing, random replica selection
# or: load balancer distributing reads across all replicas without session affinity
Risk: Read-after-write and monotonic reads violations under any replication lag. Any write may be invisible to a read that lands on a different, more-lagged follower.
Fix: Implement user-session sticky reads or timestamp-gated replica selection.
Anti-pattern 2: Autoincrement primary keys with asynchronous replication and external systems
# Look for: AUTO_INCREMENT columns, SERIAL columns, sequences
# combined with: external system (Redis, Elasticsearch, audit log) using the same IDs
# combined with: asynchronous replication with manual or automatic failover
Risk: After failover, the new leader reissues IDs that were already assigned by the old leader but not yet replicated. The external system retains entries for the old IDs; the new leader's records point to different data.
Fix: Use UUIDs or application-generated globally unique IDs. Or, ensure the autoincrement sequence is advanced past the old leader's maximum before the new leader begins accepting writes.
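Both fixes can be sketched in a few lines. The helper names are hypothetical, and the `margin` parameter is an assumption: it pads the restart value because the old leader may have issued IDs that no surviving system ever observed.

```python
import uuid


def safe_autoincrement_start(new_leader_max, external_maxes, margin=1000):
    """Pick a restart value above every ID seen anywhere, plus a safety margin.

    new_leader_max: highest ID present on the promoted follower.
    external_maxes: highest ID seen by each external system (cache, index, log).
    """
    highest_seen = max([new_leader_max, *external_maxes])
    return highest_seen + margin + 1


def new_record_id():
    """Application-generated ID that does not depend on any node's counter."""
    return str(uuid.uuid4())


next_id = safe_autoincrement_start(100, [105, 103], margin=10)
```

The computed value would then be applied to the new leader's sequence before it accepts writes; how that is done depends on the database.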
Anti-pattern 3: No fencing / STONITH configured for leader failover
# Look for: automatic failover configuration without a fencing mechanism
# e.g., Patroni without fencing, MHA without power fencing, manual failover runbooks
Risk: The old leader recovers and does not know it has been demoted. Both nodes accept writes. Split brain.
Fix: Configure a fencing mechanism. Test it periodically. See distributed-failure-analyzer for fencing token implementation details.
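As a rough illustration of the fencing token idea, the sketch below has the storage layer reject any write whose token is lower than the highest it has already accepted. `FencedStore` is a hypothetical class; it assumes each newly elected leader is handed a strictly larger token by whatever performs the election.

```python
class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        """Apply the write only if the token is not older than one already seen."""
        if token < self.highest_token:
            return False  # writer was demoted; refuse the write
        self.highest_token = token
        self.data[key] = value
        return True


store = FencedStore()
accepted_old = store.write(1, "k", "from old leader")
accepted_new = store.write(2, "k", "from new leader")
accepted_zombie = store.write(1, "k", "old leader after recovery")
```

The recovered old leader's write is refused without the old leader needing to know it was demoted, which is the property that makes this robust to partitions and process pauses.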
Anti-pattern 4: Sloppy quorums enabled in a system requiring read freshness
# Cassandra: read_repair_chance and dclocal_read_repair_chance < 1.0
# with no anti-entropy (nodetool repair) schedule
# Riak: allow_mult = false (no sibling handling) with sloppy quorums
# Voldemort: default config enables sloppy quorums
Risk: During and after network partitions, reads return stale data despite w + r > n being satisfied, because the w writes went to non-home nodes.
Fix: Either disable sloppy quorums (strict quorum mode) or implement application-layer awareness of hinted handoff status.
Anti-pattern 5: No anti-entropy process, relying solely on read repair
# Cassandra: nodetool repair not scheduled
# Voldemort: no anti-entropy configured
# Custom leaderless system: no background reconciliation
Risk: Values that are rarely read will diverge permanently across replicas. Read repair only runs when a value is actually read. Infrequently-read keys can remain stale indefinitely, violating durability guarantees.
Fix: Schedule regular anti-entropy runs. For Cassandra: nodetool repair on a weekly schedule (or more frequently for high-write workloads). Ensure the interval is shorter than the gc_grace_seconds (tombstone expiry period) to prevent deleted data from "coming back."
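The interval check is simple arithmetic and can be automated in a configuration audit. A minimal sketch, assuming the Cassandra default `gc_grace_seconds` of 864000 (10 days) unless the table overrides it; the function name is hypothetical.

```python
from datetime import timedelta

DEFAULT_GC_GRACE_SECONDS = 864_000  # Cassandra default: 10 days


def repair_schedule_is_safe(repair_interval, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True if every replica is repaired before tombstones can be purged."""
    return repair_interval.total_seconds() < gc_grace_seconds


weekly_ok = repair_schedule_is_safe(timedelta(days=7))
biweekly_ok = repair_schedule_is_safe(timedelta(days=14))
```

A weekly schedule passes against the default and a two-week schedule does not. In practice the repair's own running time should also fit inside the window.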
Output a structured report with:
"The read is stale — this must be a replication bug."
Stale reads in asynchronous replication are expected behavior, not a bug. The replication lag is working as designed. The issue is that the application assumed synchronous replication behavior but is running in asynchronous mode. The fix is application-level read routing, not replication reconfiguration.
"We set w + r > n so our reads must be consistent."
The quorum condition ensures overlap between write and read node sets under normal conditions. It does not guarantee freshness when: a sloppy quorum was used (writes went to non-home nodes), concurrent writes occurred with LWW resolution and clock skew, or a write partially succeeded. Quorums provide eventual consistency by default, not linearizability.
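The clock-skew case is easy to see in miniature. In this sketch the timestamps and the five-second skew are invented, and `lww_merge` is a hypothetical stand-in for a store's last-write-wins resolution.

```python
def lww_merge(a, b):
    """Last-write-wins: keep the version with the higher timestamp."""
    return a if a["ts"] >= b["ts"] else b


# Client 1 writes at real time 100.0, but its clock runs 5 seconds fast.
first = {"value": "v1", "ts": 100.0 + 5.0}
# Client 2 writes one second later in real time, with an accurate clock.
second = {"value": "v2", "ts": 101.0}

winner = lww_merge(first, second)
```

The surviving value is `v1`: the later write is discarded without any error being reported to either client.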
"The failover succeeded — why are there duplicate key errors?"
The promoted follower's autoincrement counter reflects the writes it received before failover. If the old leader had advanced its counter further (on writes not yet replicated), the new leader's counter is behind. When the new leader issues new IDs, it reuses IDs the old leader already assigned. This is especially dangerous when an external system (Redis, Elasticsearch) is keyed on these IDs — the external system retains entries for the old IDs, and the new leader's records now point to different data in the external system.
"We have two leaders — one of them must be wrong."
Both leaders may believe they are legitimate. The old leader did not receive the demotion message (it may have been partitioned when the new leader was elected, or the fencing mechanism failed to fire). The solution is not to query which one is "right" but to forcibly fence the old leader and then reconcile the writes it accepted during the split brain window.
"Monotonic reads just means we need stronger consistency."
Monotonic reads is a weaker guarantee than strong consistency. It only requires that a single user's reads do not observe an older state after having observed a newer state. It does not require that all users see the same state at the same time. Implementing it with sticky replica routing is significantly cheaper than requiring strong consistency across the cluster.
"The quorum write failed, so the data wasn't written."
A failed quorum write means fewer than w nodes acknowledged. But the nodes that did acknowledge are not rolled back. The write may be partially applied across some replicas. Subsequent reads may or may not return the partially-written value, depending on which replicas the read contacts. Applications that retry a failed write without making it idempotent can create inconsistencies.
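A retry becomes safe when each logical write carries an identifier that replicas use to ignore duplicates. The following is a toy sketch of that idea; `Replica`, `quorum_write`, and the request ID scheme are hypothetical and not modeled on a specific database.

```python
class Replica:
    """Toy replica that applies each request ID at most once."""

    def __init__(self, available=True):
        self.available = available
        self.applied = set()
        self.balance = 0

    def apply(self, request_id, amount):
        if not self.available:
            raise OSError("replica unavailable")
        if request_id in self.applied:
            return  # duplicate delivery: already applied, nothing to do
        self.applied.add(request_id)
        self.balance += amount


def quorum_write(replicas, w, request_id, amount):
    """Send to all replicas; report success only if at least w acknowledged."""
    acks = 0
    for replica in replicas:
        try:
            replica.apply(request_id, amount)
            acks += 1
        except OSError:
            pass  # no rollback on the replicas that already succeeded
    return acks >= w


nodes = [Replica(), Replica(available=False), Replica(available=False)]
first_try = quorum_write(nodes, w=2, request_id="req-1", amount=10)
partial_balance = nodes[0].balance      # applied despite the reported failure

for node in nodes:
    node.available = True
retry = quorum_write(nodes, w=2, request_id="req-1", amount=10)
balances = [node.balance for node in nodes]
```

The first attempt is reported as failed yet leaves one replica updated. The retry with the same request ID brings every replica to 10, whereas a non-idempotent retry would leave the first replica at 20.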
Scenario: A team runs MySQL with a single-leader replication topology. During maintenance, a follower is promoted to leader. Shortly after, users start seeing other users' private data — profile photos and messages belonging to a different account.
Trigger: Security incident report. Immediate investigation required.
Process:
AUTO_INCREMENT counter started below the old leader's maximum. The new leader reissued IDs already assigned by the old leader. A Redis cache was storing user profile data keyed on MySQL row IDs. Redis entries for the old IDs now returned different users' profile data because the new leader's rows with those IDs belong to different users.
AUTO_INCREMENT past that value. (d) Invalidate all Redis entries in the affected ID range. (e) Audit which users were served incorrect data and notify them.
Output: Failure analysis report identifying primary key conflict as root cause. Immediate remediation steps. UUID migration plan. Updated failover runbook with autoincrement counter validation step.
Scenario: A social application uses a single-leader MySQL setup with five followers. Reads are distributed across all followers via a round-robin load balancer. Users regularly report that comments they just posted do not appear when they reload the page. Occasionally, a comment that was visible disappears and then reappears.
Trigger: Support ticket volume on "my posts disappear" exceeds threshold. Product team requests investigation.
Process:
last_write_at in the user session. After one minute, route to followers. (b) For monotonic reads: hash user ID to always route to the same follower. If that follower fails, reroute, accepting a brief monotonic reads violation during failover. (c) Scan the load balancer configuration to confirm round-robin routing is what is actually configured (vs. session-sticky or latency-aware routing).
Output: Failure analysis report. Read routing change specification. Session-layer implementation plan for last_write_at tracking. Load balancer reconfiguration recommendation.
Scenario: A team runs Cassandra with n=3, w=2, r=2 (satisfying w + r > n). After a 20-minute network partition between two datacenters, quorum reads are returning values that are several minutes old. The partition healed 10 minutes ago.
Trigger: Monitoring alert showing read staleness exceeding acceptable threshold after a network event.
Process:
nodetool tpstats shows a non-empty hints queue. Monitor the queue draining rate.
HintsService metrics). (b) For values that must be current: force a read repair by issuing a CONSISTENCY QUORUM read or running nodetool repair on the affected keyspace. (c) Long-term: evaluate whether sloppy quorums provide acceptable trade-offs for this keyspace. For data requiring freshness guarantees, configure LOCAL_QUORUM with durable_writes = true and disable sloppy quorums. For data tolerating eventual consistency, sloppy quorums increase availability and are appropriate.
Output: Failure analysis report. Hinted handoff monitoring procedure. Decision framework for which keyspaces should use strict vs. sloppy quorums. Updated runbook for post-partition recovery validation.
- references/failover-checklist.md: step-by-step leader failover checklist covering pre-failover verification, the four failure modes and their per-mode checks, post-failover validation steps, and rollback procedure
- references/lag-anomaly-patterns.md: complete replication lag anomaly reference for read-after-write, monotonic reads, and consistent prefix reads, each with formal definition, concrete example, implementation techniques, and cross-device complexity considerations
- references/quorum-edge-cases.md: the six quorum edge cases in detail, with the conditions that trigger each, detection signals, mitigation options, and the distinction between sloppy quorums (durability guarantee) and strict quorums (freshness guarantee)

This skill is licensed under CC-BY-SA-4.0.
Source: BookForge — Designing Data-Intensive Applications by Martin Kleppmann.
Install related skills from ClawHub:
clawhub install bookforge-replication-strategy-selector
Or install the full book set from GitHub: bookforge-skills