Apply this skill when:
The core pattern: Recovery is not a single operation. It is five cooperating design decisions made before any incident occurs: (1) choosing how rollbacks are controlled to prevent reintroducing vulnerabilities, (2) decoupling rollout velocity from rollout policy so speed can be adjusted without code changes, (3) eliminating wall-clock time dependencies that break recovery under clock skew, (4) designing an explicit revocation mechanism that distributes state rather than requiring real-time coordination, and (5) provisioning emergency access infrastructure that survives normal-path failures.
This skill depends on resilience-and-blast-radius-design. The compartmentalization and failure domain concepts from that skill apply here: revocation should be compartmentalized (half-machine self-protection during KRL updates), and emergency access must be designed as a low-dependency component that avoids the failure domains of the primary access path.
Before starting, confirm you have:
If a system description is provided, scan for:
Before selecting mechanisms, establish the recovery context that governs all downstream decisions.
| Type | Characteristics | Recovery complexity |
|---|---|---|
| --- | --- | --- |
| Application software | Updated by an external package manager; killed and restarted by external tooling | Lower — rollback is managed outside the component |
| System software | Self-updating (package management daemon, OS image); updates itself by overwriting its executable | Higher — must be able to update itself without losing the ability to continue updating |
| Firmware | Self-updating with hardware constraints; may use one-time-programmable memory for version tracking; spare parts introduce old-version risk | Highest — rollback may be physically infeasible; hardware key storage is limited |
Why: Self-updating components present the hardest recovery design problem. They can actively prevent themselves from being updated if maliciously modified. The rollback control mechanism must work even for these components, but the "intended behavior" (what rollback means) is itself ambiguous.
Recovery means returning to a known good state. Define that state before choosing mechanisms:
Why: The more thoroughly the intended state is encoded and compared against the actual state, the easier it is to detect deviations, trigger automated repair, and confirm recovery is complete. Systems that do not know their intended state cannot verify that recovery succeeded.
Output for this step: Component type classification and intended state definition.
For self-updating components (and application software where security patch coverage is required), a bare "allow arbitrary rollbacks" policy reintroduces known vulnerabilities. A bare "never allow rollbacks" policy removes the path to a stable state when a bad update is deployed. The three mechanisms below provide controlled middle ground.
Prerequisite for all three mechanisms: Updates are cryptographically signed and the signature covers the component image and its version metadata.
| Mechanism | How it works | Best for | Key limitation |
|---|---|---|---|
| --- | --- | --- | --- |
| Key rotation | Rotating the update signing key invalidates older releases that were signed with the old key | Long-term hygiene; recovering from signing key compromise | Sudden rotation disrupts reliability; gradual rotation (dual-key overlap) adds complexity; hardware devices with OTP key storage have very limited rotation budget |
| Deny list | A list of version identifiers (hashes or labels) that components refuse to install | Quick incident response; blocking specific bad versions | Vulnerable to "unzipping" (incremental rollback through versions not yet on the list); list grows without bound; garbage collection is operationally complex |
| Minimum acceptable security version number (downgrade prevention) | Each release carries a security version number (SVN); a separately maintained floor value (stored in component state, not inside the release) prevents installation of releases with an SVN below the floor | Permanent exclusion of vulnerable version ranges; garbage-collecting deny list entries | Floor value is ratcheted forward by new releases (not by humans directly); requires ordered version values (not cryptographic hashes) |
Start with key rotation — all healthy organizations should implement key rotation regardless of the other choices. It is foundational cryptographic hygiene and validates the rotation process before it is needed in an emergency.
Add deny lists for incident response velocity. During an active incident, quickly appending a version identifier to a deny list is faster than raising the minimum acceptable security version number floor. Use deny lists as a rapid-response tool; use downgrade prevention as the cleanup mechanism afterward.
Add downgrade prevention (minimum acceptable security version number) when:
For self-updating components, combine all three where feasible, but introduce one at a time. Each mechanism adds operational complexity. Introducing all three simultaneously makes it difficult to identify bugs or corner cases in any individual mechanism.
Deny list storage: Encode the deny list outside the self-updating component (ComponentState[DenyList], not Release[DenyList]). A list stored inside the component is lost during updates and does not survive component replacement. The external list is the union of entries from all releases installed on that component.
Minimum acceptable security version number floor storage: Store the floor value in external component state (ComponentState[MASVN]), not inside any release. The floor value is ratcheted forward automatically: each time a new release initializes, it compares its own MASVN against the stored floor and updates the floor to the higher value. This means the floor only rises — it cannot be lowered by installing an older release.
Key rotation for hardware: Hardware devices with one-time-programmable (OTP) memory for storing public keys have a very limited key storage budget. Plan the number of key rotations needed across the device's lifetime before finalizing the hardware design. For spare parts that may have old firmware, support multiple signatures per release (old key and new key simultaneously) during the rotation transition window.
Output for this step: Selected mechanism(s) per component, with rationale. Implementation notes for deny list storage location and minimum acceptable security version number floor management.
Recovery speed and recovery risk are in direct tension: deploying changes faster reduces the window of exposure, but also reduces the time available for testing and increases the risk of deploying a broken patch. The correct architectural response is to decouple the ability to change quickly from the policy governing how quickly changes happen.
Design the update system to operate at the maximum conceivable speed. Then add a separate, independent rate-limiting service that constrains the actual rate of change according to current policy.
Why: If the rollout system's speed is determined by its internal policy parameters, then changing the policy requires modifying the rollout system — which is a code change subject to its own review, build, and release cycle. This makes emergency acceleration dependent on the system that is already under stress. Separating rate from action means that responding to an emergency requires only changing a rate limit, not rewriting infrastructure.
Design the rate-limiting service as:
Rate-limit token design: The rate-limiting service issues short-lived cryptographic tokens asserting that it has reviewed and approved a change at a certain time. The rollout actuator validates these tokens before applying changes. This communicates architecturally that change actuation is decoupled from change rate governance.
If the system uses epoch or version advancement (see Step 4 on time dependencies), hardcode a backstop rate limit for how frequently the epoch or version can advance — for example, no more than once per second for a 64-bit integer counter. This is an exception to the "make policy external" principle, but is justified: it is difficult to imagine a legitimate reason to advance system state more than once per second, and the backstop prevents an adversary with temporary system control from forcing an epoch rollover that renders the version floor mechanism useless.
Output for this step: Rate-limiting service design, token mechanism, and backstop rate limit value for epoch/version advancement if applicable.
Wall-clock time is a form of external state that the system cannot control. Recovery operations — replaying transactions, validating certificates, correlating logs across systems — are all vulnerable to clock skew, misconfigured NTP, certificate expiration edge cases, and leap-second bugs.
The fundamental problem: When recovering from a crash or rolling back to a previous state, any system component that checks wall-clock time may find that time has moved in unexpected ways. A recovery involving digitally signed transactions may fail if those transactions were signed by certificates that have since expired. Rolling back a database may require transaction replay — which fails if the database expects monotonically increasing timestamps.
For each place in the system where wall-clock time is used, evaluate replacement with one of:
| Alternative | How it works | When to use |
|---|---|---|
| --- | --- | --- |
| Rates | Express time-dependent limits as rates (events per second) rather than absolute timestamps | For throttling, rate limiting, and frequency policies |
| Epoch or version advancement | A monotonically increasing integer counter that advances by explicit policy action, not by clock progression | For validity windows, certificate freshness, ordering of events across distributed components |
| Validity lists | Explicit lists of valid-versus-revoked items, pushed periodically by a controlled process | For certificate validity, credential validity, and key status (see Step 5 for revocation design) |
Why epoch advancement beats wall-clock time for validity: With epoch advancement, certificates age without the system being tempted to skip certificate validity checking. Because you can halt epoch advancement in an emergency (freeze the epoch counter while investigating an issue), and because the validity system depends on your controlled push mechanism rather than on coordinated clocks across the fleet, recovery is not complicated by time.
Caution on epoch rollover: Aggressively incremented epoch values can roll over or overflow. Choose a sufficiently large integer range (64-bit) and apply a backstop rate limit (see Step 3b) to prevent an adversary from forcing rollover.
Flag for removal or replacement:
Exception: Wall-clock time is appropriate for deliberately time-bounded access — for example, requiring employees to re-authenticate daily, or enforcing a time-limited access grant. In these cases, design a repair path for the time-validation system that does not itself depend on wall-clock time.
Output for this step: List of wall-clock time dependencies in the system, replacements selected for each, and any legitimate exceptions with a documented repair path.
A revocation mechanism stops access by entities whose credentials have been compromised. It is most valuable during an active compromise — and that is precisely when it is most likely to be exercised under stress and with incomplete information. Design it before the incident.
| Model | How it works | Risk |
|---|---|---|
| --- | --- | --- |
| Centralized validity database | Every system checks credential validity against a central database before allowing access | If the database is down, all dependent systems fail; there is strong temptation to fail open, which creates an attack vector (denial-of-service the validity database to revalidate revoked credentials) |
| Locally cached revocation list | A revocation list is pushed periodically to all nodes; nodes use their local cache to make validity decisions | Nodes proceed on best-available information; revocation is eventually consistent rather than instantaneous; no single point of failure for access |
Recommendation: Prefer locally cached revocation list distribution over a centralized validity database. The centralized database creates a single point of failure and introduces the temptation to fail open. Distributing revocation data to nodes gives each node independence while maintaining eventual revocation of compromised credentials.
If the revocation service is unavailable or a node cannot reach it:
When updating a Key Revocation List (KRL) file, a naive implementation — blindly replacing the old file with the new one — allows a single push to revoke every valid credential in the infrastructure. An attacker with partial system control can use this against you.
Safeguard: Require each node to evaluate a new KRL before applying it. A KRL that would revoke the node's own credentials is refused. This guarantees that even a malicious KRL push can affect at most half of the fleet, preserving a recovery base.
Why this matters for recovery: Half-machine protection means the worst-case blast radius of a revocation push is bounded. You can recover the remaining half of the infrastructure and use it to remediate the first half — far easier than recovering everything from scratch.
The temptation during incident response is to build a separate, faster emergency revocation list to supplement the normal one. Avoid this: rarely-used mechanisms are less likely to work when most needed, and a separate emergency mechanism adds complexity without proportional benefit.
Instead: Shard the normal revocation list. Revoking a credential during an emergency requires updating only the relevant shard, not the entire list. Because the system always uses a multi-part revocation list (even under normal conditions), the mechanism is exercised regularly and is reliable under emergency conditions.
Explicit revocation removes the dependency on accurate time for certificate validation. A certificate that is "expired" by wall-clock time but present on no revocation list is effectively valid under a pure revocation model. This eliminates the failure mode where clock skew accidentally invalidates legitimate certificates during a recovery, or where an attacker manipulates clocks to revalidate revoked certificates.
Output for this step: Revocation distribution model, safe failure behavior definition, self-protection safeguard for KRL updates, sharding strategy, and whether wall-clock time is eliminated from certificate validity.
When normal access methods are completely unavailable — the SSO service is down, the VPN is broken, the primary credential service is unreachable — responders must still be able to act. Emergency access infrastructure is what makes recovery possible when the access path itself has failed.
Emergency access cannot depend on the same dynamic credential services as normal access. Design an alternative access path that:
Credential lifetime tradeoff: Short-lived emergency credentials enforce good security hygiene, but if the outage outlasts the credential lifetime, responders are locked out. Set credential lifetime to exceed the anticipated maximum outage duration, even though this extends the window during which a compromised credential is valid.
Emergency network access must survive failures in the layers outside your control:
Why geographic distribution matters: During a global outage, global coordination may be impossible. If each geographic region can begin recovery independently on the infrastructure it can reach, recovery propagates outward from working points rather than waiting for a globally coordinated restart.
Emergency communication channels must be evaluated for availability when the primary chat or collaboration service is itself unavailable or compromised:
Emergency access technology provides no benefit if responders do not know how to use it under stress. The cognitive load of switching to unfamiliar tools during a high-stress incident can render emergency access effectively unusable.
Minimize the distinction between normal and emergency processes: When emergency access uses the same underlying tools as normal access (for example, an emergency mode added to the browser extension used for normal remote access), responders can draw on habit rather than recall during an incident.
Require regular exercises:
Why validation is non-optional: Humans, not technology, are most likely to render emergency access ineffective. A responder who has not used the emergency access path in six months and has let their credentials expire provides no incident response capacity, even if the infrastructure is sound.
Output for this step: Emergency access design covering access controls (credential model, provisioning, lifetime), network (static vs. dynamic, geographic distribution), communications (tool selection, dependency audit), and responder validation plan.
Synthesize findings from Steps 1–6 into a structured document.
Document sections:
Three rollback policies exist, and only one is correct:
No organization should accept either extreme for production systems.
Emergency push systems that are separate from the normal rollout pipeline will not work when needed, because they are never exercised. Design the emergency push path to be the normal push path operating at maximum rate with adjusted rate-limit policy. The same principle applies to emergency access — use the same platform with an emergency mode rather than a separate tool.
The rate-of-change policy is not part of the update mechanism itself. It is a separate service. This separation is what allows a security team to respond to an emergency by adjusting a rate limit rather than rewriting and redeploying rollout infrastructure during the incident.
A revocation mechanism that fails open under load is an attack vector, not a safety mechanism. A distributed denial-of-service attack on the revocation service reactivates all revoked credentials for the duration of the attack. Use locally cached revocation state so that node decisions are independent of real-time service availability.
Recovery from any error category — random, accidental, malicious, or software — requires returning to a known good state. The prerequisite for recovery is knowing what that state is. Encode the intended state per service, per host, per device. Monitor deviations continuously. Recovery is then the automated or manual process of repairing deviations — not a heroic reconstruction from incomplete memory.
Google initially built rollout tooling around a monthly cadence for its internal Linux base image, bundling timing and content together. As security patches multiplied (each package needing its own patch cadence), the monthly assumption broke the tooling's design.
The solution separated three concerns: the rollout rate and timing, the configuration store defining each machine's target state, and the rollout actuator applying updates per machine. Each concern was developed and updated independently. Emergency releases became a matter of adjusting rate limits on the existing rollout service, not rewriting infrastructure. The result: simpler, more useful, and safer — and the same tooling handles both normal cadence and emergency releases.
ComponentState[MASVN] advances to 5. Release i−1 (SVN 4) can no longer be installed on any component that has received release i+1.Effect: The security patch is now mandatory. Rollback to the vulnerable version is permanently blocked — on a per-component basis, without global coordination.
A KRL push is intercepted or corrupted such that the new KRL would revoke every valid SSH credential in the infrastructure. Without self-protection, a single push takes down the entire fleet.
With self-protection: each node evaluates the incoming KRL before applying it. Any KRL that would revoke the node's own credentials is refused. The worst-case outcome is that half of the fleet refuses the KRL push. The other half remains functional and can be used as a recovery base to remediate the first half.
Google's corporate network uses SSO, short-term credentials, and multi-party authorization. A failure in any of these components could prevent all employee remote access, including incident responders.
To address this, offline credentials were provisioned and alternative authentication algorithms deployed that do not depend on SSO. These credentials are pre-provisioned on a fixed schedule (not activated on demand), ensuring they are valid at incident start rather than just being created. They are restricted to the minimum set of responders who need immediate access, while the broader organization waits for the normal access control services to be restored.
resilience-and-blast-radius-design)references/rollback-mechanism-comparison.md for extended pseudocode examples for deny list and downgrade prevention implementationreferences/emergency-access-checklist.md for a per-responder emergency credential provisioning and exercise checklistCross-references:
resilience-and-blast-radius-design — failure domain design, compartmentalization axes, and low-dependency component patterns that emergency access must satisfysecurity-change-rollout-planning — rate-of-change tradeoffs during security patch deployment and rollout acceleration triggersThis skill is licensed under CC-BY-SA-4.0.
Source: BookForge — Building Secure and Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield.
Install related skills from ClawhHub:
clawhub install bookforge-resilience-and-blast-radius-designOr install the full book set from GitHub: bookforge-skills
共 1 个版本
暂无安全检测报告