Recovery Mechanism Design

When to Use

Apply this skill when:

Designing a new component's update or rollback architecture from scratch
Reviewing an existing release pipeline for the security-reliability tradeoff: can the system roll back safely, and does rolling back reintroduce vulnerabilities?
Defining rollback policy for self-updating firmware or system software (package managers, BIOS, bootloaders, embedded firmware)
Designing a credential or certificate revocation mechanism
Provisioning emergency access infrastructure before an incident — not during one

The core pattern: Recovery is not a single operation. It is five cooperating design decisions made before any incident occurs: (1) choosing how rollbacks are controlled to prevent reintroducing vulnerabilities, (2) decoupling rollout velocity from rollout policy so speed can be adjusted without code changes, (3) eliminating wall-clock time dependencies that break recovery under clock skew, (4) designing an explicit revocation mechanism that distributes state rather than requiring real-time coordination, and (5) provisioning emergency access infrastructure that survives normal-path failures.

This skill depends on resilience-and-blast-radius-design. The compartmentalization and failure domain concepts from that skill apply here: revocation should be compartmentalized (half-machine self-protection during KRL updates), and emergency access must be designed as a low-dependency component that avoids the failure domains of the primary access path.

Before starting, confirm you have:

A description of the component(s) to be designed for recovery
Whether updates are cryptographically signed (required for the rollback control mechanisms)
The existing release pipeline, if one exists
Whether the component is self-updating (updates itself by overwriting its own executable or firmware image)

Context and Input Gathering

Required Context

Component description: What software or firmware component is being designed? Is it application software, system software (package manager, OS), or firmware (BIOS, NIC, embedded controller)?
Update signature model: Are updates cryptographically signed? Does the signature cover the component image and its version metadata?
Self-updating flag: Does the component update itself (overwriting its own executable), or is it updated by an external package manager?
Credential/certificate usage: What credentials or certificates does the system use? Are they issued by an internal certificate authority? What is their current validity model (expiration time, or revocation list)?

Observable Context

If a system description is provided, scan for:

Hard-coded rollback policies ("always allow" or "never allow") — both are known failure modes
Certificate or credential validity tied to wall-clock expiration without a revocation fallback
A single global rollout rate limit embedded in the rollout actuator — policy and mechanism are coupled
Emergency access paths that depend on the same SSO or credential service as normal access — no isolation
Revocation mechanisms that fail open when the revocation service is unavailable

Default Assumptions

If update signatures are not mentioned: treat the system as unsigned and note that the rollback control mechanisms in Step 2 require signing as a prerequisite
If rollback policy is unspecified: treat the system as "allow arbitrary rollbacks" and apply the full mechanism selection process
If no revocation mechanism exists: start with locally cached revocation list distribution rather than a centralized validity database
If no emergency access exists: apply the emergency access design in Step 5, beginning with access controls (highest priority)

Process

Step 1 — Classify the Component and Its Recovery Context

Before selecting mechanisms, establish the recovery context that governs all downstream decisions.

1a — Component Type

Type	Characteristics	Recovery complexity
---	---	---
Application software	Updated by an external package manager; killed and restarted by external tooling	Lower — rollback is managed outside the component
System software	Self-updating (package management daemon, OS image); updates itself by overwriting its executable	Higher — must be able to update itself without losing the ability to continue updating
Firmware	Self-updating with hardware constraints; may use one-time-programmable memory for version tracking; spare parts introduce old-version risk	Highest — rollback may be physically infeasible; hardware key storage is limited

Why: Self-updating components present the hardest recovery design problem. They can actively prevent themselves from being updated if maliciously modified. The rollback control mechanism must work even for these components, but the "intended behavior" (what rollback means) is itself ambiguous.

1b — Known State Definition

Recovery means returning to a known good state. Define that state before choosing mechanisms:

What constitutes the intended state of this component? (Version number, cryptographic hash of the image, configuration values, firmware parameters)
Is the intended state captured and continuously monitored against the deployed state?
Does state tracking cover in-memory state (daemon configuration loaded at startup) as well as on-disk state?

Why: The more thoroughly the intended state is encoded and compared against the actual state, the easier it is to detect deviations, trigger automated repair, and confirm recovery is complete. Systems that do not know their intended state cannot verify that recovery succeeded.

Output for this step: Component type classification and intended state definition.

Step 2 — Select Rollback Control Mechanism

For self-updating components (and application software where security patch coverage is required), a bare "allow arbitrary rollbacks" policy reintroduces known vulnerabilities. A bare "never allow rollbacks" policy removes the path to a stable state when a bad update is deployed. The three mechanisms below provide controlled middle ground.

Prerequisite for all three mechanisms: Updates are cryptographically signed and the signature covers the component image and its version metadata.

The Three-Mechanism Comparison

Mechanism	How it works	Best for	Key limitation
---	---	---	---
Key rotation	Rotating the update signing key invalidates older releases that were signed with the old key	Long-term hygiene; recovering from signing key compromise	Sudden rotation disrupts reliability; gradual rotation (dual-key overlap) adds complexity; hardware devices with OTP key storage have very limited rotation budget
Deny list	A list of version identifiers (hashes or labels) that components refuse to install	Quick incident response; blocking specific bad versions	Vulnerable to "unzipping" (incremental rollback through versions not yet on the list); list grows without bound; garbage collection is operationally complex
Minimum acceptable security version number (downgrade prevention)	Each release carries a security version number (SVN); a separately maintained floor value (stored in component state, not inside the release) prevents installation of releases with an SVN below the floor	Permanent exclusion of vulnerable version ranges; garbage-collecting deny list entries	Floor value is ratcheted forward by new releases (not by humans directly); requires ordered version values (not cryptographic hashes)

Mechanism Selection Decision Framework

Start with key rotation — all healthy organizations should implement key rotation regardless of the other choices. It is foundational cryptographic hygiene and validates the rotation process before it is needed in an emergency.

Add deny lists for incident response velocity. During an active incident, quickly appending a version identifier to a deny list is faster than raising the minimum acceptable security version number floor. Use deny lists as a rapid-response tool; use downgrade prevention as the cleanup mechanism afterward.

Add downgrade prevention (minimum acceptable security version number) when:

The deny list has grown large enough to be operationally burdensome
You want to permanently exclude all versions below a security milestone, rather than maintaining individual entries
You are designing the system from scratch and want the cleanest long-term architecture

For self-updating components, combine all three where feasible, but introduce one at a time. Each mechanism adds operational complexity. Introducing all three simultaneously makes it difficult to identify bugs or corner cases in any individual mechanism.

Implementation Notes

Deny list storage: Encode the deny list outside the self-updating component (ComponentState[DenyList], not Release[DenyList]). A list stored inside the component is lost during updates and does not survive component replacement. The external list is the union of entries from all releases installed on that component.

Minimum acceptable security version number floor storage: Store the floor value in external component state (ComponentState[MASVN]), not inside any release. The floor value is ratcheted forward automatically: each time a new release initializes, it compares its own MASVN against the stored floor and updates the floor to the higher value. This means the floor only rises — it cannot be lowered by installing an older release.

Key rotation for hardware: Hardware devices with one-time-programmable (OTP) memory for storing public keys have a very limited key storage budget. Plan the number of key rotations needed across the device's lifetime before finalizing the hardware design. For spare parts that may have old firmware, support multiple signatures per release (old key and new key simultaneously) during the rotation transition window.

Output for this step: Selected mechanism(s) per component, with rationale. Implementation notes for deny list storage location and minimum acceptable security version number floor management.

Step 3 — Design the Rate-of-Change Policy

Recovery speed and recovery risk are in direct tension: deploying changes faster reduces the window of exposure, but also reduces the time available for testing and increases the risk of deploying a broken patch. The correct architectural response is to decouple the ability to change quickly from the policy governing how quickly changes happen.

3a — Build Update Velocity Independent of Update Policy

Design the update system to operate at the maximum conceivable speed. Then add a separate, independent rate-limiting service that constrains the actual rate of change according to current policy.

Why: If the rollout system's speed is determined by its internal policy parameters, then changing the policy requires modifying the rollout system — which is a code change subject to its own review, build, and release cycle. This makes emergency acceleration dependent on the system that is already under stress. Separating rate from action means that responding to an emergency requires only changing a rate limit, not rewriting infrastructure.

Design the rate-limiting service as:

An independent, standalone microservice with minimal dependencies
The single point of authority for approving changes at a given rate
A collector of change logs for auditing
Simple enough to test rigorously in isolation

Rate-limit token design: The rate-limiting service issues short-lived cryptographic tokens asserting that it has reviewed and approved a change at a certain time. The rollout actuator validates these tokens before applying changes. This communicates architecturally that change actuation is decoupled from change rate governance.

3b — Backstop Rate Limit for Epoch/Version Advancement

If the system uses epoch or version advancement (see Step 4 on time dependencies), hardcode a backstop rate limit for how frequently the epoch or version can advance — for example, no more than once per second for a 64-bit integer counter. This is an exception to the "make policy external" principle, but is justified: it is difficult to imagine a legitimate reason to advance system state more than once per second, and the backstop prevents an adversary with temporary system control from forcing an epoch rollover that renders the version floor mechanism useless.

Output for this step: Rate-limiting service design, token mechanism, and backstop rate limit value for epoch/version advancement if applicable.

Step 4 — Eliminate Wall-Clock Time Dependencies

Wall-clock time is a form of external state that the system cannot control. Recovery operations — replaying transactions, validating certificates, correlating logs across systems — are all vulnerable to clock skew, misconfigured NTP, certificate expiration edge cases, and leap-second bugs.

The fundamental problem: When recovering from a crash or rolling back to a previous state, any system component that checks wall-clock time may find that time has moved in unexpected ways. A recovery involving digitally signed transactions may fail if those transactions were signed by certificates that have since expired. Rolling back a database may require transaction replay — which fails if the database expects monotonically increasing timestamps.

4a — Replace Wall-Clock Time with Rate-Based or Epoch-Based Alternatives

For each place in the system where wall-clock time is used, evaluate replacement with one of:

Alternative	How it works	When to use
---	---	---
Rates	Express time-dependent limits as rates (events per second) rather than absolute timestamps	For throttling, rate limiting, and frequency policies
Epoch or version advancement	A monotonically increasing integer counter that advances by explicit policy action, not by clock progression	For validity windows, certificate freshness, ordering of events across distributed components
Validity lists	Explicit lists of valid-versus-revoked items, pushed periodically by a controlled process	For certificate validity, credential validity, and key status (see Step 5 for revocation design)

Why epoch advancement beats wall-clock time for validity: With epoch advancement, certificates age without the system being tempted to skip certificate validity checking. Because you can halt epoch advancement in an emergency (freeze the epoch counter while investigating an issue), and because the validity system depends on your controlled push mechanism rather than on coordinated clocks across the fleet, recovery is not complicated by time.

Caution on epoch rollover: Aggressively incremented epoch values can roll over or overflow. Choose a sufficiently large integer range (64-bit) and apply a backstop rate limit (see Step 3b) to prevent an adversary from forcing rollover.

4b — Audit for Wall-Clock Time Code Smells

Flag for removal or replacement:

Fixed dates or time offsets in code (a "time bomb" — the code changes behavior at a specific real-world date without any human action)
Certificate expiration times set far in the future ("not our problem anymore" engineering)
Unauthenticated NTP dependencies (an attacker who controls NTP can manipulate time-based validity)
Database recovery processes that require monotonically increasing timestamps (vulnerable to rollback-induced time inversion)

Exception: Wall-clock time is appropriate for deliberately time-bounded access — for example, requiring employees to re-authenticate daily, or enforcing a time-limited access grant. In these cases, design a repair path for the time-validation system that does not itself depend on wall-clock time.

Output for this step: List of wall-clock time dependencies in the system, replacements selected for each, and any legitimate exceptions with a documented repair path.

Step 5 — Design the Revocation Mechanism

A revocation mechanism stops access by entities whose credentials have been compromised. It is most valuable during an active compromise — and that is precisely when it is most likely to be exercised under stress and with incomplete information. Design it before the incident.

5a — Choose the Distribution Model

Model	How it works	Risk
---	---	---
Centralized validity database	Every system checks credential validity against a central database before allowing access	If the database is down, all dependent systems fail; there is strong temptation to fail open, which creates an attack vector (denial-of-service the validity database to revalidate revoked credentials)
Locally cached revocation list	A revocation list is pushed periodically to all nodes; nodes use their local cache to make validity decisions	Nodes proceed on best-available information; revocation is eventually consistent rather than instantaneous; no single point of failure for access

Recommendation: Prefer locally cached revocation list distribution over a centralized validity database. The centralized database creates a single point of failure and introduces the temptation to fail open. Distributing revocation data to nodes gives each node independence while maintaining eventual revocation of compromised credentials.

5b — Define Safe Failure Behavior

If the revocation service is unavailable or a node cannot reach it:

Do not fail open. Failing open when the revocation service is unavailable lets an attacker who conducts a denial-of-service attack on the revocation service reuse revoked credentials during the outage.
Use the cached list. Nodes should proceed based on their most recent cached revocation list, with monitoring to detect nodes that have not received a recent update.

5c — Revocation at Scale: Self-Protection During KRL Updates

When updating a Key Revocation List (KRL) file, a naive implementation — blindly replacing the old file with the new one — allows a single push to revoke every valid credential in the infrastructure. An attacker with partial system control can use this against you.

Safeguard: Require each node to evaluate a new KRL before applying it. A KRL that would revoke the node's own credentials is refused. This guarantees that even a malicious KRL push can affect at most half of the fleet, preserving a recovery base.

Why this matters for recovery: Half-machine protection means the worst-case blast radius of a revocation push is bounded. You can recover the remaining half of the infrastructure and use it to remediate the first half — far easier than recovering everything from scratch.

5d — Avoid Special-Purpose Emergency Revocation Lists

The temptation during incident response is to build a separate, faster emergency revocation list to supplement the normal one. Avoid this: rarely-used mechanisms are less likely to work when most needed, and a separate emergency mechanism adds complexity without proportional benefit.

Instead: Shard the normal revocation list. Revoking a credential during an emergency requires updating only the relevant shard, not the entire list. Because the system always uses a multi-part revocation list (even under normal conditions), the mechanism is exercised regularly and is reliable under emergency conditions.

5e — Remove Wall-Clock Time from Certificate Validation

Explicit revocation removes the dependency on accurate time for certificate validation. A certificate that is "expired" by wall-clock time but present on no revocation list is effectively valid under a pure revocation model. This eliminates the failure mode where clock skew accidentally invalidates legitimate certificates during a recovery, or where an attacker manipulates clocks to revalidate revoked certificates.

Output for this step: Revocation distribution model, safe failure behavior definition, self-protection safeguard for KRL updates, sharding strategy, and whether wall-clock time is eliminated from certificate validity.

Step 6 — Design Emergency Access Infrastructure

When normal access methods are completely unavailable — the SSO service is down, the VPN is broken, the primary credential service is unreachable — responders must still be able to act. Emergency access infrastructure is what makes recovery possible when the access path itself has failed.

6a — Access Controls

Emergency access cannot depend on the same dynamic credential services as normal access. Design an alternative access path that:

Uses offline-provisioned credentials (not derived from SSO or federated identity providers that may be unavailable during the outage)
Achieves equivalent access policy security, even if with reduced convenience or features
Is restricted to the minimum set of people who need it immediately (limit attack surface, operational cost, and usability degradation)
Has pre-provisioned credentials with explicitly managed lifetime — credentials issued proactively on a fixed schedule (not activated on demand at incident start) avoid the race condition where a credential expires just as an outage begins

Credential lifetime tradeoff: Short-lived emergency credentials enforce good security hygiene, but if the outage outlasts the credential lifetime, responders are locked out. Set credential lifetime to exceed the anticipated maximum outage duration, even though this extends the window during which a compromised credential is valid.

6b — Network

Emergency network access must survive failures in the layers outside your control:

Prefer static network access controls over dynamic protocols (software-defined networking and dynamic routing protocols have dependencies that may be unavailable during an outage)
Implement sufficient monitoring to detect where inside the network access breaks — not just that it is broken, but at which layer
For self-contained emergency access, deploy geographically distributed access points so that responders in different regions can independently reach their nearest rack and radiate recovery outward, rather than requiring global coordination

Why geographic distribution matters: During a global outage, global coordination may be impossible. If each geographic region can begin recovery independently on the infrastructure it can reach, recovery propagates outward from working points rather than waiting for a globally coordinated restart.

6c — Communications

Emergency communication channels must be evaluated for availability when the primary chat or collaboration service is itself unavailable or compromised:

Select a communication technology with as few dependencies as possible — the tool must be reachable by responders even if systems outside your control are broken
If the communication system is outsourced, confirm that it is reachable even when your infrastructure layers are broken
Consider: is the system reachable if an attacker is eavesdropping on your network? Does it provide sufficient authentication and confidentiality for incident response use?

6d — Responder Habits and Continuous Validation

Emergency access technology provides no benefit if responders do not know how to use it under stress. The cognitive load of switching to unfamiliar tools during a high-stress incident can render emergency access effectively unusable.

Minimize the distinction between normal and emergency processes: When emergency access uses the same underlying tools as normal access (for example, an emergency mode added to the browser extension used for normal remote access), responders can draw on habit rather than recall during an incident.

Require regular exercises:

Define a minimum period between required emergency access exercises for each responder
Integrate emergency access procedures into normal on-call duties where possible
Track credential-refresh and training completion; alert when a responder's emergency credentials are approaching expiration
Make practicing emergency access mandatory if regular equivalent activity is not occurring

Why validation is non-optional: Humans, not technology, are most likely to render emergency access ineffective. A responder who has not used the emergency access path in six months and has let their credentials expire provides no incident response capacity, even if the infrastructure is sound.

Output for this step: Emergency access design covering access controls (credential model, provisioning, lifetime), network (static vs. dynamic, geographic distribution), communications (tool selection, dependency audit), and responder validation plan.

Step 7 — Produce the Recovery Mechanism Design Document

Synthesize findings from Steps 1–6 into a structured document.

Document sections:

Component inventory and types — which components are application software, system software, or firmware; whether updates are signed; intended state definition per component
Rollback control strategy — selected mechanism(s) per component (key rotation / deny list / downgrade prevention), rationale, and implementation notes (storage location for deny list and version floor)
Rate-of-change policy — rate-limiting service design, token mechanism, backstop rate limit, and emergency acceleration procedure (change rate limit only, not rollout infrastructure)
Time dependency audit — wall-clock time dependencies identified, replacements selected (rates / epoch advancement / validity lists), legitimate exceptions with repair path
Revocation mechanism — distribution model, safe failure behavior, KRL self-protection safeguard, sharding strategy, wall-clock time elimination decision
Emergency access plan — access control credential model and lifetime, network access design, communications tool selection, responder exercise schedule and validation plan

Key Principles

The Rollback Policy Triangle

Three rollback policies exist, and only one is correct:

Allow arbitrary rollbacks: Insecure. Any rollback may reintroduce a known vulnerability. Older versions have more stable, weaponized exploits.
Never allow rollbacks: Unreliable. A bad update cannot be undone. The build system must always generate a valid forward target.
Controlled rollback with mechanism: Deploy one or more of the three mechanisms to hold the floor while preserving the ability to recover from bad updates.

No organization should accept either extreme for production systems.

Emergency Systems Must Be Normal Systems

Emergency push systems that are separate from the normal rollout pipeline will not work when needed, because they are never exercised. Design the emergency push path to be the normal push path operating at maximum rate with adjusted rate-limit policy. The same principle applies to emergency access — use the same platform with an emergency mode rather than a separate tool.

Rate Separates Speed from Content

The rate-of-change policy is not part of the update mechanism itself. It is a separate service. This separation is what allows a security team to respond to an emergency by adjusting a rate limit rather than rewriting and redeploying rollout infrastructure during the incident.

Revocation Must Not Fail Open

A revocation mechanism that fails open under load is an attack vector, not a safety mechanism. A distributed denial-of-service attack on the revocation service reactivates all revoked credentials for the duration of the attack. Use locally cached revocation state so that node decisions are independent of real-time service availability.

Know the Intended State

Recovery from any error category — random, accidental, malicious, or software — requires returning to a known good state. The prerequisite for recovery is knowing what that state is. Encode the intended state per service, per host, per device. Monitor deviations continuously. Recovery is then the automated or manual process of repairing deviations — not a heroic reconstruction from incomplete memory.

Examples

Example 1 — Rate-Limiting Service Decoupling (Google Linux Distribution Rollout)

Google initially built rollout tooling around a monthly cadence for its internal Linux base image, bundling timing and content together. As security patches multiplied (each package needing its own patch cadence), the monthly assumption broke the tooling's design.

The solution separated three concerns: the rollout rate and timing, the configuration store defining each machine's target state, and the rollout actuator applying updates per machine. Each concern was developed and updated independently. Emergency releases became a matter of adjusting rate limits on the existing rollout service, not rewriting infrastructure. The result: simpler, more useful, and safer — and the same tooling handles both normal cadence and emergency releases.

Example 2 — Minimum Acceptable Security Version Number in a Three-Release Sequence

Release i−1 runs with SVN 4, MASVN 4. A vulnerability is discovered.
Release i ships SVN 5 (security patch), but MASVN remains 4. The patch is deployed and proven stable.
Release i+1 ships SVN 5, MASVN 5. Once deployed, ComponentState[MASVN] advances to 5. Release i−1 (SVN 4) can no longer be installed on any component that has received release i+1.

Effect: The security patch is now mandatory. Rollback to the vulnerable version is permanently blocked — on a per-component basis, without global coordination.

Example 3 — Revocation List Self-Protection During KRL Update

A KRL push is intercepted or corrupted such that the new KRL would revoke every valid SSH credential in the infrastructure. Without self-protection, a single push takes down the entire fleet.

With self-protection: each node evaluates the incoming KRL before applying it. Any KRL that would revoke the node's own credentials is refused. The worst-case outcome is that half of the fleet refuses the KRL push. The other half remains functional and can be used as a recovery base to remediate the first half.

Example 4 — Emergency Access with Locally Isolated Credentials

Google's corporate network uses SSO, short-term credentials, and multi-party authorization. A failure in any of these components could prevent all employee remote access, including incident responders.

To address this, offline credentials were provisioned and alternative authentication algorithms deployed that do not depend on SSO. These credentials are pre-provisioned on a fixed schedule (not activated on demand), ensuring they are valid at incident start rather than just being created. They are restricted to the minimum set of responders who need immediate access, while the broader organization waits for the normal access control services to be restored.

References

Chapter 9: Design for Recovery, Building Secure and Reliable Systems (pp. 183–215) — primary source for all mechanisms in this skill
Chapter 8: Design for Resilience — compartmentalization, failure domains, and low-dependency component design (referenced by resilience-and-blast-radius-design)
Chapter 7: Design for a Changing Landscape — rate-of-change tradeoffs during security vulnerability response
Chapter 17: Identifying and Responding to Incidents — when to use revocation during active compromise
Chapter 18: Recovery and Aftermath — full complexity of recovery from a serious targeted compromise
Site Reliability Engineering book, Chapter 7 — automation and host management patterns for intended state tracking
See references/rollback-mechanism-comparison.md for extended pseudocode examples for deny list and downgrade prevention implementation
See references/emergency-access-checklist.md for a per-responder emergency credential provisioning and exercise checklist

Cross-references:

resilience-and-blast-radius-design — failure domain design, compartmentalization axes, and low-dependency component patterns that emergency access must satisfy
security-change-rollout-planning — rate-of-change tradeoffs during security patch deployment and rollout acceleration triggers

License

This skill is licensed under CC-BY-SA-4.0.

Source: BookForge — Building Secure and Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, Adam Stubblefield.

Related BookForge Skills

Install related skills from ClawhHub:

clawhub install bookforge-resilience-and-blast-radius-design

Or install the full book set from GitHub: bookforge-skills

Recovery Mechanism Design

概述

Recovery Mechanism Design

When to Use

Context and Input Gathering

Required Context

Observable Context

Default Assumptions

Process

Step 1 — Classify the Component and Its Recovery Context

1a — Component Type

1b — Known State Definition

Step 2 — Select Rollback Control Mechanism

The Three-Mechanism Comparison

Mechanism Selection Decision Framework

Implementation Notes

Step 3 — Design the Rate-of-Change Policy

3a — Build Update Velocity Independent of Update Policy

3b — Backstop Rate Limit for Epoch/Version Advancement

Step 4 — Eliminate Wall-Clock Time Dependencies

4a — Replace Wall-Clock Time with Rate-Based or Epoch-Based Alternatives

4b — Audit for Wall-Clock Time Code Smells

Step 5 — Design the Revocation Mechanism

5a — Choose the Distribution Model

5b — Define Safe Failure Behavior

5c — Revocation at Scale: Self-Protection During KRL Updates

5d — Avoid Special-Purpose Emergency Revocation Lists

5e — Remove Wall-Clock Time from Certificate Validation

Step 6 — Design Emergency Access Infrastructure

6a — Access Controls

6b — Network

6c — Communications

6d — Responder Habits and Continuous Validation

Step 7 — Produce the Recovery Mechanism Design Document

Key Principles

The Rollback Policy Triangle

Emergency Systems Must Be Normal Systems

Rate Separates Speed from Content

Revocation Must Not Fail Open

Know the Intended State

Examples

Example 1 — Rate-Limiting Service Decoupling (Google Linux Distribution Rollout)

Example 2 — Minimum Acceptable Security Version Number in a Three-Release Sequence

Example 3 — Revocation List Self-Protection During KRL Update

Example 4 — Emergency Access with Locally Isolated Credentials

References

License

Related BookForge Skills

版本历史

安全检测