← 返回
安全合规 中文

Production Readiness

Meta-skill that orchestrates logging, monitoring, error handling, performance, security, deployment, and testing skills to ensure a service is fully production-ready before launch. Use before first deploy, major releases, quarterly reviews, or after incidents.
元技能:编排日志、监控、错误处理、性能、安全、部署与测试,确保服务在上线前完全就绪。适用于首次部署、重大版本、季度评审或事故后。
wpank
安全合规 clawhub v1.0.0 1 版本 99825.3 Key: 无需
★ 0
Stars
📥 1,143
下载
💾 11
安装
1
版本
#latest

概述

Production Readiness (Meta-Skill)

Coordinates all operational concerns into a single readiness review. Instead of duplicating domain expertise, this skill routes to specialized skills and agents for each area, then synthesizes results into a unified go/no-go assessment.

Installation

OpenClaw / Moltbot / Clawbot

npx clawhub@latest install production-readiness

Purpose

Ensure a service is production-ready by systematically checking every operational concern — logging, error handling, performance, security, deployment, testing, and documentation — before traffic hits it.

A production-ready service:

  • Fails gracefully under load and partial outages
  • Observes itself with structured logs, metrics, and traces
  • Recovers automatically from transient failures
  • Communicates health to orchestrators and operators
  • Documents operations so on-call engineers can respond without tribal knowledge

When to Use

TriggerContext
------------------
Before first deployNew service going to production for the first time
Before major releaseSignificant feature or architectural change shipping
Quarterly production reviewScheduled audit of existing services
After incidentPost-incident hardening to prevent recurrence
Dependency upgradeMajor framework, runtime, or infrastructure change
Team handoffTransferring ownership of a service to another team

Orchestration Flow

Run each area sequentially or in parallel. Each step delegates to a specialized skill or agent — this skill does not re-implement their logic.

┌─────────────────────────────────────────────────┐
│              Production Readiness Review         │
├─────────────────────────────────────────────────┤
│                                                  │
│  1. Logging & Observability ──► logging-observability skill
│  2. Error Handling ───────────► error-handling-patterns skill
│  3. Performance ──────────────► performance-agent
│  4. Security ─────────────────► security-review meta-skill
│  5. Deployment ───────────────► deployment-agent + docker-expert skill
│  6. Testing ──────────────────► testing-workflow meta-skill
│  7. Documentation ────────────► /generate-docs command
│                                                  │
│  ──► Synthesize results into go/no-go report     │
└─────────────────────────────────────────────────┘

Step Details

  1. Logging & Observability — Structured logging, log levels, correlation IDs, metrics endpoints, distributed tracing, alerting rules
  2. Error Handling — Global error boundaries, retry policies, dead-letter queues, error classification, user-facing error messages
  3. Performance — Load testing results, P95/P99 latency baselines, memory/CPU profiling, database query analysis, caching strategy
  4. Security — Auth/authz verification, input validation, dependency audit, secrets management, OWASP top-10 review
  5. Deployment — Container hardening, rollback strategy, blue-green/canary configuration, infrastructure-as-code review
  6. Testing — Unit/integration/e2e coverage, contract tests, chaos/failure injection, smoke tests in staging
  7. Documentation — API docs, runbooks, architecture diagrams, on-call playbooks, ADRs for key decisions

Skill Routing Table

ConcernSkill / AgentPath
-----------------------------
Logging & Observabilitylogging-observabilityai/skills/tools/logging-observability/SKILL.md
Error Handlingerror-handling-patternsai/skills/backend/error-handling-patterns/SKILL.md
Performanceperformance-agentai/agents/performance/
Securitysecurity-reviewai/skills/meta/security-review/SKILL.md
Deployment (containers)docker-expertai/skills/devops/docker/SKILL.md
Deployment (pipelines)deployment-agentai/agents/deployment/
Testingtesting-workflowai/skills/testing/testing-workflow/SKILL.md
Rate Limitingrate-limiting-patternsai/skills/backend/rate-limiting-patterns/SKILL.md
Documentation/generate-docsai/commands/documentation/

> Routing rule: Read the target skill first, follow its instructions, then return results here for synthesis.


Production Readiness Checklist

Health & Lifecycle

  • [ ] Health check endpoint (/healthz or /health) returns dependency status
  • [ ] Readiness probe distinguishes "starting" from "ready to serve"
  • [ ] Liveness probe detects deadlocks and unrecoverable states
  • [ ] Graceful shutdown drains in-flight requests before exit
  • [ ] Startup probe allows for slow initialization without false restarts

Resilience

  • [ ] Circuit breakers on all external service calls
  • [ ] Retry with exponential backoff and jitter on transient failures
  • [ ] Rate limiting configured per endpoint and per client
  • [ ] Backpressure mechanisms prevent cascade failures under load
  • [ ] Timeouts set on every outbound call (HTTP, DB, queue)
  • [ ] Bulkhead isolation separates critical from non-critical paths

Configuration & Secrets

  • [ ] All configuration externalized (env vars, config service, or feature flags)
  • [ ] No secrets in code, images, or environment variable defaults
  • [ ] Secrets loaded from a vault (e.g., AWS Secrets Manager, HashiCorp Vault)
  • [ ] Configuration changes do not require redeployment
  • [ ] Feature flags in place for high-risk changes

Data Safety

  • [ ] Backup strategy defined and tested (RPO/RTO documented)
  • [ ] Restore procedure verified in a non-production environment
  • [ ] Database migrations are backward-compatible and reversible
  • [ ] Data retention policies implemented and enforced

Operational Readiness

  • [ ] Runbooks exist for top 5 most likely failure scenarios
  • [ ] SLOs defined (availability, latency, error rate) with error budgets
  • [ ] SLAs communicated to dependent teams or customers
  • [ ] On-call rotation staffed and escalation path documented
  • [ ] Dashboards show golden signals (latency, traffic, errors, saturation)
  • [ ] Alerting rules configured with appropriate thresholds and severity

Maturity Levels

LevelNameRequirements
--------------------------
L1MVPHealth check, basic logging, error handling, manual deploy, unit tests, README
L2StableStructured logging, metrics, graceful shutdown, CI/CD pipeline, integration tests, runbooks
L3ResilientDistributed tracing, circuit breakers, auto-scaling, chaos testing, SLOs, on-call rotation
L4OptimizedAdaptive rate limiting, predictive alerting, canary deploys, full observability, error budgets, postmortem culture

Progression Guidance

  • L1 → L2: Add structured logging, metrics endpoint, and a CI/CD pipeline. Write runbooks for known failure modes.
  • L2 → L3: Instrument distributed tracing. Add circuit breakers to external calls. Define SLOs and set up on-call.
  • L3 → L4: Implement canary deployments. Adopt error budgets. Run regular game days. Build predictive alerting.

Incident Response

On-Call Rotation

  • Minimum two engineers per rotation (primary + secondary)
  • Handoff includes review of recent deploys, open issues, and known risks
  • Escalation targets defined: primary → secondary → engineering lead → VP Eng

Escalation Matrix

SeverityResponse TimeEscalation AfterStakeholder Notification
---------------------------------------------------------------------
SEV-1 (outage)15 min30 minImmediate — exec + customers
SEV-2 (degraded)30 min1 hourWithin 1 hour — eng lead
SEV-3 (minor)4 hoursNext business dayDaily standup
SEV-4 (cosmetic)Next sprintN/ABacklog

Postmortem Template

## Incident: [Title]
**Date:** YYYY-MM-DD | **Duration:** X hours | **Severity:** SEV-N

### Summary
One-paragraph description of what happened and impact.

### Timeline
- HH:MM — First alert fired
- HH:MM — Engineer paged, investigation started
- HH:MM — Root cause identified
- HH:MM — Mitigation applied
- HH:MM — Full resolution confirmed

### Root Cause
What broke and why. Link to code/config change if applicable.

### Impact
- Users affected: N
- Revenue impact: $X (if applicable)
- SLO budget consumed: X%

### Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Fix X  | @eng  | YYYY-MM-DD | Open |

### Lessons Learned
- What went well
- What went poorly
- Where we got lucky

NEVER Do

  1. NEVER skip health checks — every service must expose health endpoints; no exceptions for "simple" services
  2. NEVER store secrets in code or container images — use a secrets manager; never default env vars with real values
  3. NEVER deploy without a rollback plan — if you cannot roll back in under 5 minutes, you are not ready to deploy
  4. NEVER ignore error budget violations — when the error budget is exhausted, freeze feature work and fix reliability
  5. NEVER treat logging as optional — a service without structured logging is a service you cannot debug at 3 AM
  6. NEVER go to production without runbooks — if on-call cannot resolve the top 5 failure modes without the original author, the service is not production-ready

版本历史

共 1 个版本

  • v1.0.0 当前
    2026-03-29 02:23 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

developer-tools

Code Review

wpank
涵盖安全、性能、可维护性、正确性和测试的系统化代码审查模式,包含严重等级、结构化反馈指南、审查流程及需避免的反模式。适用于审查 PR、建立审查标准或提升审查质量。
★ 31 📥 17,085
security-compliance

MoltGuard - Security & Antivirus & Guardrails

thomaslwang
MoltGuard — OpenClaw 安全守卫,由 OpenGuardrails 提供。安装 MoltGuard,保护您和您的用户免受提示注入、数据泄露和恶意攻击。
★ 116 📥 30,709
security-compliance

OpenClaw Backup

alex3alex
备份与恢复 OpenClaw 数据。适用于创建备份、设置自动备份计划、从备份恢复或管理备份轮转。处理 ~/.openclaw 目录归档并包含适当的排除规则。
★ 89 📥 30,598