Runbook Generator

Tier: POWERFUL

Category: Engineering

Domain: DevOps / Site Reliability Engineering

Overview

Analyze a codebase and generate production-grade operational runbooks. Detects your stack (CI/CD, database, hosting, containers), then produces step-by-step runbooks with copy-paste commands, verification checks, rollback procedures, escalation paths, and time estimates. Keeps runbooks fresh with staleness detection linked to config file modification dates.

Core Capabilities

Stack detection — auto-identify CI/CD, database, hosting, orchestration from repo files
Runbook types — deployment, incident response, database maintenance, scaling, monitoring setup
Format discipline — numbered steps, copy-paste commands, ✅ verification checks, time estimates
Escalation paths — L1 → L2 → L3 with contact info and decision criteria
Rollback procedures — every deployment step has a corresponding undo
Staleness detection — runbook sections reference config files; flag when source changes
Testing methodology — dry-run framework for staging validation, quarterly review cadence

When to Use

Use when:

A codebase has no runbooks and you need to bootstrap them fast
Existing runbooks are outdated or incomplete (point at the repo, regenerate)
Onboarding a new engineer who needs clear operational procedures
Preparing for an incident response drill or audit
Setting up monitoring and on-call rotation from scratch

Skip when:

The system is too early-stage to have stable operational patterns
Runbooks already exist and only need minor updates (edit directly)

Stack Detection

When given a repo, scan for these signals before writing a single runbook line:

# CI/CD
ls .github/workflows/     → GitHub Actions
ls .gitlab-ci.yml         → GitLab CI
ls Jenkinsfile            → Jenkins
ls .circleci/             → CircleCI
ls bitbucket-pipelines.yml → Bitbucket Pipelines

# Database
grep -r "postgresql\|postgres\|pg" package.json pyproject.toml → PostgreSQL
grep -r "mysql\|mariadb"           package.json               → MySQL
grep -r "mongodb\|mongoose"        package.json               → MongoDB
grep -r "redis"                    package.json               → Redis
ls prisma/schema.prisma            → Prisma ORM (check provider field)
ls drizzle.config.*                → Drizzle ORM

# Hosting
ls vercel.json                     → Vercel
ls railway.toml                    → Railway
ls fly.toml                        → Fly.io
ls .ebextensions/                  → AWS Elastic Beanstalk
ls terraform/  ls *.tf             → Custom AWS/GCP/Azure (check provider)
ls kubernetes/ ls k8s/             → Kubernetes
ls docker-compose.yml              → Docker Compose

# Framework
ls next.config.*                   → Next.js
ls nuxt.config.*                   → Nuxt
ls svelte.config.*                 → SvelteKit
cat package.json | jq '.scripts'   → Check build/start commands

Map detected stack → runbook templates. A Next.js + PostgreSQL + Vercel + GitHub Actions repo needs:

Deployment runbook (Vercel + GitHub Actions)
Database runbook (PostgreSQL backup, migration, vacuum)
Incident response (with Vercel logs + pg query debugging)
Monitoring setup (Vercel Analytics, pg_stat, alerting)

Runbook Types

1. Deployment Runbook

# Deployment Runbook — [App Name]
**Stack:** Next.js 14 + PostgreSQL 15 + Vercel  
**Last verified:** 2025-03-01  
**Source configs:** vercel.json (modified: git log -1 --format=%ci -- vercel.json)  
**Owner:** Platform Team  
**Est. total time:** 15–25 min  

---

## Pre-deployment Checklist
- [ ] All PRs merged to main
- [ ] CI passing on main (GitHub Actions green)
- [ ] Database migrations tested in staging
- [ ] Rollback plan confirmed

## Steps

### Step 1 — Run CI checks locally (3 min)

pnpm test

pnpm lint

pnpm build

✅ Expected: All pass with 0 errors. Build output in `.next/`

### Step 2 — Apply database migrations (5 min)

Staging first

DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy

✅ Expected: `All migrations have been successfully applied.`

Verify migration applied

psql $STAGING_DATABASE_URL -c "\d" | grep -i migration

✅ Expected: Migration table shows new entry with today's date

### Step 3 — Deploy to production (5 min)

git push origin main

OR trigger manually:

vercel --prod

✅ Expected: Vercel dashboard shows deployment in progress. URL format:
`https://app-name-<hash>-team.vercel.app`

### Step 4 — Smoke test production (5 min)

Health check

curl -sf https://your-app.vercel.app/api/health | jq .

Critical path

curl -sf https://your-app.vercel.app/api/users/me \

-H "Authorization: Bearer $TEST_TOKEN" | jq '.id'

✅ Expected: health returns `{"status":"ok","db":"connected"}`. Users API returns valid ID.

### Step 5 — Monitor for 10 min
- Check Vercel Functions log for errors: `vercel logs --since=10m`
- Check error rate in Vercel Analytics: < 1% 5xx
- Check DB connection pool: `SELECT count(*) FROM pg_stat_activity;` (< 80% of max_connections)

---

## Rollback

If smoke tests fail or error rate spikes:

Instant rollback via Vercel (preferred — < 30 sec)

vercel rollback [previous-deployment-url]

Database rollback (only if migration was applied)

DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate reset --skip-seed

WARNING: This resets to previous migration. Confirm data impact first.


✅ Expected after rollback: Previous deployment URL becomes active. Verify with smoke test.

---

## Escalation
- **L1 (on-call engineer):** Check Vercel logs, run smoke tests, attempt rollback
- **L2 (platform lead):** DB issues, data loss risk, rollback failed — Slack: @platform-lead
- **L3 (CTO):** Production down > 30 min, data breach — PagerDuty: #critical-incidents

2. Incident Response Runbook

# Incident Response Runbook
**Severity levels:** P1 (down), P2 (degraded), P3 (minor)  
**Est. total time:** P1: 30–60 min, P2: 1–4 hours  

## Phase 1 — Triage (5 min)

### Confirm the incident

Is the app responding?

curl -sw "%{http_code}" https://your-app.vercel.app/api/health -o /dev/null

Check Vercel function errors (last 15 min)

vercel logs --since=15m | grep -i "error\|exception\|5[0-9][0-9]"

✅ 200 = app up. 5xx or timeout = incident confirmed.

Declare severity:
- Site completely down → P1 — page L2/L3 immediately
- Partial degradation / slow responses → P2 — notify team channel
- Single feature broken → P3 — create ticket, fix in business hours

---

## Phase 2 — Diagnose (10–15 min)

Recent deployments — did something just ship?

vercel ls --limit=5

Database health

psql $DATABASE_URL -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state != 'idle' LIMIT 20;"

Long-running queries (> 30 sec)

psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';"

Connection pool saturation

psql $DATABASE_URL -c "SELECT count(*), max_conn FROM pg_stat_activity, (SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') t GROUP BY max_conn;"


Diagnostic decision tree:
- Recent deploy + new errors → rollback (see Deployment Runbook)
- DB query timeout / pool saturation → kill long queries, scale connections
- External dependency failing → check status pages, add circuit breaker
- Memory/CPU spike → check Vercel function logs for infinite loops

---

## Phase 3 — Mitigate (variable)

Kill a runaway DB query

psql $DATABASE_URL -c "SELECT pg_terminate_backend();"

Scale DB connections (Supabase/Neon — adjust pool size)

Vercel → Settings → Environment Variables → update DATABASE_POOL_MAX

Enable maintenance mode (if you have a feature flag)

vercel env add MAINTENANCE_MODE true production

vercel --prod # redeploy with flag


---

## Phase 4 — Resolve & Postmortem

After incident is resolved, within 24 hours:

1. Write incident timeline (what happened, when, who noticed, what fixed it)
2. Identify root cause (5-Whys)
3. Define action items with owners and due dates
4. Update this runbook if a step was missing or wrong
5. Add monitoring/alert that would have caught this earlier

**Postmortem template:** `docs/postmortems/YYYY-MM-DD-incident-title.md`

---

## Escalation Path

| Level | Who | When | Contact |
|-------|-----|------|---------|
| L1 | On-call engineer | Always first | PagerDuty rotation |
| L2 | Platform lead | DB issues, rollback needed | Slack @platform-lead |
| L3 | CTO/VP Eng | P1 > 30 min, data loss | Phone + PagerDuty |

3. Database Maintenance Runbook

# Database Maintenance Runbook — PostgreSQL
**Schedule:** Weekly vacuum (automated), monthly manual review  

## Backup

Full backup

pg_dump $DATABASE_URL \

--format=custom \

--compress=9 \

--file="backup-$(date +%Y%m%d-%H%M%S).dump"

✅ Expected: File created, size > 0. `pg_restore --list backup.dump | head -20` shows tables.

Verify backup is restorable (test monthly):

pg_restore --dbname=$STAGING_DATABASE_URL backup.dump

psql $STAGING_DATABASE_URL -c "SELECT count(*) FROM users;"

✅ Expected: Row count matches production.

## Migration

Always test in staging first

DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy

Verify, then:

DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate deploy

✅ Expected: `All migrations have been successfully applied.`

⚠️ For large table migrations (> 1M rows), use `pg_repack` or add column with DEFAULT separately to avoid table locks.

## Vacuum & Reindex

Check bloat before deciding

psql $DATABASE_URL -c "

SELECT schemaname, tablename,

pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,

n_dead_tup, n_live_tup,

ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_ratio

FROM pg_stat_user_tables

ORDER BY n_dead_tup DESC LIMIT 10;"

Vacuum high-bloat tables (non-blocking)

psql $DATABASE_URL -c "VACUUM ANALYZE users;"

psql $DATABASE_URL -c "VACUUM ANALYZE events;"

Reindex (use CONCURRENTLY to avoid locks)

psql $DATABASE_URL -c "REINDEX INDEX CONCURRENTLY users_email_idx;"

✅ Expected: dead_ratio drops below 5% after vacuum.

Staleness Detection

Add a staleness header to every runbook:

## Staleness Check
This runbook references the following config files. If they've changed since the
"Last verified" date, review the affected steps.

| Config File | Last Modified | Affects Steps |
|-------------|--------------|---------------|
| vercel.json | `git log -1 --format=%ci -- vercel.json` | Step 3, Rollback |
| prisma/schema.prisma | `git log -1 --format=%ci -- prisma/schema.prisma` | Step 2, DB Maintenance |
| .github/workflows/deploy.yml | `git log -1 --format=%ci -- .github/workflows/deploy.yml` | Step 1, Step 3 |
| docker-compose.yml | `git log -1 --format=%ci -- docker-compose.yml` | All scaling steps |

Automation: Add a CI job that runs weekly and comments on the runbook doc if any referenced file was modified more recently than the runbook's "Last verified" date.

Runbook Testing Methodology

Dry-Run in Staging

Before trusting a runbook in production, validate every step in staging:

# 1. Create a staging environment mirror
vercel env pull .env.staging
source .env.staging

# 2. Run each step with staging credentials
# Replace all $DATABASE_URL with $STAGING_DATABASE_URL
# Replace all production URLs with staging URLs

# 3. Verify expected outputs match
# Document any discrepancies and update the runbook

# 4. Time each step — update estimates in the runbook
time npx prisma migrate deploy

Quarterly Review Cadence

Schedule a 1-hour review every quarter:

Run each command in staging — does it still work?
Check config drift — compare "Last Modified" dates vs "Last verified"
Test rollback procedures — actually roll back in staging
Update contact info — L1/L2/L3 may have changed
Add new failure modes discovered in the past quarter
Update "Last verified" date at top of runbook

Common Pitfalls

Pitfall	Fix
---	---
Commands that require manual copy of dynamic values	Use env vars — `$DATABASE_URL` not `postgres://user:pass@host/db`
No expected output specified	Add ✅ with exact expected string after every verification step
Rollback steps missing	Every destructive step needs a corresponding undo
Runbooks that never get tested	Schedule quarterly staging dry-runs in team calendar
L3 escalation contact is the former CTO	Review contact info every quarter
Migration runbook doesn't mention table locks	Call out lock risk for large table operations explicitly

Best Practices

Every command must be copy-pasteable — no placeholder text, use env vars
✅ after every step — explicit expected output, not "it should work"
Time estimates are mandatory — engineers need to know if they have time to fix before SLA breach
Rollback before you deploy — plan the undo before executing
Runbooks live in the repo — docs/runbooks/, versioned with the code they describe
Postmortem → runbook update — every incident should improve a runbook
Link, don't duplicate — reference the canonical config file, don't copy its contents into the runbook
Test runbooks like you test code — untested runbooks are worse than no runbooks (false confidence)

runbook-generator

概述