
SRE Practices

Deep SRE workflow—SLOs/SLIs, error budgets, alerting, toil reduction, incident readiness, capacity, and balancing reliability with delivery. Use when improvi...
Author: codekungfu. Source: clawhub, v1.0.0 (1 version).

Overview

SRE Practices (Deep Workflow)

SRE is not “ops with a fancy title.” It is reliability engineering that makes explicit trade-offs between velocity and stability, measures them with SLOs, and manages them through error budgets and toil budgets.

When to Offer This Workflow

Trigger conditions:

  • Defining or revisiting SLOs
  • Too many pages, or too few alerts
  • “We need five nines” without user-visible meaning
  • High toil: manual deploys, ticket-driven scaling, runbooks that never shrink
  • Post-incident push for “more reliability” without cost discussion

Initial offer:

Walk through six stages: (1) user journeys & SLIs, (2) SLO targets & windows, (3) error budgets & policy, (4) alerting & on-call, (5) toil & automation, (6) continuous improvement. Confirm service tiering and business criticality.


Stage 1: User Journeys & SLIs

Goal: Measure what users actually experience, not only server uptime.

Activities

  • List critical journeys: signup, pay, search, API sync, etc.
  • For each, pick SLI types: availability, latency, freshness, correctness (where measurable)
  • Define the SLI implementation: e.g., “the ratio of successful HTTP 2xx responses at the LB to all requests, excluding health checks” vs deeper synthetic probes
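
The ratio-style SLI in the last bullet can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `Request` record, the health-check paths, and the sample traffic are all hypothetical, and in practice the counts would come from load balancer metrics or logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    path: str
    status: int


# Hypothetical health-check endpoints; substitute whatever your LB probes.
HEALTH_CHECK_PATHS = frozenset({"/healthz", "/ready"})


def availability_sli(requests):
    """Fraction of non-health-check requests that returned HTTP 2xx."""
    eligible = [r for r in requests if r.path not in HEALTH_CHECK_PATHS]
    if not eligible:
        return None  # no user traffic: treat the SLI as undefined, not 100%
    good = sum(1 for r in eligible if 200 <= r.status < 300)
    return good / len(eligible)


# Made-up sample traffic for illustration only.
sample = [
    Request("/pay", 200),
    Request("/pay", 503),
    Request("/search", 200),
    Request("/healthz", 200),
    Request("/signup", 201),
]

print(f"availability SLI: {availability_sli(sample):.2%}")
```

Returning `None` for an empty window is a design choice: reporting 100% when there was no traffic would hide outages that stop requests from arriving at all.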

Good SLIs

  • Specific, measurable, and aligned with user pain; avoid vanity metrics

Exit condition: SLI definitions documented with data sources (metrics, logs, probes).


Stage 2: SLO Targets & Windows

Goal: Set achievable targets with explicit consequences.

Process

  • Choose a window: a rolling 30 days is common; align it with the release cadence
  • Set the target (e.g., 99.9% availability) from error budget math: 99.9% over 30 days allows about 43 minutes of downtime
  • Tier services: not everything needs 99.99%
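
The error budget math behind these targets is simple enough to check by hand. A minimal sketch, assuming a time-based availability SLO over a rolling window (the function name is illustrative, not from any library):

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime a time-based availability SLO permits."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# Compare tiers: each extra nine cuts the budget by a factor of ten.
for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.3%} over 30d -> {minutes:.2f} min of downtime")
```

Over 30 days, 99.9% leaves roughly 43 minutes and 99.99% roughly 4 minutes, which is why tiering matters: the tighter target leaves little room for even one slow human response.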

Realism

  • Account for dependencies you don’t control (public cloud, third-party APIs): a service that hard-depends on them cannot sustain an SLO above theirs unless the architecture isolates their failures.
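
A rough way to sanity-check a target against dependencies is to multiply availabilities for hard (serial) dependencies and to combine redundant paths. The sketch below assumes failures are independent, which real systems often violate, so treat the results as optimistic estimates; all the numbers are made up.

```python
import math


def serial_availability(availabilities):
    """All components must be up: multiply (assumes independent failures)."""
    return math.prod(availabilities)


def redundant_availability(availabilities):
    """At least one replica must be up (assumes independent failures)."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)


# Hypothetical figures: own service, cloud provider, third-party API.
chain = [0.9995, 0.9995, 0.999]
print(f"serial chain:   {serial_availability(chain):.4%}")

# Two independent providers behind a fallback, each at 99.9%.
print(f"redundant pair: {redundant_availability([0.999, 0.999]):.4%}")
```

With these inputs the serial chain lands near 99.8%, below every individual component, while the redundant pair lands well above either member. That is the sense in which isolation lets a service exceed a dependency's SLO.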

Exit condition: Published SLO document per service or journey with measurement method.


Stage 3: Error Budget Policy

Goal: Decide how to spend budget—feature velocity vs reliability work.

Policy Examples

  • Budget healthy → ship aggressively; budget low → freeze risky changes, focus on reliability
  • Exceptions process: who can override, with what review

Communication

  • Product/engineering shared ownership of budget—not “SRE says no” in the dark

Exit condition: Written policy stating what happens when 25%, 50%, and 100% of the error budget has been consumed.
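
A policy like this can be made mechanical. The sketch below is one possible encoding, assuming a request-based SLO whose good and total counts cover the SLO window; the threshold actions are hypothetical placeholders for whatever product and engineering agree on.

```python
def budget_consumed(slo_target, good_events, total_events):
    """Fraction of the error budget used (1.0 means fully spent)."""
    if total_events == 0:
        return 0.0
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return actual_bad / allowed_bad


# Hypothetical actions, checked from the highest threshold down.
FREEZE = "freeze risky changes; reliability work only"
REVIEW = "require reliability review for risky launches"
WATCH = "notify owners; watch burn rate"
SHIP = "ship normally"

POLICY = [(1.00, FREEZE), (0.50, REVIEW), (0.25, WATCH), (0.00, SHIP)]


def policy_action(consumed):
    """Return the action for the highest threshold that has been reached."""
    for threshold, action in POLICY:
        if consumed >= threshold:
            return action
    return SHIP


consumed = budget_consumed(0.999, good_events=999_400, total_events=1_000_000)
print(f"budget consumed: {consumed:.0%} -> {policy_action(consumed)}")
```

Keeping the thresholds in a plain table makes the policy easy to review with product, and an exceptions process can be layered on top without touching the arithmetic.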


Stage 4: Alerting & On-Call

Goal: Pages are symptom-based, actionable, low noise.

Principles

  • Alert on user pain or imminent SLO threat, not every blip
  • Severity maps to response: e.g., SEV1 for customer-wide impact versus a non-paging warning
  • Runbooks linked; ownership clear
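
One common way to alert on “imminent SLO threat” rather than every blip is burn-rate alerting over two windows. The sketch below is an assumption layered on the principles above, not something the workflow prescribes: the 14.4 threshold is a frequently used value that corresponds to spending 2% of a 30-day budget in one hour (0.02 × 720 h), and the error ratios are made up.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only if both windows burn fast.

    The long window (e.g., 1h) shows the problem is significant; the short
    window (e.g., 5m) shows it is still happening right now.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


SLO = 0.999

# Sustained outage: 2% errors over the hour, 3% over the last five minutes.
print(should_page(0.02, 0.03, SLO))

# Brief blip: a short spike, but the hour as a whole is barely affected.
print(should_page(0.002, 0.05, SLO))
```

The first case pages and the second does not, which is the point: a spike that has not meaningfully dented the budget can become a ticket or a warning instead of waking someone up.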

On-Call Health

  • Limit pages per engineer per week; track toil hours
  • Post-incident follow-through to reduce repeat pages
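
Tracking page load per engineer needs little more than a counter keyed by ISO week. A minimal sketch, assuming pages are available as `(engineer, date)` pairs; the weekly limit, names, and dates are hypothetical.

```python
from collections import Counter
from datetime import date

PAGE_LIMIT_PER_WEEK = 5  # hypothetical team limit; pick your own


def pages_per_engineer_week(pages):
    """Count pages per (engineer, ISO year, ISO week)."""
    counts = Counter()
    for engineer, day in pages:
        iso_year, iso_week, _ = day.isocalendar()
        counts[(engineer, iso_year, iso_week)] += 1
    return counts


def over_limit(counts, limit=PAGE_LIMIT_PER_WEEK):
    """Engineer-weeks that exceeded the page limit, sorted for stable output."""
    return sorted(key for key, n in counts.items() if n > limit)


# Made-up history: alice is paged six days in one week, bob twice.
pages = [("alice", date(2024, 3, d)) for d in range(4, 10)]
pages += [("bob", date(2024, 3, 5)), ("bob", date(2024, 3, 6))]

counts = pages_per_engineer_week(pages)
for engineer, year, week in over_limit(counts):
    print(f"{engineer} exceeded the limit in {year}-W{week:02d}")
```

Engineer-weeks that show up here repeatedly are good candidates for the post-incident follow-through in the second bullet.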

Exit condition: Alert inventory reviewed; tuning backlog for noisy alerts.


Stage 5: Toil & Automation

Goal: Reduce manual, repetitive, automatable work with measurable toil budgets.

Identify Toil

  • Frequent tickets, manual scaling, click-ops deploys, data fixes without guardrails

Remediate

  • Eliminate > automate > document—in that preference order when safe
  • Self-service platforms with guardrails beat hero scripts

Exit condition: Toil reduction roadmap with owners. As an aspiration, cap toil at about 50% of each engineer's time (the Google SRE guideline; adapt it to your org).
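
A toil budget is only useful if it is measured. The sketch below shows one way to compare tracked toil hours against the 50% cap; the names and hours are invented, and how toil hours get recorded (tickets, time tracking, surveys) is left open.

```python
TOIL_CAP = 0.50  # the aspirational cap mentioned above; adapt to your org


def toil_fraction(toil_hours, total_hours):
    """Share of working time spent on toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_cap(team_hours, cap=TOIL_CAP):
    """Names whose toil share exceeds the cap, sorted for stable output."""
    return sorted(
        name
        for name, (toil, total) in team_hours.items()
        if toil_fraction(toil, total) > cap
    )


# Hypothetical weekly hours: (toil hours, total hours).
team = {"alice": (22, 40), "bob": (12, 40), "carol": (20, 40)}

for name, (toil, total) in team.items():
    print(f"{name}: {toil_fraction(toil, total):.0%} toil")
print("over cap:", over_cap(team))
```

Anyone listed as over the cap is a signal for the roadmap: the work they are absorbing is a candidate to eliminate or automate first.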


Stage 6: Continuous Improvement

Goal: Reliability work is prioritized like features.

Loops

  • Incident → action items with tracking
  • Game days / failure injection where mature
  • Quarterly SLO review—targets drift with product changes

Final Review Checklist

  • [ ] SLIs tied to user-visible outcomes
  • [ ] SLO targets realistic vs dependencies
  • [ ] Error budget policy agreed with product
  • [ ] Alerts actionable; noise tracked
  • [ ] Toil identified with automation path

Tips for Effective Guidance

  • Translate 99.9% to minutes of downtime per month—makes trade-offs concrete.
  • Never promise zero incidents; promise learning and measurable improvement.
  • Separate SLI (measurement) from SLO (target) from SLA (contract)—terms get confused.

Handling Deviations

  • Early startup: start with basic monitoring + incident reviews before full SLO program.
  • No SRE role: the practices still apply; relabel them as “production excellence” if needed.

Version history

1 version in total

  • v1.0.0 (current)
    2026-03-31 07:10
