Prometheus

Prometheus monitoring: scrape configuration, service discovery, recording rules, alert rules, and production deployment for infrastructure and application metrics.
Author: wpank · Category: Developer Tools · Source: clawhub · Version: v1.0.0 (latest) · API key: not required

Overview

Production Prometheus setup covering scrape configuration, service discovery, recording rules, alert rules, and operational best practices for infrastructure and application monitoring.

When to Use

| Scenario | Example |
|----------|---------|
| Set up metrics collection | New service needs Prometheus scraping |
| Configure service discovery | K8s pods, file-based, or static targets |
| Create recording rules | Pre-compute expensive PromQL queries |
| Design alert rules | SLO-based alerts for availability and latency |
| Production deployment | HA setup with retention and storage planning |
| Troubleshoot scraping | Targets down, metrics missing, relabeling issues |

Architecture

Applications ──(/metrics)──→ Prometheus Server ──→ Alertmanager ──→ Slack/PagerDuty
      ↑                            │
  client libraries                 ├──→ Grafana (dashboards)
  (prom client)                    └──→ Thanos/Cortex (long-term storage)
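
To make the left side of the diagram concrete, here is a minimal sketch of what an application exposes on /metrics. It uses only the Python standard library so the text exposition format is visible; a real service would normally use an official client library instead. The counter store `REQUESTS_TOTAL`, the handler class, and the port are illustrative assumptions, not part of the original setup.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter store, keyed by (method, status).
REQUESTS_TOTAL = {("GET", "200"): 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counters.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the scrape path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 9090 is an assumption chosen to match the example targets in this guide.
    HTTPServer(("0.0.0.0", 9090), MetricsHandler).serve_forever()
```

Prometheus pulls this endpoint on every scrape interval; the application never pushes.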

Installation

Kubernetes (Helm)

helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

Core Configuration

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    region: us-west-2

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: node-exporter
    static_configs:
      - targets: ["node1:9100", "node2:9100", "node3:9100"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Application metrics (TLS)
  - job_name: my-app
    scheme: https
    metrics_path: /metrics
    tls_config:
      ca_file: /etc/prometheus/ca.crt
    static_configs:
      - targets: ["app1:9090", "app2:9090"]

Service Discovery

Kubernetes Pods (Annotation-Based)

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod

Pod annotations to enable scraping:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
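
The `__address__` rewrite rule is the least obvious part of the relabeling above. The sketch below mimics it in Python to show what comes out for a given pod address and port annotation. It assumes the documented relabeling behavior (source labels joined with `;`, regex anchored at both ends); `rewrite_address` is a hypothetical helper, not a Prometheus API.

```python
import re

# Same pattern as the relabel rule. Prometheus anchors relabel regexes at
# both ends, which re.fullmatch reproduces here.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Mimic the replace action: join source labels with ';', then substitute."""
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # No match: the replace action leaves the target label untouched.
        return address
    # Equivalent of `replacement: $1:$2`.
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    print(rewrite_address("10.0.0.5:8080", "9090"))
    print(rewrite_address("10.0.0.5", "9090"))
```

A pod without the port annotation produces an empty second field, the regex does not match, and the discovered address is kept as-is.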

File-Based Discovery

scrape_configs:
  - job_name: file-sd
    file_sd_configs:
      - files: ["/etc/prometheus/targets/*.json"]
        refresh_interval: 5m

targets/production.json:

[{
  "targets": ["app1:9090", "app2:9090"],
  "labels": { "env": "production", "service": "api" }
}]
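
Target files like the one above are usually generated by a script or a configuration management run. A minimal sketch of such a generator follows; it writes through a temporary file and renames it so Prometheus never reads a half-written file. The function name `write_targets` and the file permissions are assumptions made for illustration.

```python
import json
import os
import tempfile


def write_targets(path, targets, labels):
    """Write one file_sd target group atomically (temp file, then rename)."""
    groups = [{"targets": list(targets), "labels": dict(labels)}]
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps the partial file out of the *.json glob.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(groups, handle, indent=2)
            handle.write("\n")
        # mkstemp creates the file as 0600; make it readable by the
        # Prometheus user (assumed to differ from the writer).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)  # atomic within one filesystem on POSIX
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return groups


if __name__ == "__main__":
    write_targets(
        "production.json",
        ["app1:9090", "app2:9090"],
        {"env": "production", "service": "api"},
    )
```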

Discovery Method Comparison

| Method | Best For | Dynamic |
|--------|----------|---------|
| static_configs | Fixed infrastructure, dev | No |
| file_sd_configs | Inventories generated by configuration management | Yes (file watch) |
| kubernetes_sd_configs | K8s workloads | Yes (API watch) |
| consul_sd_configs | Services registered in Consul | Yes (Consul watch) |
| ec2_sd_configs | AWS EC2 instances | Yes (API poll) |

Recording Rules

Pre-compute expensive queries for dashboard and alert performance:

# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_error_rate:ratio
        expr: job:http_errors:rate5m / job:http_requests:rate5m

      - record: job:http_duration:p95
        expr: >
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      - record: instance:node_cpu:utilization
        expr: >
          100 - (avg by (instance)
            (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      - record: instance:node_memory:utilization
        expr: >
          100 - ((node_memory_MemAvailable_bytes
            / node_memory_MemTotal_bytes) * 100)

      - record: instance:node_disk:utilization
        expr: >
          100 - ((node_filesystem_avail_bytes
            / node_filesystem_size_bytes) * 100)
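
The p95 rule relies on `histogram_quantile`, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified Python version of that idea for classic histograms, meant only to show why bucket boundaries limit accuracy. It assumes `q` is between 0 and 1 and skips several edge cases the real function handles; the sample bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (le, count) buckets.

    Simplified sketch: expects q in [0, 1] and a final +Inf bucket.
    """
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if index == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[index]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    if count == below:
        return upper
    # Linear interpolation within the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

Because the result is interpolated, a p95 alert threshold is only as precise as the bucket boundaries around it.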

Naming Convention

level:metric_name:operations

| Part | Example | Meaning |
|------|---------|---------|
| level | job:, instance: | Aggregation level |
| metric_name | http_requests | Base metric |
| operations | :rate5m, :ratio | Applied functions |
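
The convention is easy to check mechanically, for example in a pre-commit hook or CI step. This is a small sketch of such a check; the exact character rules in `RULE_NAME_RE` are an assumption (lowercase snake_case in each part), so adjust them to your own naming policy.

```python
import re

# level:metric_name:operations, e.g. job:http_requests:rate5m
RULE_NAME_RE = re.compile(r"[a-z_][a-z0-9_]*:[a-z_][a-z0-9_]*:[a-z_][a-z0-9_]*")


def follows_convention(name):
    """Return True if a recording rule name has exactly three parts."""
    return RULE_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for rule in ("job:http_requests:rate5m", "http_requests_total"):
        print(rule, follows_convention(rule))
```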

Alert Rules

# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "{{ $labels.job }} down for >1 minute"

      - alert: HighErrorRate
        expr: job:http_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} for {{ $labels.job }}"

      - alert: HighP95Latency
        expr: job:http_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency {{ $value }}s for {{ $labels.job }}"

  - name: resources
    rules:
      - alert: HighCPU
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "CPU {{ $value }}% on {{ $labels.instance }}"

      - alert: HighMemory
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Memory {{ $value }}% on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Disk {{ $value }}% on {{ $labels.instance }}"
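
The `for` clause is what keeps these alerts from firing on a single bad evaluation. The sketch below simulates the pending-to-firing transition over a series of rule evaluations. It is an illustration of the semantics, not Prometheus code; the function name and the 15s evaluation interval are assumptions taken from the example config.

```python
def alert_states(condition_results, interval_s=15, for_s=60):
    """Return inactive/pending/firing for each rule evaluation.

    condition_results: one boolean per evaluation, True when the alert
    expression returned a result for the series.
    """
    states = []
    active_since = None
    for step, is_true in enumerate(condition_results):
        now = step * interval_s
        if not is_true:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


if __name__ == "__main__":
    evaluations = [True, True, True, False, True, True, True, True, True]
    print(alert_states(evaluations))
```

Note how the single False in the middle resets the timer, so a flapping condition may never reach firing.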

Alert Severity Guide

| Severity | Condition | Response |
|----------|-----------|----------|
| critical | Service down, data loss risk | Page on-call immediately |
| warning | Degraded, approaching limit | Investigate within hours |
| info | Notable but not urgent | Review on the next business day |

Validation

# Validate config syntax
promtool check config prometheus.yml

# Validate rule files
promtool check rules /etc/prometheus/rules/*.yml

# Test a query
promtool query instant http://localhost:9090 'up'

# Reload config without restart (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
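
When the reload is triggered from a deploy script rather than a shell, the same call can be made with the Python standard library. This is a minimal sketch; the helper names are hypothetical and the base URL is the localhost address used in the commands above.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build the POST request for the /-/reload lifecycle endpoint."""
    url = f"{base_url.rstrip('/')}/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=10):
    """Trigger a reload; urlopen raises HTTPError on a non-2xx response."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200


if __name__ == "__main__":
    print(reload_prometheus())
```

A successful HTTP response only means the reload was accepted, so it is still worth checking prometheus_config_last_reload_successful afterwards.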

Best Practices

| Practice | Detail |
|----------|--------|
| Naming: prefix_name_unit | snake_case, _total for counters, _seconds/_bytes for units |
| Scrape intervals of 15–60s | Shorter intervals add target load and storage cost |
| Recording rules for dashboards | Pre-compute anything queried repeatedly |
| Monitor Prometheus itself | prometheus_tsdb_*, scrape_duration_seconds |
| HA deployment | 2+ instances scraping the same targets |
| Retention planning | Match --storage.tsdb.retention.time to disk capacity |
| Federation for scale | A global Prometheus aggregates from regional instances |
| Long-term storage | Thanos or Cortex for retention beyond 30d |
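
For retention planning, a rough disk estimate is retention time multiplied by ingested samples per second and bytes per sample. The sketch below applies that formula. The default of 2 bytes per sample is an assumption at the upper end of the 1 to 2 bytes the Prometheus storage documentation cites as typical, and the result ignores WAL and compaction overhead, so treat it as a lower bound and leave headroom.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough TSDB disk need: retention * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical workload: 1,000,000 active series, 15s scrapes, 30d retention.
    needed = estimate_disk_bytes(1_000_000, 15, 30)
    print(f"{needed / 1e9:.1f} GB")
```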

Troubleshooting Quick Reference

| Problem | Diagnosis | Fix |
|---------|-----------|-----|
| Target shows DOWN | Check the /targets page for the scrape error | Fix firewall, verify endpoint, check TLS |
| Metrics missing | Query up{job="x"} | Verify scrape config, check the /metrics endpoint |
| High cardinality | prometheus_tsdb_head_series growing | Drop high-cardinality labels with metric_relabel_configs |
| Storage filling up | Check prometheus_tsdb_storage_* | Reduce retention, add disk, or cap usage with --storage.tsdb.retention.size |
| Slow queries | Check prometheus_engine_query_duration_seconds | Add recording rules, reduce range, limit series |
| Config not applied | Check prometheus_config_last_reload_successful | Fix syntax, POST /-/reload |

NEVER Do

| Anti-Pattern | Why | Do Instead |
|--------------|-----|------------|
| Scrape interval < 5s | Overwhelms targets and storage | Use 15–60s intervals |
| High-cardinality labels (user ID, request ID) | Explodes TSDB series count | Use logs for high-cardinality data |
| Alert without a for duration | Fires on transient spikes | Always set for: 1m or longer |
| Skip recording rules | Dashboards compute expensive queries on every load | Pre-compute with recording rules |
| Store secrets in prometheus.yml | Config often lives in Git | Use file-based secrets (password_file, credentials_file); Prometheus does not expand environment variables in scrape configs |
| Ignore the up metric | Targets go down silently | Alert on up == 0 for all jobs |
| Single Prometheus instance in prod | Single point of failure | Run 2+ replicas scraping the same targets |
| Retention not sized to disk | Disk fills, Prometheus crashes | Set explicit --storage.tsdb.retention.time |
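
The high-cardinality row is easier to argue for with numbers. Each distinct combination of label values is its own time series, so the worst case for one metric is the product of the distinct values per label. The sketch below computes that upper bound; the label names and counts are made-up inputs.

```python
from math import prod


def series_upper_bound(label_value_counts):
    """Worst-case series count for one metric: product of distinct label values."""
    return prod(label_value_counts.values())


if __name__ == "__main__":
    # Hypothetical label sets for one HTTP counter.
    bounded = {"method": 5, "status": 10, "instance": 20}
    with_user_id = dict(bounded, user_id=100_000)
    print(series_upper_bound(bounded))
    print(series_upper_bound(with_user_id))
```

Real counts are usually lower because not every combination occurs, but a single unbounded label is enough to multiply the total by orders of magnitude.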

Templates

| Template | Description |
|----------|-------------|
| templates/prometheus.yml | Full config with static, file-based, and K8s discovery |
| templates/alert-rules.yml | 25+ alert rules by category |
| templates/recording-rules.yml | Pre-computed metrics for HTTP, latency, resources, SLOs |

Version History

1 version in total

  • v1.0.0 (current)
    2026-03-29 02:17
