
Logging Observability

Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, Prometheus/Grafana stack, alert design, and PII/secret scrubbing.
wpank

Logging & Observability

Patterns for building observable systems across the three pillars: logs, metrics, and traces.

Three Pillars

| Pillar | Purpose | Question It Answers | Example |
|--------|---------|---------------------|---------|
| Logs | What happened | Why did this request fail? | {"level":"error","msg":"payment declined","user_id":"u_82"} |
| Metrics | How much / how fast | Is latency increasing? | http_request_duration_seconds{route="/api/orders"} 0.342 |
| Traces | Request flow | Where is the bottleneck? | Span: api-gateway → auth → order-service → db |

Each pillar is strongest when correlated. Embed trace_id in every log line to jump from a log entry to the full distributed trace.


Structured Logging

Always emit logs as structured JSON — never free-text strings.

Required Fields

| Field | Purpose | Required |
|-------|---------|----------|
| timestamp | ISO-8601 with milliseconds | Yes |
| level | Severity (DEBUG … FATAL) | Yes |
| service | Originating service name | Yes |
| message | Human-readable description | Yes |
| trace_id | Distributed trace correlation | Yes |
| span_id | Current span within trace | Yes |
| correlation_id | Business-level correlation (order ID) | When applicable |
| error | Structured error object | On errors |
| context | Request-specific metadata | Recommended |
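
The required fields can be pinned down in a shared type so every service emits the same shape. The following is a minimal sketch; `LogEntry` and `makeLogEntry` are hypothetical names chosen for illustration, not part of any library.

```typescript
// Hypothetical schema mirroring the required-fields table.
type LogLevel = "DEBUG" | "INFO" | "WARN" | "ERROR" | "FATAL";

interface LogEntry {
  timestamp: string; // ISO-8601 with milliseconds
  level: LogLevel;
  service: string;
  message: string;
  trace_id: string;
  span_id: string;
  correlation_id?: string; // e.g. an order ID, when applicable
  error?: { name: string; message: string; stack?: string };
  context?: Record<string, unknown>;
}

// Fills in the timestamp so callers cannot forget it or use another format.
function makeLogEntry(
  fields: Omit<LogEntry, "timestamp">,
  now: Date = new Date(),
): LogEntry {
  return { timestamp: now.toISOString(), ...fields };
}

// Example values are made up for illustration.
const entry = makeLogEntry(
  {
    level: "ERROR",
    service: "order-service",
    message: "payment declined",
    trace_id: "4bf92f3577b34da6a3ce929d0e0e4736",
    span_id: "00f067aa0ba902b7",
    correlation_id: "ORD-1234",
    error: { name: "PaymentError", message: "CARD_DECLINED" },
  },
  new Date(Date.UTC(2024, 0, 15, 10, 30, 0, 123)),
);
```

Making `trace_id` and `span_id` non-optional in the type turns a missing correlation ID into a compile-time error rather than a gap discovered during an incident.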

Context Enrichment

Attach context at the middleware level so downstream logs inherit automatically:

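import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Per-request context store; loggers read it via asyncLocalStorage.getStore().
const asyncLocalStorage = new AsyncLocalStorage();

// `app` is assumed to be an existing Express application.
app.use((req, res, next) => {
  const ctx = {
    // Reuse the caller's trace ID when present so logs correlate across services.
    trace_id: req.headers['x-trace-id'] || randomUUID(),
    request_id: randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  // Everything invoked from next() runs inside this context.
  asyncLocalStorage.run(ctx, () => next());
});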
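
Once the middleware has stored the context, a logger can merge it into every line without callers passing it around. A minimal sketch, assuming Node's built-in `AsyncLocalStorage`; `requestContext`, `formatLog`, `logInfo`, and the `order-service` name are illustrative.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

type Fields = Record<string, unknown>;

// Illustrative store; it plays the role of asyncLocalStorage in the middleware.
const requestContext = new AsyncLocalStorage<Fields>();

// Builds one JSON log line, merging the ambient request context with
// the fields supplied at the call site.
function formatLog(level: string, message: string, fields: Fields = {}): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    service: "order-service", // assumed service name
    message,
    ...requestContext.getStore(), // undefined outside a request; spreads to nothing
    ...fields,
  });
}

function logInfo(message: string, fields?: Fields): void {
  process.stdout.write(formatLog("INFO", message, fields) + "\n");
}

// Usage: anything called inside run() inherits trace_id automatically.
const line = requestContext.run(
  { trace_id: "abc123", path: "/api/orders" },
  () => formatLog("INFO", "order placed", { order_id: "ORD-1234" }),
);
```

Call-site fields are spread last, so they override context keys with the same name; reverse the order if the request context should win.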

Library Recommendations

| Library | Language | Strengths | Performance |
|---------|----------|-----------|-------------|
| Pino | Node.js | Fastest Node logger, low overhead | Excellent |
| structlog | Python | Composable processors, context binding | Good |
| zerolog | Go | Zero-allocation JSON logging | Excellent |
| zap | Go | High performance, typed fields | Excellent |
| tracing | Rust | Spans + events, async-aware | Excellent |

Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.


Log Levels

| Level | When to Use | Example |
|-------|-------------|---------|
| FATAL | App cannot continue, process will exit | Database connection pool exhausted |
| ERROR | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| WARN | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| INFO | Normal business events | Order ORD-1234 placed successfully |
| DEBUG | Developer troubleshooting | Cache miss for key user:82:preferences |
| TRACE | Very fine-grained (rarely in prod) | Entering validateAddress with payload |

Rules: Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.
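
The "INFO and above" default amounts to a threshold check over an ordered list of levels. A minimal sketch with a threshold that can be changed while the process runs; `setMinLevel` and `isEnabled` are hypothetical helpers, and how `setMinLevel` is triggered (signal, admin endpoint, config watcher) is left open.

```typescript
// Severity order from least to most severe, matching the table above.
const LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"] as const;
type Level = (typeof LEVELS)[number];

// Production default: INFO and above.
let minLevel: Level = "INFO";

// Changes the threshold at runtime, with no redeploy.
function setMinLevel(level: Level): void {
  minLevel = level;
}

// True when a message at `level` should be emitted under the current threshold.
function isEnabled(level: Level): boolean {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(minLevel);
}
```

Checking `isEnabled` before building a log line also avoids paying for serialisation of DEBUG and TRACE payloads that would be dropped anyway.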


Distributed Tracing

OpenTelemetry Setup

Always prefer OpenTelemetry over vendor-specific SDKs:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

Span Creation

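import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

// Order, validateInventory, and chargePayment are assumed to be defined elsewhere.
async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      // Catch variables are `unknown` in strict TypeScript; normalise before use.
      const error = err instanceof Error ? err : new Error(String(err));
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw err;
    } finally {
      // Always end the span, otherwise it is never exported.
      span.end();
    }
  });
}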

Context Propagation

  • Use W3C Trace Context (traceparent header) — default in OTel
  • Propagate across HTTP, gRPC, and message queues
  • For async workers: serialise traceparent into the job payload
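
For the async-worker case, the traceparent value is a single string in the W3C version-00 layout `00-<trace-id>-<parent-id>-<trace-flags>`, so it fits naturally into a job payload. The sketch below formats and parses it by hand to show the mechanics; the job shape and helper names are hypothetical, and in a real service the OpenTelemetry API's `propagation.inject` and `propagation.extract` would normally do this work.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex chars
  spanId: string; // 16 lowercase hex chars
  sampled: boolean;
}

// W3C Trace Context, version 00.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header);
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of trace-flags is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Producer side: serialise the context into the job payload (hypothetical shape).
const job = {
  type: "send-receipt",
  orderId: "ORD-1234",
  traceparent: formatTraceparent({
    traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
    spanId: "00f067aa0ba902b7",
    sampled: true,
  }),
};

// Consumer side: restore it before starting the worker's span.
const restored = parseTraceparent(job.traceparent);
```

Returning `null` for a malformed header lets the worker fall back to starting a fresh trace instead of failing the job.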

Trace Sampling

| Strategy | Use When |
|----------|----------|
| Always On | Low-traffic services, debugging |
| Probabilistic (N%) | General production use |
| Rate-limited (N/sec) | High-throughput services |
| Tail-based | When you need all error traces |

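Aim to keep 100% of error traces regardless of strategy. Head-based strategies decide before the outcome of a request is known, so guaranteeing this requires a tail-based stage (for example in the OpenTelemetry Collector) that sees the finished trace.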
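
Keeping every error trace is a decision that can only be made once the trace's outcome is known. A rough sketch of such a keep/drop rule, combining "all errors" with a deterministic probabilistic fraction of everything else; `FinishedTrace` and `shouldKeep` are hypothetical names, not a real sampler API.

```typescript
interface FinishedTrace {
  traceId: string; // hex string
  hasError: boolean;
}

// Keep every error trace; keep a deterministic fraction of the rest.
// Deriving the decision from the trace ID means every component applying
// the same rule keeps or drops the same traces.
function shouldKeep(trace: FinishedTrace, ratio: number): boolean {
  if (trace.hasError) return true;
  // Map the last 8 hex chars of the trace ID to a number in [0, 1).
  const bucket = parseInt(trace.traceId.slice(-8), 16) / 0x100000000;
  return bucket < ratio;
}
```

With `ratio = 0.1`, roughly one in ten healthy traces survives, assuming trace IDs are uniformly random, while failed requests are always retained.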


Metrics Collection

RED Method (Request-Driven)

Monitor these three for every service endpoint:

| Metric | What It Measures | Prometheus Example |
|--------|------------------|--------------------|
| Rate | Requests/sec | rate(http_requests_total[5m]) |
| Errors | Failed request ratio | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) |
| Duration | Response time (p99) | histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) |
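
To make the three RED signals concrete, here is a small in-process sketch of what a service records: a request counter, an error counter, and a duration histogram. The bucket bounds and the `RedMetrics` class are illustrative; a real service would normally use a Prometheus client library and let Prometheus compute rates and quantiles from the exported counters.

```typescript
// Histogram bucket upper bounds in seconds (illustrative choice).
const BUCKETS = [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5];

class RedMetrics {
  private requests = 0;
  private errors = 0;
  // One slot per bucket plus a final slot for +Inf.
  private bucketCounts = new Array<number>(BUCKETS.length + 1).fill(0);

  observe(status: number, durationSeconds: number): void {
    this.requests += 1;
    if (status >= 500) this.errors += 1;
    const i = BUCKETS.findIndex((le) => durationSeconds <= le);
    this.bucketCounts[i === -1 ? BUCKETS.length : i] += 1;
  }

  // Errors: failed requests as a fraction of all requests.
  errorRatio(): number {
    return this.requests === 0 ? 0 : this.errors / this.requests;
  }

  // Duration: upper bound of the bucket containing the q-th quantile.
  // (Prometheus interpolates within the bucket; this sketch does not.)
  quantileUpperBound(q: number): number {
    if (this.requests === 0) return NaN;
    const target = q * this.requests;
    let cumulative = 0;
    for (let i = 0; i < BUCKETS.length; i++) {
      cumulative += this.bucketCounts[i];
      if (cumulative >= target) return BUCKETS[i];
    }
    return Infinity;
  }
}
```

Rate is deliberately absent as a method: the service only exposes a monotonically increasing count, and the per-second rate is derived at query time over a chosen window.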

USE Method (Resource-Driven)

For infrastructure components (CPU, memory, disk, network):

| Metric | What It Measures | Example |
|--------|------------------|---------|
| Utilization | % resource busy | CPU usage at 78% |
| Saturation | Work queued/waiting | 12 requests queued in thread pool |
| Errors | Error events on resource | 3 disk I/O errors in last minute |

Monitoring Stack

| Tool | Category | Best For |
|------|----------|----------|
| Prometheus | Metrics | Pull-based metrics, alerting rules |
| Grafana | Visualisation | Dashboards for metrics, logs, traces |
| Jaeger | Tracing | Distributed trace visualisation |
| Loki | Logs | Log aggregation (pairs with Grafana) |
| OpenTelemetry | Collection | Vendor-neutral telemetry collection |

Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.


Alert Design

Severity Levels

| Severity | Response Time | Example |
|----------|---------------|---------|
| P1 | Immediate | Service fully down, data loss |
| P2 | < 30 min | Error rate > 5%, latency p99 > 5s |
| P3 | Business hours | Disk > 80%, cert expiring in 7 days |
| P4 | Best effort | Non-critical deprecation warning |

Alert Fatigue Prevention

  • Alert on symptoms, not causes — "error rate > 5%" not "pod restarted"
  • Multi-window, multi-burn-rate — catch both sudden spikes and slow burns
  • Require runbook links — every alert must link to diagnosis and remediation
  • Review monthly — delete or tune alerts that never fire or always fire
  • Group related alerts — use inhibition rules to suppress child alerts
  • Set appropriate thresholds — if alert fires daily and is ignored, raise threshold or delete
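
The multi-window, multi-burn-rate idea reduces to a small calculation: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and an alert fires only when both a long and a short window exceed the same threshold. A minimal sketch; the helper names are hypothetical, and the thresholds used in the usage lines (14.4 for a 1h/5m pair, 6 for a 6h/30m pair) are commonly cited values for a 30-day SLO, shown here as assumptions rather than recommendations.

```typescript
// Burn rate = observed error ratio / error budget, where budget = 1 - SLO target.
// A burn rate of 1 spends the budget exactly over the SLO period.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

interface WindowPair {
  longWindowErrorRatio: number; // e.g. measured over 1h
  shortWindowErrorRatio: number; // e.g. measured over 5m
  threshold: number; // burn-rate threshold for this pair
}

// Fire only when both windows exceed the threshold: the long window shows the
// burn is significant, the short window shows it is still happening.
function shouldAlert(pair: WindowPair, sloTarget: number): boolean {
  return (
    burnRate(pair.longWindowErrorRatio, sloTarget) > pair.threshold &&
    burnRate(pair.shortWindowErrorRatio, sloTarget) > pair.threshold
  );
}

// Usage with a 99.9% SLO (error budget 0.1%).
const suddenSpike = shouldAlert(
  { longWindowErrorRatio: 0.02, shortWindowErrorRatio: 0.03, threshold: 14.4 },
  0.999,
);
const alreadyRecovered = shouldAlert(
  { longWindowErrorRatio: 0.02, shortWindowErrorRatio: 0.0005, threshold: 14.4 },
  0.999,
);
```

The second case shows why the short window matters: the hour as a whole looks bad, but the last few minutes are healthy, so paging someone would only add noise.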

Dashboard Patterns

Overview Dashboard ("War Room")

  • Total requests/sec across all services
  • Global error rate (%) with trendline
  • p50 / p95 / p99 latency
  • Active alerts count by severity
  • Deployment markers overlaid on graphs

Service Dashboard (Per-Service)

  • RED metrics for each endpoint
  • Dependency health (upstream/downstream success rates)
  • Resource utilisation (CPU, memory, connections)
  • Top errors table with count and last seen

Observability Checklist

Every service must have:

  • [ ] Structured JSON logging with consistent schema
  • [ ] Correlation / trace IDs propagated on all requests
  • [ ] RED metrics exposed for every external endpoint
  • [ ] Health check endpoints (/healthz and /readyz)
  • [ ] Distributed tracing with OpenTelemetry
  • [ ] Dashboards for RED metrics and resource utilisation
  • [ ] Alerts for error rate, latency, and saturation with runbook links
  • [ ] Log level configurable at runtime without redeployment
  • [ ] PII scrubbing verified and tested
  • [ ] Retention policies defined for logs, metrics, and traces
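
For the health check item, /healthz usually answers "is the process alive" and /readyz answers "can it take traffic". A minimal sketch using Node's built-in http module; the `ready` flag, the `healthStatus` helper, and the port mentioned in the comment are assumptions for illustration.

```typescript
import { createServer } from "node:http";

// Liveness: the process is up. Readiness: dependencies are usable.
// `ready` would be flipped by startup and shutdown hooks (hypothetical wiring).
let ready = false;

// Pure mapping from path and readiness to an HTTP status code.
function healthStatus(path: string, isReady: boolean): number {
  if (path === "/healthz") return 200;
  if (path === "/readyz") return isReady ? 200 : 503;
  return 404;
}

const server = createServer((req, res) => {
  const status = healthStatus(req.url ?? "", ready);
  res.writeHead(status, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ status }));
});

// Calling server.listen(8080) would start serving; set ready = true only after
// database pools, caches, and other dependencies are connected.
```

Returning 503 from /readyz while keeping /healthz at 200 lets an orchestrator stop routing traffic to an instance without restarting it.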

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references |
| Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow |
| Unstructured logs | Cannot query or alert on fields | Use structured JSON with consistent schema |
| String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in message |
| Missing correlation IDs | Cannot trace across services | Generate and propagate trace_id everywhere |
| Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication |
| Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as label |
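
High cardinality often sneaks in through the route label, when raw paths such as /api/orders/1234 are recorded as-is. One way to bound it is to collapse variable segments before labelling. A minimal sketch; `routeLabel` is a hypothetical helper and the two patterns (numeric IDs and UUIDs) are illustrative, not exhaustive.

```typescript
// Matches a canonical UUID path segment.
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Collapses variable path segments so the "route" label has a bounded set of values.
function routeLabel(path: string): string {
  return path
    .split("?")[0] // drop the query string
    .split("/")
    .map((segment) =>
      /^\d+$/.test(segment) || UUID.test(segment) ? ":id" : segment,
    )
    .join("/");
}
```

When the web framework exposes the matched route template, using that template directly is preferable to pattern-matching on the raw path.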

NEVER Do

  1. NEVER log passwords, tokens, API keys, or secrets — even at DEBUG level
  2. NEVER use console.log / print in production — use a structured logger
  3. NEVER use user IDs, emails, or request IDs as metric labels — cardinality will explode
  4. NEVER create alerts without a runbook link — unactionable alerts erode trust
  5. NEVER rely on logs alone — you need metrics and traces for full observability
  6. NEVER log request/response bodies by default — opt-in only, with PII redaction
  7. NEVER ignore log volume — set budgets and alert when a service exceeds daily quota
  8. NEVER skip context propagation in async flows — broken traces are worse than no traces
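
Rules 1 and 6 are easiest to enforce in one place: a scrubbing step that every log payload passes through before serialisation. A minimal sketch; the key pattern, the email pattern, and the `scrub` name are assumptions to adapt to your own schema, and a scrubber like this should be covered by tests rather than trusted by inspection.

```typescript
// Key names treated as sensitive (illustrative list; extend for your schema).
const SENSITIVE_KEY = /password|passwd|secret|token|api[_-]?key|authorization|cookie/i;
// Loose email matcher used to mask addresses embedded in string values.
const EMAIL = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi;

// Returns a deep copy with sensitive keys replaced and emails masked in strings.
function scrub(value: unknown): unknown {
  if (typeof value === "string") return value.replace(EMAIL, "[EMAIL]");
  if (Array.isArray(value)) return value.map((item) => scrub(item));
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : scrub(inner);
    }
    return out;
  }
  return value;
}

// Usage: scrub the fields object before handing it to the logger.
const safe = scrub({
  user: "user@example.com",
  password: "hunter2",
  nested: { apiKey: "k-123", note: "contact bob@example.com", attempts: 3 },
});
```

Redacting by key name catches secrets regardless of their value format, while the email pattern handles PII that ends up inside free-text fields.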

Version History

1 version in total

  • v0.1.0 (current)
    2026-03-28 20:24
