Data Pipelines

Deep data pipeline workflow—ingestion, orchestration, idempotency, data quality, SLAs, observability, and lineage. Use when building batch/stream pipelines,...
Source: mike47512
Uncategorized clawhub v1.0.0 1 version 99763.6 Key: not required

Overview

Data Pipelines

Pipelines fail on silent schema drift, partial writes, and unclear ownership. Design for at-least-once delivery, idempotent sinks, and observable stages.

When to Offer This Workflow

Trigger conditions:

  • Batch or streaming ingestion (Kafka, Fivetran, Airflow, Dagster, Spark, etc.)
  • Late data, backfills, or schema changes breaking jobs
  • SLA misses on freshness or row counts

Initial offer:

Use six stages: (1) requirements & SLAs, (2) source contracts, (3) transforms & idempotency, (4) orchestration & dependencies, (5) quality & monitoring, (6) lineage & operations. Confirm batch vs stream and the cloud stack.


Stage 1: Requirements & SLAs

Goal: Freshness (latency), completeness expectations, cost ceiling, failure tolerance (quarantine vs stop-the-line).

Exit condition: SLA table: pipeline → metric → threshold.
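
The SLA table from the exit condition can be kept as data so that checks run against it. This is a minimal sketch, not a prescribed format: the `Sla` class, the `direction` field, and the `orders_daily` entries are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    pipeline: str
    metric: str       # e.g. "freshness_minutes" or "row_count"
    threshold: float
    direction: str    # "max": observed <= threshold; "min": observed >= threshold

    def is_met(self, observed: float) -> bool:
        if self.direction == "max":
            return observed <= self.threshold
        return observed >= self.threshold


# Hypothetical SLA table: pipeline -> metric -> threshold
SLA_TABLE = [
    Sla("orders_daily", "freshness_minutes", 60, "max"),
    Sla("orders_daily", "row_count", 1000, "min"),
]


def breaches(observations: dict) -> list:
    """Return SLAs whose observed value is missing or out of bounds.

    `observations` maps (pipeline, metric) to the latest measured value.
    A missing measurement counts as a breach, so silent gaps are surfaced.
    """
    out = []
    for sla in SLA_TABLE:
        observed = observations.get((sla.pipeline, sla.metric))
        if observed is None or not sla.is_met(observed):
            out.append(sla)
    return out
```

Treating a missing measurement as a breach is a design choice here; a pipeline that stops reporting is usually worse than one that reports a bad number.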


Stage 2: Source Contracts

Goal: Schema versioning; CDC vs snapshot pulls; API rate limits.

Practices

  • Raw landing zone immutable; curated layers downstream
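
One way to make a source contract enforceable is to compare each incoming record against an expected schema before it leaves the raw landing zone. The sketch below assumes a flat record and a hypothetical `EXPECTED_SCHEMA`; real contracts usually live in a schema registry or a versioned file.

```python
# Hypothetical contract (version 1) for an "orders" source.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "created_at": str}


def schema_drift(record: dict, expected: dict = EXPECTED_SCHEMA) -> dict:
    """Report how a record deviates from the expected schema.

    Returns three sorted lists: fields that are missing, fields that were
    not in the contract, and fields present with the wrong Python type.
    """
    missing = sorted(k for k in expected if k not in record)
    unexpected = sorted(k for k in record if k not in expected)
    wrong_type = sorted(
        k for k, t in expected.items()
        if k in record and not isinstance(record[k], t)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}
```

A record with any non-empty list can be quarantined or can stop the run, depending on the failure policy chosen in Stage 1.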

Stage 3: Transforms & Idempotency

Goal: Deterministic transforms; upsert keys; partition strategy for rewinds.

Practices

  • Watermark progress for incremental loads
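
The upsert key and watermark ideas can be combined as follows. This is a sketch under stated assumptions: an in-memory SQLite table stands in for the sink, and the `orders` table, its columns, and the function names are hypothetical. It relies on SQLite's `ON CONFLICT ... DO UPDATE` clause (SQLite 3.24 or later).

```python
import sqlite3


def make_sink() -> sqlite3.Connection:
    """Create a throwaway sink whose primary key doubles as the upsert key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
    return conn


def load_incremental(conn, source_rows, watermark):
    """Upsert rows newer than the watermark and return the new watermark."""
    fresh = [r for r in source_rows if r["updated_at"] > watermark]
    with conn:  # one transaction: all rows commit or none do
        conn.executemany(
            "INSERT INTO orders (order_id, amount, updated_at) "
            "VALUES (:order_id, :amount, :updated_at) "
            "ON CONFLICT(order_id) DO UPDATE SET "
            "amount = excluded.amount, updated_at = excluded.updated_at",
            fresh,
        )
    # Advance only as far as the data actually loaded.
    return max((r["updated_at"] for r in fresh), default=watermark)
```

Replaying the same batch leaves the table unchanged, which is what makes at-least-once delivery safe. Note that the strict `>` comparison can miss late rows that share the watermark timestamp; a real loader may need an overlap window.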

Stage 4: Orchestration & Dependencies

Goal: Clear DAG; retry policy; backfill without double counting; SLA miss alerts.
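
Retries and backfills are only safe when a rerun replaces data instead of appending to it. The sketch below illustrates one common approach, overwriting a whole partition inside a transaction, together with a simple exponential-backoff retry helper. The `daily_sales` table and both function names are hypothetical; orchestrators such as Airflow or Dagster provide their own retry settings, which are not shown here.

```python
import sqlite3
import time


def overwrite_partition(conn, day, rows):
    """Replace one day's partition atomically, so reruns never double count."""
    with conn:  # DELETE and INSERT commit together or not at all
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, sku, qty) VALUES (?, ?, ?)",
            [(day, sku, qty) for sku, qty in rows],
        )


def run_with_retries(task, attempts=3, base_delay=0.01):
    """Call task(); on failure wait base_delay * 2**n, then retry.

    The last failure is re-raised so the orchestrator can alert on it.
    """
    for n in range(attempts):
        try:
            return task()
        except Exception:
            if n == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** n)
```

A backfill then becomes a loop over days that calls `overwrite_partition` for each one; running it twice yields the same totals.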


Stage 5: Quality & Monitoring

Goal: Data quality checks (null spikes, row bounds, referential checks); metrics on lag, duration, error rate.
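
The three checks named above (null spikes, row bounds, referential checks) can be sketched in plain Python. The column names, default thresholds, and failure labels are assumptions for illustration; dedicated tools exist for this, but the logic is the same.

```python
def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None (0.0 for no rows)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def run_checks(orders, customer_ids, min_rows=1, max_rows=1_000_000, max_null_rate=0.05):
    """Return the names of failed checks for a batch of order rows.

    Each order is assumed to be a dict with "customer_id" and "amount" keys.
    """
    failures = []
    # Row bounds: catches empty loads and runaway duplicates.
    if not (min_rows <= len(orders) <= max_rows):
        failures.append("row_count_out_of_bounds")
    # Null spike: catches upstream fields that silently stopped populating.
    if null_rate(orders, "amount") > max_null_rate:
        failures.append("amount_null_spike")
    # Referential check: every order must point at a known customer.
    orphans = {r["customer_id"] for r in orders} - set(customer_ids)
    if orphans:
        failures.append("orphan_customer_ids")
    return failures
```

An empty result means the batch passed; a non-empty one can feed the alerting and quarantine policy defined in Stage 1.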


Stage 6: Lineage & Operations

Goal: Column-level lineage where valuable; on-call runbook; ownership per pipeline.
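
Column-level lineage can start as a simple mapping from each column to the columns it is derived from. The sketch below walks that mapping back to the raw sources; the `LINEAGE` entries and column names are hypothetical, and a production setup would typically extract this from the transform code or a lineage tool.

```python
# Hypothetical lineage: column -> upstream columns it is derived from.
LINEAGE = {
    "mart.revenue.total": ["curated.orders.amount", "curated.orders.currency"],
    "curated.orders.amount": ["raw.orders.amount_cents"],
    "curated.orders.currency": ["raw.orders.currency"],
}


def upstream_sources(column, lineage=LINEAGE):
    """Return the root columns (those with no recorded parents) behind a column."""
    roots, stack, seen = set(), [column], set()
    while stack:
        col = stack.pop()
        if col in seen:  # guard against cycles and repeated work
            continue
        seen.add(col)
        parents = lineage.get(col, [])
        if not parents:
            roots.add(col)
        stack.extend(parents)
    return sorted(roots)
```

When a raw column changes, running the same traversal in reverse identifies which downstream columns, and therefore which pipeline owners, need to be notified.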


Final Review Checklist

  • [ ] SLAs and failure policy explicit
  • [ ] Source contracts and schema evolution path
  • [ ] Idempotent writes and checkpointing
  • [ ] Orchestration with retries and safe backfill
  • [ ] Data quality checks and alerts
  • [ ] Lineage and ownership documented

Tips for Effective Guidance

  • Track compute and storage costs separately, especially for jobs with large shuffles.
  • Pair with etl-design for batch patterns and message-queues for streaming handoffs.

Handling Deviations

  • Single-script pipelines: still document inputs, outputs, and schedule.

Version History

1 version total

  • v1.0.0 (current)
    2026-05-03 08:21 Safe
