Ingeniero de datos

Design and build scalable data pipelines, ETL/ELT systems, and data infrastructure. Use when designing data architectures, choosing between batch and streami...
Author: felix-antonio-sl
Category: Uncategorized · Source: clawhub · Version: v1.0.0 (1 version, #latest) · Key: not required
Stars: 0 · Downloads: 322 · Installs: 0

Overview

Senior Data Engineer

Production-grade data engineering: pipelines, modeling, quality, and DataOps.

Activation

Use this skill when the user asks to:

  • design a data pipeline (batch, streaming, or hybrid)
  • choose between Lambda and Kappa architecture, or batch vs streaming
  • build ETL/ELT with Airflow, Prefect, Dagster, dbt, or Spark
  • implement data quality checks or data contracts
  • model data (star schema, snowflake, SCD, Data Vault)
  • optimize a slow Spark job, DAG, or warehouse query
  • set up data observability, lineage, or incident response

Workflow

  1. Classify the request: pipeline | model | quality | optimize | architecture.
  2. Load the relevant reference:
    • batch/streaming patterns, Lambda vs Kappa, CDC → {baseDir}/references/data_pipeline_architecture.md
    • dimensional modeling, SCD, dbt, Data Vault → {baseDir}/references/data_modeling_patterns.md
    • data testing, contracts, CI/CD, observability → {baseDir}/references/dataops_best_practices.md
    • end-to-end workflow walkthroughs → {baseDir}/references/workflows.md
    • slow queries, DAG failures, Spark tuning → {baseDir}/references/troubleshooting.md
  3. Run the appropriate script when artifacts are provided:

```bash
# Generate pipeline orchestration config (airflow | prefect | dagster)
python {baseDir}/scripts/pipeline_orchestrator.py generate \
  --type airflow --source postgres --destination snowflake --schedule "0 5 * * *"

# Validate data quality (freshness, completeness, uniqueness, schema)
python {baseDir}/scripts/data_quality_validator.py validate \
  --input data/file.parquet --schema schemas/file.json \
  --checks freshness,completeness,uniqueness

# Analyze and optimize ETL performance
python {baseDir}/scripts/etl_performance_optimizer.py analyze \
  --query queries/aggregation.sql --engine spark --recommend
```

  4. Emit the artifact: pipeline config, dbt model, schema DDL, quality rules, or architecture diagram.
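
The classify-then-load steps of the workflow could be sketched as a small lookup. This is only an illustration: the `REFERENCES` table and the `reference_for` helper are hypothetical names, and the mapping from request class to reference file is an assumption inferred from the topic lists above (the skill itself maps topics, not classes, to files). The walkthrough file `workflows.md` is left out because it is not tied to a single class.

```python
# Hypothetical routing table: request class -> reference file.
# The class-to-file mapping is an assumption, not part of the skill.
REFERENCES = {
    "pipeline": "{baseDir}/references/data_pipeline_architecture.md",
    "architecture": "{baseDir}/references/data_pipeline_architecture.md",
    "model": "{baseDir}/references/data_modeling_patterns.md",
    "quality": "{baseDir}/references/dataops_best_practices.md",
    "optimize": "{baseDir}/references/troubleshooting.md",
}


def reference_for(request_class: str, base_dir: str) -> str:
    """Return the reference path for a classified request.

    Raises ValueError for a class outside the five named in the workflow.
    """
    key = request_class.strip().lower()
    if key not in REFERENCES:
        raise ValueError(f"unknown request class: {request_class!r}")
    # Substitute the {baseDir} placeholder with the skill's install directory.
    return REFERENCES[key].replace("{baseDir}", base_dir)
```

Rejecting unknown classes, rather than falling back to a default file, keeps a misclassified request visible instead of silently loading the wrong reference.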

Output Contract

  • Open with the pipeline classification and dominant bottleneck or design decision.
  • Emit one primary artifact per response (DAG, dbt model, schema, quality config).
  • For architecture decisions: state the trade-offs of each option before recommending.
  • Declare data loss risk explicitly when a pipeline design cannot guarantee exactly-once semantics.
  • Close with an observability recommendation (what to monitor and at what threshold).
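
One way to picture the ordering this contract asks for is a small response container. The sketch below is an assumption-laden illustration: `SkillResponse`, its field names, and the rendered labels are all hypothetical and do not come from the skill.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SkillResponse:
    """Hypothetical container mirroring the Output Contract bullets."""
    classification: str                   # pipeline | model | quality | optimize | architecture
    headline: str                         # dominant bottleneck or design decision
    artifact: str                         # the single primary artifact
    observability: str                    # what to monitor and at what threshold
    trade_offs: Optional[str] = None      # only for architecture decisions
    data_loss_risk: Optional[str] = None  # only when exactly-once cannot be guaranteed

    def render(self) -> str:
        # Classification and headline open the response.
        parts = [f"Classification: {self.classification}. {self.headline}"]
        # Trade-offs come before the recommendation/artifact.
        if self.trade_offs:
            parts.append(f"Trade-offs: {self.trade_offs}")
        parts.append(self.artifact)
        if self.data_loss_risk:
            parts.append(f"Data loss risk: {self.data_loss_risk}")
        # The observability recommendation always closes the response.
        parts.append(f"Observability: {self.observability}")
        return "\n\n".join(parts)
```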

Key Rules

  • Default to batch unless sub-minute latency is a stated requirement.
  • Default to dbt + warehouse compute for <1TB daily; recommend Spark only when justified by volume or complexity.
  • Every pipeline must declare: idempotency strategy, error handling, and dead-letter queue approach.
  • Data quality checks are non-optional — include them in every pipeline design.
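
As a rough illustration of what an idempotency strategy, error handling, and a dead-letter approach can look like together, here is a minimal sketch using SQLite from the Python standard library. The `orders` and `dead_letter` tables, the `load_batch` function, and the row fields are hypothetical; a real pipeline would use its own warehouse and keys.

```python
import sqlite3


def load_batch(conn: sqlite3.Connection, rows: list) -> tuple:
    """Load rows idempotently; route invalid rows to a dead-letter table.

    Idempotency strategy (assumed for this sketch): upsert keyed on order_id,
    so replaying the same batch leaves the table unchanged.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id TEXT PRIMARY KEY, amount REAL NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dead_letter "
        "(payload TEXT PRIMARY KEY, reason TEXT)"
    )
    loaded = rejected = 0
    for row in rows:
        try:
            order_id = str(row["order_id"])
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError) as exc:
            # Dead-letter: keep the bad payload and the failure reason for later review.
            conn.execute(
                "INSERT OR REPLACE INTO dead_letter VALUES (?, ?)",
                (repr(row), type(exc).__name__),
            )
            rejected += 1
            continue
        # Upsert by primary key, so a replay overwrites instead of duplicating.
        conn.execute(
            "INSERT OR REPLACE INTO orders VALUES (?, ?)", (order_id, amount)
        )
        loaded += 1
    conn.commit()
    return loaded, rejected
```

Calling `load_batch` twice with the same batch on the same connection should leave the row counts in both tables unchanged after the second call, which is the property the idempotency rule is after.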

Guardrails

  • Do not generate application-layer code (APIs, web services) — stay within data pipeline scope.
  • Do not recommend streaming when batch satisfies the latency requirement; streaming adds operational cost.
  • Flag missing idempotency as a HIGH issue; flag missing data quality checks as MEDIUM.
  • For cross-engine migration, refer to migration-architect.
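
The severity rule in the guardrails could be applied mechanically to a design description. The sketch below assumes a design is represented as a plain dict with hypothetical keys `idempotency_strategy` and `quality_checks`; neither the representation nor the `flag_issues` name comes from the skill.

```python
from typing import List, Tuple


def flag_issues(design: dict) -> List[Tuple[str, str]]:
    """Return (severity, message) pairs for the two gaps the guardrails rate."""
    issues = []
    # A missing or empty idempotency strategy is rated HIGH.
    if not design.get("idempotency_strategy"):
        issues.append(("HIGH", "missing idempotency strategy"))
    # Missing or empty data quality checks are rated MEDIUM.
    if not design.get("quality_checks"):
        issues.append(("MEDIUM", "missing data quality checks"))
    return issues
```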

Self Check

Before emitting any artifact, verify:

  • idempotency strategy is stated;
  • error handling and retry logic are addressed;
  • data quality checks are included or explicitly deferred with a reason;
  • the chosen architecture (batch vs stream) matches the stated latency requirement.
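
The last check, matching the architecture to the stated latency, can be made concrete with the batch-by-default rule from Key Rules. This is a sketch under one stated assumption: "sub-minute" is read as strictly under 60 seconds. The function names are hypothetical.

```python
from typing import Optional

SUB_MINUTE_SECONDS = 60  # assumption: "sub-minute" means strictly under 60 s


def recommended_mode(required_latency_seconds: Optional[float]) -> str:
    """Batch by default; streaming only for a stated sub-minute requirement.

    None means no latency requirement was stated.
    """
    if (
        required_latency_seconds is not None
        and required_latency_seconds < SUB_MINUTE_SECONDS
    ):
        return "streaming"
    return "batch"


def architecture_matches(chosen: str, required_latency_seconds: Optional[float]) -> bool:
    """True when the chosen mode is the one the latency requirement calls for."""
    return chosen == recommended_mode(required_latency_seconds)
```

Under this reading, choosing streaming for an hourly requirement fails the check, which matches the guardrail against recommending streaming when batch satisfies the latency requirement.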

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-07 17:57 · Safe

Security Check

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
