Senior Data Engineer

Production-grade data engineering skill for building scalable, reliable data systems.

Trigger Phrases

Activate this skill when you see:

Pipeline Design:

"Design a data pipeline for..."
"Build an ETL/ELT process..."
"How should I ingest data from..."
"Set up data extraction from..."

Architecture:

"Should I use batch or streaming?"
"Lambda vs Kappa architecture"
"How to handle late-arriving data"
"Design a data lakehouse"

Data Modeling:

"Create a dimensional model..."
"Star schema vs snowflake"
"Implement slowly changing dimensions"
"Design a data vault"

Data Quality:

"Add data validation to..."
"Set up data quality checks"
"Monitor data freshness"
"Implement data contracts"

Performance:

"Optimize this Spark job"
"Query is running slow"
"Reduce pipeline execution time"
"Tune Airflow DAG"

Quick Start

Core Tools

# Generate pipeline orchestration config
python scripts/pipeline_orchestrator.py generate \
  --type airflow \
  --source postgres \
  --destination snowflake \
  --schedule "0 5 * * *"

# Validate data quality
python scripts/data_quality_validator.py validate \
  --input data/sales.parquet \
  --schema schemas/sales.json \
  --checks freshness,completeness,uniqueness

# Optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
  --query queries/daily_aggregation.sql \
  --engine spark \
  --recommend
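
The internals of data_quality_validator.py are not shown here, so the following is only a rough sketch of what checks named freshness, completeness, and uniqueness commonly compute. The function names, the row format (a list of dicts), and the sample data are all assumptions made for illustration, not the script's actual behavior.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows, column):
    """Fraction of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def uniqueness(rows, column):
    """Fraction of non-null values in `column` that are distinct."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def is_fresh(rows, column, max_age, now=None):
    """True if the newest timestamp in `column` is within `max_age` of `now`."""
    now = now or datetime.now(timezone.utc)
    timestamps = [row[column] for row in rows if row.get(column) is not None]
    if not timestamps:
        return False
    return now - max(timestamps) <= max_age


# Hypothetical sample rows standing in for data/sales.parquet.
NOW = datetime(2026, 3, 28, 12, 0, tzinfo=timezone.utc)
SAMPLE_ROWS = [
    {"order_id": 1, "amount": 10.0, "updated_at": NOW - timedelta(hours=2)},
    {"order_id": 2, "amount": None, "updated_at": NOW - timedelta(hours=1)},
    {"order_id": 2, "amount": 7.5, "updated_at": NOW - timedelta(hours=30)},
    {"order_id": 4, "amount": 3.0, "updated_at": NOW - timedelta(hours=5)},
]
```

Each check returns a plain number or boolean, so a pipeline can compare the result against a threshold and fail the run when it is not met.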

Workflows

→ See references/workflows.md for details

Architecture Decision Framework

Use this framework to choose the right approach for your data pipeline.

Batch vs Streaming

Criteria	Batch	Streaming
----------	-------	-----------
Latency requirement	Hours to days	Seconds to minutes
Data volume	Large historical datasets	Continuous event streams
Processing complexity	Complex transformations, ML	Simple aggregations, filtering
Cost sensitivity	More cost-effective	Higher infrastructure cost
Error handling	Easier to reprocess	Requires careful design

Decision Tree:

Is real-time insight required?
├── Yes → Use streaming
│   └── Is exactly-once semantics needed?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Is data volume > 1TB daily?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute
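
The decision tree above can be written down as a small function, which is handy for design reviews or for documenting why a stack was chosen. This is a minimal sketch: the function name and parameters are hypothetical, and the 1 TB threshold is taken directly from the tree.

```python
def recommend_stack(real_time_required, exactly_once=False, daily_volume_tb=0.0):
    """Walk the batch-vs-streaming decision tree and return a recommendation."""
    if real_time_required:
        # Streaming branch: the only further question is delivery semantics.
        if exactly_once:
            return "streaming: Kafka + Flink/Spark Structured Streaming"
        return "streaming: Kafka + consumer groups"
    # Batch branch: the choice depends on daily data volume.
    if daily_volume_tb > 1.0:
        return "batch: Spark/Databricks"
    return "batch: dbt + warehouse compute"


if __name__ == "__main__":
    print(recommend_stack(real_time_required=False, daily_volume_tb=0.2))
```

Real decisions also weigh team skills, cost, and existing infrastructure, so the output is best treated as a starting point.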

Lambda vs Kappa Architecture

Aspect	Lambda	Kappa
--------	--------	-------
Complexity	Two codebases (batch + stream)	Single codebase
Maintenance	Higher (sync batch/stream logic)	Lower
Reprocessing	Native batch layer	Replay from source
Use case	ML training + real-time serving	Pure event-driven

When to choose Lambda:

Need to train ML models on historical data
Complex batch transformations not feasible in streaming
Existing batch infrastructure

When to choose Kappa:

Event-sourced architecture
All processing can be expressed as stream operations
Starting fresh without legacy systems
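
To make "single codebase" and "replay from source" concrete, here is a toy sketch of the Kappa idea: one processing function serves both live and reprocessed data, and a rebuild is just a re-read of the log from an earlier offset. The in-memory list stands in for a Kafka topic, and every name below is hypothetical.

```python
from collections import defaultdict

# Hypothetical append-only event log standing in for a Kafka topic.
EVENT_LOG = [
    {"offset": 0, "user": "a", "amount": 5},
    {"offset": 1, "user": "b", "amount": 3},
    {"offset": 2, "user": "a", "amount": 2},
]


def total_per_user(events):
    """The single piece of processing logic, used for live and replayed data."""
    totals = defaultdict(int)
    for event in events:
        totals[event["user"]] += event["amount"]
    return dict(totals)


def replay(log, from_offset=0):
    """Re-read the log from a given offset, as a Kappa-style reprocess would."""
    return [event for event in log if event["offset"] >= from_offset]


# Rebuild the full view from the start of the log.
full_view = total_per_user(replay(EVENT_LOG))
# Recompute only from a later offset.
recent_view = total_per_user(replay(EVENT_LOG, from_offset=1))
```

In a Lambda setup the same aggregation would typically exist twice, once in the batch layer and once in the speed layer, which is the maintenance cost the table refers to.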

Data Warehouse vs Data Lakehouse

Feature	Warehouse (Snowflake/BigQuery)	Lakehouse (Delta/Iceberg)
---------	-------------------------------	---------------------------
Best for	BI, SQL analytics	ML, unstructured data
Storage cost	Higher (proprietary format)	Lower (open formats)
Flexibility	Schema-on-write	Schema enforcement with evolution; schema-on-read for raw files
Performance	Excellent for SQL	Good, improving
Ecosystem	Mature BI tools	Growing ML tooling

Tech Stack

Category	Technologies
----------	--------------
Languages	Python, SQL, Scala
Orchestration	Airflow, Prefect, Dagster
Transformation	dbt, Spark, Flink
Streaming	Kafka, Kinesis, Pub/Sub
Storage	S3, GCS, Delta Lake, Iceberg
Warehouses	Snowflake, BigQuery, Redshift, Databricks
Quality	Great Expectations, dbt tests, Monte Carlo
Monitoring	Prometheus, Grafana, Datadog

Reference Documentation

1. Data Pipeline Architecture

See references/data_pipeline_architecture.md for:

Lambda vs Kappa architecture patterns
Batch processing with Spark and Airflow
Stream processing with Kafka and Flink
Exactly-once semantics implementation
Error handling and dead letter queues
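
As a quick illustration of the dead letter queue topic listed above, the sketch below retries a failing record a fixed number of times and then sets it aside instead of stopping the pipeline. It is not taken from the reference file: the function names, the retry count, and the in-memory list used as the queue are assumptions.

```python
def process_with_dlq(records, handler, max_attempts=3):
    """Apply `handler` to each record; park records that keep failing."""
    processed, dead_letters = [], []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(record))
                break
            except Exception as exc:  # broad on purpose for this sketch
                if attempt == max_attempts:
                    # Keep the record and the error so it can be inspected later.
                    dead_letters.append(
                        {"record": record, "error": repr(exc), "attempts": attempt}
                    )
    return processed, dead_letters


def parse_amount(record):
    """Hypothetical handler: fails on records with a non-numeric amount."""
    return {"id": record["id"], "amount": float(record["amount"])}


RECORDS = [
    {"id": 1, "amount": "9.5"},
    {"id": 2, "amount": "n/a"},
    {"id": 3, "amount": "4"},
]
GOOD, DEAD = process_with_dlq(RECORDS, parse_amount)
```

In production the dead letters would usually go to a separate topic or table, with alerting on its size, rather than to a Python list.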

2. Data Modeling Patterns

See references/data_modeling_patterns.md for:

Dimensional modeling (Star/Snowflake)
Slowly Changing Dimensions (SCD Types 1-6)
Data Vault modeling
dbt best practices
Partitioning and clustering
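
For readers unfamiliar with slowly changing dimensions, the following is a minimal in-memory sketch of SCD Type 2: a changed attribute closes the current row and appends a new current row, so history is preserved. The column names (valid_from, valid_to, is_current) and the function signature are assumptions, and a real implementation would normally be a MERGE statement or a dbt snapshot.

```python
from datetime import date


def apply_scd2(dimension, incoming, key, tracked, today):
    """Return a new dimension list with SCD Type 2 changes applied."""
    result = [dict(row) for row in dimension]  # copy so the input is untouched
    current = {row[key]: row for row in result if row["is_current"]}
    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[col] == new[col] for col in tracked):
            continue  # no change in tracked attributes
        if old is not None:
            old["valid_to"] = today  # close the previous version
            old["is_current"] = False
        result.append(
            {**new, "valid_from": today, "valid_to": None, "is_current": True}
        )
        current[new[key]] = result[-1]
    return result


# Hypothetical customer dimension and incoming batch.
DIM = [
    {"customer_id": 1, "city": "Oslo", "valid_from": date(2025, 1, 1),
     "valid_to": None, "is_current": True},
]
INCOMING = [{"customer_id": 1, "city": "Bergen"}, {"customer_id": 2, "city": "Rome"}]
UPDATED = apply_scd2(DIM, INCOMING, key="customer_id", tracked=["city"],
                     today=date(2026, 3, 28))
```

After the call, customer 1 has two rows (the closed Oslo row and the current Bergen row) and customer 2 has one new current row.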

3. DataOps Best Practices

See references/dataops_best_practices.md for:

Data testing frameworks
Data contracts and schema validation
CI/CD for data pipelines
Observability and lineage
Incident response

Troubleshooting

→ See references/troubleshooting.md for details

Version History

2 versions in total

v2.1.1 (current): 2026-03-28 12:21
v1.0.0: 2026-03-11 09:34

Security Checks

Tencent Cloud Security (Keen): safe, no risk
Tencent Cloud Security (Sanbu): safe, no risk
