Overview

Big Data Management and Applications

Inputs to collect

Domain context: Is this for learning/education, professional work, or project implementation?
Specific area: Does the user focus on data collection, storage, processing, analysis, or application?
Tech stack preference: Any specific tools or frameworks the user prefers (e.g., Hadoop, Spark, Flink)?
Problem type: Is this a theoretical question, practical implementation, or solution design?

Procedure

Core Knowledge Areas

1. Data Collection and Integration

Real-time data collection: Flume, Kafka Connect, Logstash
Batch data ingestion: Sqoop, DataX, Kafka
Data formats: JSON, CSV, Parquet, ORC, Avro
Data validation and quality checks
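
As an illustration of the validation step, the following is a minimal sketch of record-level checks applied at ingestion time. The schema (`user_id`, `event_type`, `ts`) and the helper names are hypothetical and chosen only for this example; a real pipeline would take its schema from the source system or a schema registry.

```python
from typing import Any

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS: dict[str, type] = {"user_id": str, "event_type": str, "ts": int}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty list means valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record or record[field] is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    # Example range check: an epoch timestamp should not be negative.
    if isinstance(record.get("ts"), int) and record["ts"] < 0:
        problems.append("negative timestamp")
    return problems


def split_valid_invalid(records):
    """Route records into a clean list and a quarantine list with reasons."""
    valid, quarantined = [], []
    for record in records:
        problems = validate_record(record)
        if problems:
            quarantined.append((record, problems))
        else:
            valid.append(record)
    return valid, quarantined
```

Keeping rejected records together with their reasons, rather than dropping them silently, makes it possible to audit data quality later.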

2. Storage Architecture

Distributed file systems: HDFS, Ceph
Data lakes: Delta Lake, Iceberg, Hudi
NoSQL databases: HBase, MongoDB, Cassandra
Time-series databases: InfluxDB, TimescaleDB
Data warehouse: Hive, ClickHouse, StarRocks, Doris
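
Whichever storage layer is chosen, files on HDFS or object storage are commonly laid out in Hive-style `key=value` partition directories so that query engines can prune by date. The sketch below shows one possible layout; the base path, the table name, and the `dt`/`hour` partition keys are assumptions for illustration, not a required convention.

```python
from datetime import datetime, timezone


def partition_path(base: str, table: str, event_time: datetime) -> str:
    """Build a Hive-style partition directory (dt=YYYY-MM-DD/hour=HH) in UTC."""
    if event_time.tzinfo is None:
        # Assumption for this sketch: naive timestamps are already in UTC.
        event_time = event_time.replace(tzinfo=timezone.utc)
    utc = event_time.astimezone(timezone.utc)
    return f"{base.rstrip('/')}/{table}/dt={utc:%Y-%m-%d}/hour={utc:%H}"
```

For example, `partition_path("hdfs://nn/warehouse/", "events", datetime(2024, 1, 2, 5, 30))` yields `hdfs://nn/warehouse/events/dt=2024-01-02/hour=05`.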

3. Processing Frameworks

Batch processing: MapReduce, Spark SQL, Flink Batch
Stream processing: Kafka Streams, Flink, Spark Streaming, Storm
ETL pipeline orchestration: Airflow, DolphinScheduler, Azkaban
Data transformation: Spark DataFrame, Flink Table API
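
To make the stream-processing idea concrete, here is a minimal plain-Python sketch of a tumbling (fixed, non-overlapping) window count. It only illustrates the windowing concept that engines such as Flink or Spark provide natively; it is not their API, and the `(timestamp, key)` event shape is an assumption for this example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window_start, key) for fixed, non-overlapping windows.

    events: iterable of (epoch_seconds, key) tuples.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Align the timestamp down to the start of its window.
        window_start = ts - (ts % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)
```

A real engine additionally handles late and out-of-order events, state checkpointing, and parallelism, none of which this sketch attempts.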

4. Analysis and Computing

SQL engines: Presto, Trino, Hive LLAP, Spark Thrift Server
OLAP engines: ClickHouse, Druid, Kylin, Doris
Machine learning: Spark MLlib, XGBoost on Spark, TensorFlow on Spark
Graph processing: GraphX, Neo4j, Apache TinkerPop (Gremlin)

5. Data Governance

Data catalog: Apache Atlas, DataHub, OpenMetadata
Data lineage: Apache Atlas, DataHub, OpenLineage
Data quality: Apache Griffin, Deequ, Great Expectations, Delta Lake schema enforcement
Data security: Ranger, Sentry (retired to the Apache Attic), column-level encryption
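
Data quality tools typically report column-level metrics and compare them against thresholds. The following is a minimal sketch of two such metrics in plain Python. The function names and default thresholds are hypothetical and do not reproduce the API of any of the tools listed above.

```python
def completeness(values):
    """Fraction of values that are not None (1.0 means nothing is missing)."""
    values = list(values)
    if not values:
        return 1.0
    return sum(v is not None for v in values) / len(values)


def distinctness(values):
    """Distinct non-null values divided by the number of non-null values."""
    present = [v for v in values if v is not None]
    if not present:
        return 1.0
    return len(set(present)) / len(present)


def check_column(values, min_completeness=0.99, min_distinctness=0.0):
    """Evaluate one column against illustrative thresholds."""
    values = list(values)
    result = {
        "completeness": completeness(values),
        "distinctness": distinctness(values),
    }
    result["passed"] = (
        result["completeness"] >= min_completeness
        and result["distinctness"] >= min_distinctness
    )
    return result
```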

6. Practical Application Scenarios

Real-time data dashboard and monitoring
User behavior analysis and recommendation systems
Risk control and fraud detection
Data assets and monetization
Business intelligence and reporting

Solution Design Framework

Assess requirements

Data volume, velocity, variety assessment
Latency requirements (real-time vs batch)
Analytical complexity needs

Architecture selection

Lambda architecture vs Kappa architecture
Data mesh vs traditional data warehouse
Cloud-native vs on-premise considerations

Technology stack recommendation

Match specific requirements to appropriate tools
Consider team expertise and learning curve
Evaluate cost and operational complexity
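
One way to make the recommendation step explicit is a weighted scoring matrix over the criteria above. The sketch below is only an illustration: the criteria names, weights, and scores are placeholders to be filled in per project, not an assessment of any real tool.

```python
def rank_options(scores, weights):
    """Rank options by weighted average score, highest first.

    scores:  {option: {criterion: score on a 1-5 scale}}
    weights: {criterion: relative weight}
    """
    total_weight = sum(weights.values())
    ranked = []
    for option, by_criterion in scores.items():
        weighted = sum(by_criterion[c] * w for c, w in weights.items()) / total_weight
        ranked.append((option, round(weighted, 2)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Placeholder inputs for illustration only.
example_weights = {"requirement_fit": 5, "team_expertise": 3, "operating_cost": 2}
example_scores = {
    "option_a": {"requirement_fit": 5, "team_expertise": 2, "operating_cost": 3},
    "option_b": {"requirement_fit": 4, "team_expertise": 4, "operating_cost": 4},
}
```

With these placeholder numbers, `rank_options(example_scores, example_weights)` places `option_b` (4.0) ahead of `option_a` (3.7), which shows how team expertise and cost can outweigh a slightly better functional fit.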

Implementation roadmap

Quick wins vs long-term architecture
Migration strategy from legacy systems
Performance tuning and optimization

Output contract

Provide:

Clear, actionable guidance or solution design
Technology recommendations with rationale
Code examples for implementation when needed
Architecture diagrams in text format when helpful
Comparison of alternatives when relevant
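
For the text-format architecture diagrams, a simple boxes-and-arrows line is often enough. The helper below is a hypothetical sketch of one way to render it; the `(role, tool)` stage structure is an assumption for this example.

```python
def render_pipeline(stages):
    """Render (role, tool) stages as a one-line boxes-and-arrows diagram."""
    return " --> ".join(f"[{role}: {tool}]" for role, tool in stages)
```

For example, `render_pipeline([("ingestion", "Kafka"), ("processing", "Flink"), ("OLAP", "ClickHouse")])` produces `[ingestion: Kafka] --> [processing: Flink] --> [OLAP: ClickHouse]`.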

Failure handling

For highly specific technical questions outside current knowledge: state the limitation explicitly and provide best-effort guidance
For emerging technologies not in training data: suggest official documentation and community resources
When user needs hands-on implementation: recommend specific tutorials or documentation

Examples

Example 1: Real-time data pipeline design

Input: "Design a real-time analytics system that processes 1 billion records per day"

Output: Provide architecture covering Kafka for ingestion, Flink for processing, ClickHouse for real-time OLAP, with data flow diagrams and key configurations
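
A useful first step for this kind of request is back-of-the-envelope sizing. The sketch below converts the daily record count from the input into average and peak rates. The average record size (1,000 bytes) and the peak factor (3x) are assumptions that should be replaced with measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_load(records_per_day, avg_record_bytes=1_000, peak_factor=3.0):
    """Rough throughput and volume estimate for capacity planning."""
    avg_rps = records_per_day / SECONDS_PER_DAY
    return {
        "avg_records_per_sec": avg_rps,
        # Assumed multiplier: traffic is rarely uniform across the day.
        "peak_records_per_sec": avg_rps * peak_factor,
        # Decimal gigabytes of raw data per day, before replication or compression.
        "daily_gb": records_per_day * avg_record_bytes / 1e9,
    }


estimate = estimate_load(1_000_000_000)
```

Under these assumptions, 1 billion records per day is about 11,574 records per second on average, roughly 34,700 per second at peak, and about 1,000 GB of raw data per day.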

Example 2: Data lake migration

Input: "How do I migrate a traditional data warehouse to a modern data lake architecture?"

Output: Provide phased migration plan, tool selection rationale (Iceberg vs Hudi vs Delta Lake), and data governance recommendations

Example 3: Performance optimization

Input: "My Spark job runs very slowly. How do I troubleshoot and optimize it?"

Output: Provide troubleshooting checklist: shuffle optimization, partition tuning, memory configuration, data skew handling, with specific parameter recommendations
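
For the data skew item in the checklist, the idea can be shown without a cluster. The plain-Python sketch below measures how dominant the hottest key is and shows key salting, a common mitigation in which a hot key is spread over several sub-keys and aggregated in two stages. The function names and the `#` separator are hypothetical; in Spark the same logic would be expressed with DataFrame operations.

```python
import random
from collections import Counter
from statistics import median


def skew_ratio(keys):
    """Largest per-key count divided by the median per-key count."""
    counts = Counter(keys)
    return max(counts.values()) / median(counts.values())


def salt_key(key, num_salts, rng=random):
    """Spread one hot key over num_salts sub-keys, e.g. 'hot' -> 'hot#2'."""
    return f"{key}#{rng.randrange(num_salts)}"


def unsalt_key(salted_key):
    """Recover the original key for the second aggregation stage."""
    return salted_key.rsplit("#", 1)[0]
```

A ratio far above 1 suggests that a few tasks will process most of the data. On Spark 3.x, it is usually worth checking `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled` before salting by hand, and reviewing `spark.sql.shuffle.partitions` for partition tuning.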

Reference Resources

For detailed implementation guides, refer to:

Apache official documentation (Hadoop, Spark, Flink, Kafka)
Cloud provider big data services (AWS EMR, Azure Databricks, GCP Dataproc)
Open source project GitHub repositories and best practices
Industry case studies and architecture patterns

Version history

1 version in total

v1.0.0 Initial release (current)

2026-06-02 19:07