You need to determine whether a system's data access patterns require a transactional database (OLTP), an analytic data warehouse (OLAP), or both — and then recommend the right architecture for each workload type.
Typical triggers:
Before starting: if the user is asking specifically about which storage engine (B-tree vs LSM-tree) to use for an OLTP system, use storage-engine-selector instead. If the user is asking about ETL pipeline architecture in detail, use batch-pipeline-designer after this skill.
fact_ or dim_ table prefixes, heavy normalization vs wide flat tables.sql files — presence of GROUP BY, SUM, COUNT, AVG, JOIN chains, WHERE year = predicates is an OLAP signalfind_by_id, save, update patterns = OLTP. Analytical query builders = OLAP.SUFFICIENT: query patterns + who runs them + write pattern are known
PROCEED WITH DEFAULTS: query patterns known but volume/latency not quantified
MUST ASK: query patterns are unknown
ACTION: Evaluate the described workload against the 6-dimension comparison table (Table 3-1 from the book). Assign OLTP or OLAP for each dimension. Count the majority.
WHY: OLTP and OLAP workloads differ not in degree but in kind. A single database engine is optimized for one or the other — not both. Row-oriented storage excels at OLTP (low-latency point lookups, frequent small writes) but is inefficient for OLAP (must load entire rows even when only 3 of 100 columns are needed). Column-oriented storage compresses well and accelerates aggregate scans but makes individual row inserts expensive. Classifying on all 6 dimensions prevents misclassification from a single misleading signal (e.g., a small analytic workload might have fast query times — that alone doesn't make it OLTP).
| Dimension | OLTP Signal | OLAP Signal |
|---|---|---|
| ----------- | ------------- | ------------- |
| Main read pattern | Small number of records fetched by key (point lookup) | Aggregate over large number of records (scan + compute) |
| Main write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream ingestion |
| Primarily used by | End user / customer via web or mobile application | Internal analyst, for business intelligence and decision support |
| What data represents | Latest state (current point in time) | History of events that happened over time |
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
| Bottleneck | Disk seek time (index lookup cost) | Disk bandwidth (volume of data scanned) |
Scoring:
Output: Classification label + dimension-by-dimension score table.
ACTION: Based on classification, select the recommended architecture path and proceed to the corresponding sub-step.
WHY: The routing decision is binary at its core — OLTP and OLAP systems have fundamentally different storage engine designs. Running analytic queries on an OLTP database does not just perform poorly; it actively harms the concurrent transactions by consuming disk I/O and CPU that the storage engine's indexes cannot help with. Enterprises learned this in the late 1980s and began extracting analytic workloads into separate data warehouses specifically to protect OLTP availability. The routing step enforces this architectural boundary.
ACTION: For OLTP workloads, confirm the storage engine class and index strategy. This step is a quick check — detailed OLTP guidance lives in storage-engine-selector.
WHY: OLTP systems are optimized for individual record access: fast point reads (via B-tree or LSM-tree indexes), low-latency writes, and high concurrent user support. The storage engine should be selected to match the write-to-read ratio and durability requirements. Key decisions are: log-structured (LSM-tree) for write-heavy workloads, update-in-place (B-tree) for read-heavy workloads with mixed updates.
Guidance:
transaction-isolation-selector)If the user also needs analytics → proceed to Step 3B after this step, and plan ETL from OLTP to OLAP.
ACTION: Design the data warehouse schema using the star or snowflake schema pattern. Identify the central fact table and its dimension tables.
WHY: Data warehouses use a different schema paradigm than operational databases. Operational databases are normalized to minimize write anomalies. But normalization hurts analytic queries — analysts need to join many tables, and every join adds latency at scan scale. The star schema deliberately denormalizes into one wide fact table (every business event as a row) surrounded by dimension tables (the who, what, where, when, why of each event). This layout makes the most common analytic queries simple: filter dimensions, join to fact table, aggregate. Column-oriented storage then further optimizes this by reading only the few columns each query actually touches rather than loading full rows.
Sub-step 3B-1: Identify the fact table
The fact table is the center of the star. Each row represents one business event — one sale, one page view, one sensor reading, one log entry. Key characteristics:
Ask: "What is the core business event you are tracking? What gets recorded every time that event occurs?"
Sub-step 3B-2: Identify dimension tables
Dimension tables answer the "who, what, where, when, how, why" of each event in the fact table. Key characteristics:
dim_date) — this allows encoding attributes like is_holiday, weekday, fiscal_quarter that enable time-based filtering without date arithmeticStandard dimension table set for a retail fact table:
| Dimension | Purpose | Example columns |
|---|---|---|
| ----------- | --------- | ----------------- |
dim_date | Time-based filtering and grouping | date_key, year, month, day, weekday, is_holiday |
dim_product | Product attributes for filtering/grouping | product_sk, sku, description, brand, category |
dim_store | Store/location attributes | store_sk, state, city, store_type, open_date |
dim_customer | Customer attributes | customer_sk, name, date_of_birth, segment |
dim_promotion | Promotion/campaign attributes | promotion_sk, name, ad_type, coupon_type |
Sub-step 3B-3: Choose star vs snowflake schema
| Schema | Structure | When to use |
|---|---|---|
| -------- | ----------- | ------------- |
| Star schema | Dimension tables are flat (denormalized) | Preferred for analyst usability — simpler SQL, fewer joins, faster iteration |
| Snowflake schema | Dimensions further normalized into sub-dimensions (e.g., dim_brand split out from dim_product) | When storage is a constraint or dimension data integrity is critical; harder for analysts to query |
Default to star schema unless there is a specific normalization requirement. Analysts work with the schema daily — simplicity compounds.
Sub-step 3B-4: Apply column-oriented storage
WHY column-oriented storage matters for OLAP: A typical analytic query accesses 4-5 columns out of 100+ in the fact table. Row-oriented storage must load every column of every matching row from disk — paying I/O cost for 95+ columns that are immediately discarded. Column-oriented storage keeps each column in a separate file. The query only reads the files for the columns it needs. For a 100-column fact table, this is a 20x reduction in I/O for a 5-column query.
Additional benefits of column storage:
product_sk repeating across millions of purchases). Run-length encoding and bitmap encoding compress these dramatically — often 10x or more.date_key) enables range scans that skip large portions of data. Store multiple sort orders across replicas for different query patterns.Column storage decision criteria:
Use column-oriented storage when:
Defer column storage (use row-oriented) when:
Sub-step 3B-5: Plan the ETL pipeline
WHY a separate ETL pipeline: OLTP databases must remain highly available for user-facing transactions. Business analysts running heavy scans on the OLTP database consume disk I/O and table locks that starve concurrent transactions. The solution is to export data on a schedule (periodic dump or continuous stream of change events) from OLTP systems, transform it into the warehouse schema, and load it into a separate read-only warehouse. This is Extract–Transform–Load (ETL).
ETL design decisions:
| Decision | Options | Guidance |
|---|---|---|
| ---------- | --------- | ---------- |
| Extraction method | Periodic full dump, incremental delta export, change data capture (CDC) stream | CDC is lowest-latency; full dumps are simplest but expensive at scale |
| Transformation | In the pipeline (ETL) or in the warehouse after loading (ELT) | ELT preferred for modern cloud warehouses with strong SQL compute (Snowflake, BigQuery, Redshift) |
| Load frequency | Nightly batch, hourly micro-batch, near-real-time | Match to analyst freshness requirements — nightly is often sufficient |
| Schema management | Separate warehouse schema from OLTP schema | Never query OLTP tables directly from analyst tools |
For detailed ETL/batch pipeline architecture, use batch-pipeline-designer.
ACTION: When the workload shows both OLTP and OLAP signals (typically 3-4 signals on each side), design a two-tier architecture that separates operational and analytical processing.
WHY: Hybrid Transactional/Analytical Processing (HTAP) is the most common real-world scenario — an application database that also needs to support reporting. Running both on the same database is the default, and the default fails at scale: analytic queries lock rows, consume disk bandwidth, and compete with user-facing transactions for buffer pool space. The solution is always architectural separation: one system optimized for transactions, one for analytics, with a data pipeline connecting them. The separation can be light (read replica + materialized views for small scale) or full (dedicated warehouse at large scale).
Separation options by scale:
| Scale | Approach | Tooling |
|---|---|---|
| ------- | ---------- | --------- |
| Small (GB, few analysts) | Read replica + materialized views | PostgreSQL read replica, scheduled view refresh |
| Medium (TB, regular reporting) | Lightweight warehouse or columnar extension | DuckDB, ClickHouse, or PostgreSQL + TimescaleDB |
| Large (multi-TB, BI team) | Dedicated data warehouse + ETL pipeline | Snowflake, BigQuery, Redshift, Apache Hive |
| Very large (PB, real-time analytics) | Streaming pipeline + columnar store | Kafka CDC + Apache Pinot, Druid, or Flink + Iceberg |
Decision rule: If OLTP system availability is critical (SLA requirements), always separate — even at small scale. Analytic queries running on a production database are an availability risk that compounds as data grows.
ACTION: Write a concise architecture decision document capturing the classification, recommendation, and key trade-offs.
WHY: OLTP/OLAP decisions have downstream consequences for schema design, team structure (DBA vs data engineer), tooling procurement, and pipeline work. Documenting the reasoning prevents the decision from being relitigated as systems grow, and gives future engineers the context for why the architecture is the way it is.
Use the output template below.
# Workload Classification: {System Name}
## Classification Result: {OLTP / OLAP / Hybrid}
### 6-Dimension Scorecard
| Dimension | Your Workload | Signal |
|-----------|--------------|--------|
| Main read pattern | {description} | OLTP / OLAP |
| Main write pattern | {description} | OLTP / OLAP |
| Primary users | {description} | OLTP / OLAP |
| What data represents | {description} | OLTP / OLAP |
| Dataset size | {estimate} | OLTP / OLAP |
| Primary bottleneck | {seek time / bandwidth} | OLTP / OLAP |
**Score: {X} OLTP / {Y} OLAP → Classification: {label}**
## Architecture Recommendation
### {OLTP / OLAP / Both}
**Recommended architecture:** {description}
**Key decisions:**
- Storage engine: {row-oriented / column-oriented / both}
- Schema: {normalized / star schema / snowflake schema}
- Data pipeline: {none needed / ETL nightly / CDC streaming}
- Separation strategy: {single DB / read replica / dedicated warehouse}
### Schema Design (if OLAP)
**Fact table:** `fact_{event}` — one row per {event type}
- Metrics: {list measurable columns}
- Foreign keys: {list dimension references}
**Dimension tables:**
- `dim_date` — {key time attributes}
- `dim_{entity}` — {key descriptive attributes}
- {additional dimensions}
**Schema choice:** Star / Snowflake — {reason}
### Column Storage Decision
{Apply / Defer} column-oriented storage — {reasoning}
### ETL Plan (if applicable)
- Extraction: {method}
- Transformation: {ETL in pipeline / ELT in warehouse}
- Load frequency: {schedule}
- Tools: {recommended}
## Trade-offs
**What we gain:** {performance, separation, scalability}
**What we accept:** {operational complexity, pipeline latency, cost}
## Next Steps
1. {First concrete action}
2. {Second action}
3. {If OLAP: use batch-pipeline-designer for ETL pipeline design}
Running analytics on the OLTP database. The most common mistake. Business analysts get direct database credentials "temporarily" and the arrangement becomes permanent. As data grows, analytic queries take longer, table scans compete with user transactions for buffer pool, and OLTP latency degrades. Prevention: enforce separation early, before it becomes a political problem.
Misclassifying a hybrid workload as pure OLAP. If a system needs both fresh operational data (OLTP) and historical aggregate analysis (OLAP), designing only a warehouse misses the operational layer. The warehouse will always lag behind the operational system by the ETL interval — if analysts need current-minute data, a warehouse-only design fails.
Choosing snowflake schema when star schema suffices. Snowflake schemas are more normalized but require more joins in every analytic query. For most data warehouses, the storage savings don't justify the analyst experience penalty. Default to star schema and only move to snowflake when storage is genuinely constrained or dimension table updates are frequent.
Not accounting for write complexity of column-oriented storage. Column storage is optimized for reads. An update-in-place approach requires rewriting all column files for each affected row. Use LSM-tree ingestion (batch writes accumulate in a row-oriented in-memory store, then merge-flush to column files) for any column store that needs ingestion throughput. Systems like Vertica do this natively.
Building a data cube too early and losing query flexibility. Materialized aggregates (OLAP cubes) precompute answers to known queries and are very fast. But they can't answer questions their dimensions don't include. Most warehouses keep raw fact table data as the primary store and use cubes only as a performance layer for known high-frequency queries. Don't replace raw data with cubes.
ETL pipeline latency mismatch. A nightly ETL batch is inadequate if analysts need same-day data for decisions. Design the freshness requirement into the ETL architecture from the start — nightly batch, hourly micro-batch, or CDC streaming have very different pipeline architectures.
Scenario: E-commerce company with slow reporting
Trigger: "Our monthly sales reports are taking 20 minutes to run, and our DBAs say it's affecting checkout latency."
Process: Step 1 scoring — monthly reports aggregate all sales by region, category, and promotion (OLAP: read pattern, data represents history, disk bandwidth bottleneck); checkout is user-triggered individual record writes (OLTP: write pattern, used by end users, latest state). Score: 3 OLTP / 3 OLAP → Hybrid. Step 4: two-tier separation. OLTP: PostgreSQL for checkout and inventory. OLAP: dedicated warehouse (Redshift or Snowflake). ETL: nightly batch from OLTP to warehouse — analysts run against warehouse, never production DB. Step 3B: star schema with fact_orders at center (one row per order line), dimensions: dim_product, dim_store, dim_customer, dim_date, dim_promotion. Column storage on warehouse — fact table has 85 columns, typical reports use 6-8.
Output: Hybrid architecture. Nightly ETL extracts from PostgreSQL, loads star schema in Snowflake. Checkout latency restored. Analysts get dedicated compute without SLA risk.
Scenario: SaaS application building its first analytics feature
Trigger: "We want to add a dashboard showing our customers their usage trends over the past 12 months."
Process: Step 1 scoring — usage trends require scanning all events per customer over a year, aggregating by day/week (OLAP: read pattern, aggregation, history). But users trigger writes (OLAP + OLTP mix on write side). Volume: currently 50GB, projected 500GB in two years. Response time: dashboard can tolerate 2-3 second loads. Step 4: small-scale separation. Read replica with pre-aggregated materialized views is sufficient now; design schema to migrate to a proper warehouse when data exceeds 1TB or query complexity grows. Step 3B: fact_usage_events (one row per usage event per user per day), dim_date, dim_user, dim_feature. Star schema. Column-oriented storage deferred until >1TB — current data fits in memory.
Output: Start with PostgreSQL read replica + materialized views refreshed hourly. Schema designed as star schema from day one. Explicit migration trigger: add dedicated warehouse when data exceeds 1TB or query refresh time exceeds 10 seconds.
Scenario: IoT sensor data platform
Trigger: "We collect readings from 10,000 sensors every 30 seconds. We need to detect anomalies in real-time but also run monthly trend analysis."
Process: Step 1 scoring — 10,000 sensors × 2 readings/min × 60 min × 24 hr = ~28.8M rows/day. Historical trend analysis scans months of data for aggregation (OLAP: read pattern, dataset size, bandwidth bottleneck, history). Real-time anomaly detection reads the latest N readings per sensor (OLTP: read by key, latest state, low-latency). Score: 3 OLTP / 3 OLAP → Hybrid, but the write pattern (streaming ingestion at high throughput) tilts toward OLAP infrastructure. Step 4: streaming architecture — Kafka for ingestion, Flink or Spark Streaming for real-time anomaly detection (OLTP path), Apache Iceberg or ClickHouse for historical columnar storage (OLAP path). Step 3B: fact_readings (one row per sensor reading with timestamp, sensor_id, value, unit), dim_sensor (location, type, calibration metadata), dim_date. Column storage applied immediately — 30B rows/year, queries scan months of data for trend analysis.
Output: Two-path architecture. Real-time: Kafka → Flink anomaly detection → alert system. Historical: Kafka → ClickHouse (columnar) for trend queries. Shared ingestion pipeline, separate read paths. Cross-reference batch-pipeline-designer for ETL scheduling on the historical path.
storage-engine-selectorbatch-pipeline-designerThis skill is licensed under CC-BY-SA-4.0.
Source: BookForge — Designing Data-Intensive Applications by Martin Kleppmann.
This skill is standalone. Browse more BookForge skills: bookforge-skills
共 1 个版本
暂无安全检测报告