← 返回
未分类

Hologres Schema Generator

Hologres DDL schema design and table creation expert. Use for generating CREATE TABLE statements, choosing storage formats (column/row/row-column), configuri...
Hologres DDL模式设计与表创建专家。用于生成CREATE TABLE语句,选择存储格式(列/行/行列),配置...
wenbingyu
未分类 clawhub v0.2.0 1 版本 99543.4 Key: 无需
★ 0
Stars
📥 218
下载
💾 0
安装
1
版本
#latest

概述

Prerequisites

This skill requires hologres-cli to be installed first:

pip install hologres-cli
export HOLOGRES_SKILL=hologres-schema-generator

All SQL execution depends on hologres-cli commands (hologres sql run --write, hologres table create).

Hologres Schema Design & Table Creation

Generate optimized Hologres DDL statements with proper storage format, indexing, and distribution strategies.

Information Gathering (IMPORTANT)

Before generating DDL, check whether the user has provided enough context. Many Hologres table properties are immutable after creation (orientation, distribution_key, clustering_key, event_time_column, primary key, partition column) — getting them wrong means recreating the table.

Required Information

If ANY of the following are unclear, ask the user before generating DDL:

DimensionWhy CriticalExample Question
-----------------------------------------
Query patternDetermines orientation (column/row/row-column)"This table is mainly used for OLAP analytics, KV point lookups, or both?"
Column definitionsCore schema structure"Can you list the main columns and their rough types (text, number, timestamp, etc.)?"
Primary key / unique keyDetermines UPSERT capability; immutable"Does this table need deduplication or updates by a unique key? If so, which column(s)?"
Main query conditionsDetermines distribution_key, clustering_key, bitmap"What columns are most often used in WHERE, JOIN, or GROUP BY?"

Conditionally Required

Ask these only when relevant signals appear:

SignalFollow-up Question
---------------------------
Table looks like time-series or log data"What is the estimated daily data volume? (determines whether to use partitioning)"
User mentions JOIN with other tables"Which column will this table JOIN on, and what is the distribution_key of the other table?"
Large data volume or mentions retention"How long should data be retained? (recommend dynamic partition management for lifecycle)"
User mentions both analytics and serving"What is the read pattern — batch scans, point lookups by ID, or both?"

Can Use Defaults (No Need to Ask)

PropertyDefault Behavior
---------------------------
bitmap_columnsAuto-enabled for TEXT columns; can ALTER later
dictionary_encoding_columnsUse :auto mode; can ALTER later
storage_modeDefault hot; can ALTER later
time_to_live_in_secondsNot recommended. Deletion time is non-deterministic. Use dynamic partition management instead

Gathering Flow

User request → Check available info
  |
  +-- Columns + query pattern + key info all clear?
  |     → Generate DDL directly
  |
  +-- Missing critical info?
  |     → Ask 1-3 focused questions (batch them, don't ask one at a time)
  |
  +-- Very vague request (e.g., "help me create a user table")?
        → Ask: 1) main columns  2) query pattern  3) key/dedup needs

> Principle: Batch questions into a single round. Never ask more than 3-4 questions at once. If the user provides partial info, fill in reasonable defaults for the rest and explain your assumptions in comments.

Storage Formats

Choose the storage format based on your primary query pattern.

FormatSyntaxBest ForPrimary Key
---------------------------------------
Column storeorientation = 'column'OLAP, aggregation, scan-heavy queriesOptional
Row storeorientation = 'row'Point lookups, high-QPS KV queriesRequired
Row-column storeorientation = 'row,column'Mixed workloads (OLAP + point lookup)Required

> Default recommendation: Use row,column (row-column store) when the workload is unclear. It handles both OLAP and point queries well.

Quick Start

-- Column store table (OLAP / analytics)
CREATE TABLE orders (
  order_id BIGINT NOT NULL,
  user_id BIGINT,
  amount DECIMAL(18,2),
  status TEXT,
  order_time TIMESTAMPTZ NOT NULL,
  ds TEXT NOT NULL,
  PRIMARY KEY (order_id)
)
PARTITION BY LIST (ds)
WITH (
  orientation = 'column',
  distribution_key = 'order_id',
  clustering_key = 'order_time:asc',
  event_time_column = 'order_time',
  bitmap_columns = 'status,user_id'
);

-- Row store table (high-QPS point lookup)
CREATE TABLE user_profile (
  user_id BIGINT NOT NULL,
  name TEXT,
  email TEXT,
  tags JSONB,
  updated_at TIMESTAMPTZ,
  PRIMARY KEY (user_id)
)
WITH (
  orientation = 'row',
  distribution_key = 'user_id',
  clustering_key = 'user_id'
);

CREATE TABLE Syntax (V2.1+)

CREATE TABLE [IF NOT EXISTS] [schema.]table_name (
  column_name data_type [NOT NULL] [DEFAULT expr],
  ...
  [PRIMARY KEY (col1 [, col2, ...])]
)
[PARTITION BY LIST (partition_column)]
WITH (
  orientation = '{column | row | row,column}',
  distribution_key = 'col1[,col2]',
  clustering_key = 'col1[:asc|:desc][,col2[:asc|:desc]]',
  event_time_column = 'col',
  bitmap_columns = 'col1[,col2,...]',
  dictionary_encoding_columns = 'col1[:auto|:on|:off][,...]',
  time_to_live_in_seconds = 'N',          -- NOT recommended, see below
  storage_mode = '{hot | cold}',
  table_group = 'group_name'
);

Legacy Syntax (All Versions)

BEGIN;
CREATE TABLE table_name (...);
CALL set_table_property('table_name', 'orientation', 'column');
CALL set_table_property('table_name', 'distribution_key', 'col1');
CALL set_table_property('table_name', 'clustering_key', 'col1:asc');
CALL set_table_property('table_name', 'event_time_column', 'col1');
CALL set_table_property('table_name', 'bitmap_columns', 'col1,col2');
COMMIT;

Table Properties Reference

distribution_key (Distribution Key)

Controls how data is hash-distributed across shards. Critical for JOIN and GROUP BY performance.

RuleDescription
-------------------
Choose high-cardinality columnsAvoid data skew
Use JOIN/GROUP BY columnsEnable local computation, avoid shuffle
Max 2 columnsMore columns reduce distribution effectiveness
Must be subset of PKIf table has a primary key
Immutable after creationCannot ALTER, must recreate table
-- Single column
distribution_key = 'user_id'

-- Two columns (join on both)
distribution_key = 'order_id,user_id'

clustering_key (Clustering Key / Sorted Index)

Physically sorts data within files. Accelerates range queries and filters.

RuleDescription
-------------------
Use range-query columnse.g., timestamp, date columns
Max 2-3 columnsMore columns dilute sort benefit
Put high-selectivity column firstMost-filtered column goes first
Specify sort order:asc (default) or :desc
Column store onlyRow store uses PK as sort key
clustering_key = 'order_time:asc'
clustering_key = 'ds:asc,order_time:asc'

event_time_column (Segment Key)

Organizes data files by time ranges. Enables file-level pruning for time-range queries.

RuleDescription
-------------------
Use time/date columnsTimestamp of data ingestion or event time
At most 1 columnOnly one segment key per table
Column store onlyNot applicable to row store
Combine with partitionPartition for coarse pruning, segment key for fine pruning
event_time_column = 'event_time'

bitmap_columns (Bitmap Index)

Builds bitmap indexes for fast equality filtering on low-to-medium cardinality columns.

RuleDescription
-------------------
Use filter columnsColumns frequently in WHERE clause
Low-medium cardinalitystatus, type, region — NOT user_id
TEXT columns auto-enabledDefault bitmap for text columns
Can be added after creationALTER TABLE ... SET (bitmap_columns = ...)
bitmap_columns = 'status,payment_type,region'

dictionary_encoding_columns (Dictionary Encoding)

Compresses text columns by mapping values to integers. Speeds up GROUP BY and aggregations.

RuleDescription
-------------------
Use :auto modeLet Hologres decide based on cardinality
Good for low-cardinality textcountry, status, category
Avoid high-cardinality columnsuser_id, order_id — no compression benefit
Don't set if unsureIncorrect setting may hurt performance
dictionary_encoding_columns = 'country:auto,status:auto'

time_to_live_in_seconds (TTL) — NOT RECOMMENDED

> Do NOT use time_to_live_in_seconds. The actual deletion time is non-deterministic — data will be deleted at an arbitrary time after the specified TTL, not at a precise point. This makes it unreliable for data lifecycle management.

>

> Recommended alternative: Use dynamic partition management — create daily/hourly partitions and drop old partitions on a schedule (via cron or scheduling system). This gives you precise, predictable data lifecycle control.

-- BAD: TTL-based lifecycle (deletion time unpredictable)
-- time_to_live_in_seconds = '2592000'

-- GOOD: Partition-based lifecycle (precise control)
-- 1. Create table with daily partitions
CREATE TABLE events (
  event_id BIGINT NOT NULL,
  ds TEXT NOT NULL,
  PRIMARY KEY (ds, event_id)
) PARTITION BY LIST (ds)
WITH (orientation = 'column', distribution_key = 'event_id');

-- 2. Drop old partitions on schedule (e.g., retain 30 days)
DROP TABLE IF EXISTS events_20251101;  -- drop partition older than 30 days

Primary Key Design

Storage FormatPK RequirementNotes
-------------------------------------
Column storeOptionalAdd PK only if needed for UPSERT
Row storeRequiredPK drives the row-store index
Row-column storeRequiredPK serves both point lookup and analytics

Rules:

  • Max 32 columns in composite PK
  • PK columns must be NOT NULL and UNIQUE
  • Prohibited types: FLOAT, DOUBLE, NUMERIC, ARRAY, JSON, JSONB
  • PK cannot be altered after creation — must recreate table
  • Avoid SERIAL as PK — causes table-level locks on write

Partition Table Design

Use partitions for large tables with time-based or categorical data.

-- Parent table
CREATE TABLE events (
  event_id BIGINT NOT NULL,
  user_id BIGINT,
  event_type TEXT,
  ds TEXT NOT NULL,
  PRIMARY KEY (ds, event_id)
)
PARTITION BY LIST (ds)
WITH (
  orientation = 'column',
  distribution_key = 'event_id',
  clustering_key = 'ds:asc',
  event_time_column = 'ds'
);

-- Child partitions
CREATE TABLE events_20251201 PARTITION OF events FOR VALUES IN ('20251201');
CREATE TABLE events_20251202 PARTITION OF events FOR VALUES IN ('20251202');

Rules:

  • Only LIST partitioning is supported
  • Partition column must be part of PK (if PK exists)
  • Supported partition column types: TEXT, VARCHAR, INT, DATE (V1.3.22+)
  • Skip daily partitions if daily data < 100M rows — use event_time_column instead
  • Always filter on partition column in queries for pruning

Scenario-Based Templates

1. High-QPS Point Lookup (KV)

CREATE TABLE user_kv (
  user_id BIGINT NOT NULL PRIMARY KEY,
  profile JSONB,
  updated_at TIMESTAMPTZ
)
WITH (
  orientation = 'row',
  distribution_key = 'user_id'
);

2. Prefix Range Scan

CREATE TABLE order_lines (
  order_id BIGINT NOT NULL,
  line_no INT NOT NULL,
  product_id BIGINT,
  quantity INT,
  PRIMARY KEY (order_id, line_no)
)
WITH (
  orientation = 'row',
  distribution_key = 'order_id',
  clustering_key = 'order_id'
);

3. Time-Range Analytics (Partitioned)

CREATE TABLE page_views (
  view_id BIGINT NOT NULL,
  user_id BIGINT,
  page_url TEXT,
  view_time TIMESTAMPTZ,
  ds TEXT NOT NULL,
  PRIMARY KEY (ds, view_id)
)
PARTITION BY LIST (ds)
WITH (
  orientation = 'column',
  distribution_key = 'view_id',
  clustering_key = 'view_time:asc',
  event_time_column = 'view_time',
  bitmap_columns = 'user_id,page_url'
);

4. Dimension Filtering (Non-Time)

CREATE TABLE product_stats (
  product_id BIGINT NOT NULL PRIMARY KEY,
  category TEXT,
  brand TEXT,
  sales_count BIGINT,
  revenue DECIMAL(18,2)
)
WITH (
  orientation = 'column',
  distribution_key = 'product_id',
  clustering_key = 'category:asc',
  bitmap_columns = 'category,brand'
);

5. JOIN-Optimized Fact + Dimension

-- Fact table: distribute by join key
CREATE TABLE fact_orders (
  order_id BIGINT NOT NULL PRIMARY KEY,
  customer_id BIGINT,
  amount DECIMAL(18,2),
  order_date TEXT
)
WITH (
  orientation = 'column',
  distribution_key = 'customer_id'
);

-- Dimension table: same distribution key
CREATE TABLE dim_customers (
  customer_id BIGINT NOT NULL PRIMARY KEY,
  name TEXT,
  region TEXT
)
WITH (
  orientation = 'row,column',
  distribution_key = 'customer_id'
);
-- JOIN on customer_id → local join, no shuffle

6. Mixed Workload (OLAP + Serving)

CREATE TABLE realtime_metrics (
  metric_id BIGINT NOT NULL,
  device_id BIGINT NOT NULL,
  value DOUBLE PRECISION,
  ts TIMESTAMPTZ NOT NULL,
  ds TEXT NOT NULL,
  PRIMARY KEY (ds, device_id, metric_id)
)
PARTITION BY LIST (ds)
WITH (
  orientation = 'row,column',
  distribution_key = 'device_id',
  clustering_key = 'ts:asc',
  event_time_column = 'ts',
  bitmap_columns = 'device_id'
);

References

DocumentContent
-------------------
data-types.mdComplete data type reference
table-properties.mdDetailed table property guide with selection flowcharts
partition-guide.mdPartition table design and management

Best Practices

  1. Choose storage format first — column for OLAP, row for KV, row-column when unsure
  2. Set distribution_key to JOIN/GROUP BY columns — avoids cross-shard shuffle
  3. Set clustering_key for range-query columns — improves time-range and filter queries
  4. Set event_time_column for time-series data — enables file-level pruning
  5. Use bitmap_columns for low-cardinality filter columns — status, type, region
  6. Partition by date only if daily data > 100M rows — otherwise use segment key alone
  7. PK must include partition column — required by Hologres for partition tables
  8. Align distribution_key across JOINed tables — same column enables local join
  9. Use WITH syntax (V2.1+) — cleaner than CALL set_table_property in transactions
  10. Avoid SERIAL as primary key — causes table-level write locks

版本历史

共 1 个版本

  • v0.2.0 当前
    2026-05-21 15:30 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

Hologres Bsi Profile Analysis

wenbingyu
Hologres BSI位切片索引画像分析技能,用于用户画像与标签计算,涵盖BSI表设计、数据导入、属性+行为标签联合人群圈选、GMV分析、标签分布统计、TopK查询及分桶并行计算等场景。
★ 0 📥 274

Hologres Slow Query Analysis

wenbingyu
Hologres慢查询日志分析与诊断技能:用于分析慢查询、失败查询、查询性能诊断及日志管理(阿里云)。
★ 0 📥 247

Hologres Privileges

wenbingyu
使用 PostgreSQL 标准授权模型(专家权限模型)进行 Hologres 权限管理,用于创建用户、授予/撤销 Schema/表权限
★ 0 📥 270