Analyze, diagnose, and optimize SQL queries for better database performance.
When a user provides a SQL query, first identify what information is already given vs. what's missing. Ask for or infer:
SHOW INDEX, pg_indexes, sys.indexes, USER_INDEXESEXPLAIN ANALYZE, EXPLAIN (ANALYZE, BUFFERS), SET STATISTICS IO ON, DBMS_XPLANIf the query alone is given without context, use SQL pattern analysis to detect common anti-patterns.
Scan the query for all known anti-patterns:
| Category | Anti-pattern | Severity | Detection Rule |
|---|---|---|---|
| ---------- | ------------- | ---------- | ---------------- |
| Filtering | WHERE FUNC(col) = value (function on column) | High | Regex: WHERE\s+\w+\( on column ref |
| Filtering | WHERE col LIKE '%prefix' (leading wildcard) | High | Regex: LIKE\s+'% |
| Filtering | WHERE col IS NULL / IS NOT NULL on indexed column | Medium | Most B-tree indexes don't store NULLs |
| Filtering | Implicit type conversion | High | WHERE varchar_col = 123 — check operand types |
| Filtering | WHERE OR on different indexed columns | Medium | Only one index can be used per table ref |
| Join | Missing join index (FK column not indexed) | High | Check if join column has matching index |
| Join | Join on expression (not raw column) | High | Prevents index usage |
| Join | Cross join without WHERE condition | Critical | Accidental Cartesian product |
| Subquery | Correlated subquery executing row-by-row | High | Subquery references outer query alias |
| Subquery | IN (SELECT ...) returning large result set | Medium | Materializes full subquery result |
| Subquery | NOT IN on nullable column | Critical | Always returns empty result set |
| Sort | ORDER BY on non-indexed column (no LIMIT) | Medium | Full sort on all matching rows |
| Sort | ORDER BY RAND() | High | Table scan + sort all rows |
| Sort | ORDER BY with mixed ASC/DESC not matching index | Medium | Index can't satisfy sort order |
| Agg | GROUP BY on high-cardinality unindexed column | High | Hash aggregate on all rows |
| Agg | HAVING col = val without aggregate function | Medium | Should be WHERE filter |
| Scan | SELECT * in OLTP query | Low-Medium | Unnecessary column IO + network |
| Scan | SELECT DISTINCT on PK or unique column | Low | Redundant dedup |
| Page | OFFSET with large value | High | Re-scans and discards all prior rows |
| Page | LIMIT without ORDER BY | Low | Non-deterministic result set |
| CTE | CTE referenced multiple times causing re-evaluation | Medium | Not materialized in some databases |
| Set | UNION instead of UNION ALL when no duplicates | Medium | Extra sort for dedup |
| Set | INTERSECT / EXCEPT on large sets | Medium | Requires full sort on both sides |
| Lock | FOR UPDATE on read-replica / high-contention table | Medium | Blocks concurrent readers |
| Lock | SERIALIZABLE isolation where READ COMMITTED suffices | Low | Extra locking overhead |
| DML | Large batch UPDATE/DELETE without chunking | High | Long transaction, log blowup, lock escalation |
| DML | INSERT INTO t SELECT * without column list | Medium | Fragile to schema changes |
Parse and explain these key plan metrics:
| Metric | What It Means | Warning Threshold |
|---|---|---|
| -------- | --------------- | ------------------- |
| seq scan | Full table scan — reads every row sequentially | Any row > 10,000 |
| index scan | Walks B-tree leaf nodes, then fetches rows from heap | Fetched rows > 10% of table |
| index-only scan | All needed columns present in index, no heap access | Always ideal |
| bitmap scan (PG) | Builds bitmap of matching page locations from index | OK for moderate selectivity |
| rows vs. actual rows | Planner estimate vs. actual rows processed | Estimate off by > 10x |
| filter | Rows read but discarded by WHERE clause | Filter discards > 50% |
| sort (memory) | Sort performed entirely in memory (work_mem / sort_buffer) | OK under configured memory |
| sort (disk) | Sort spilled to temp disk file (runs out of memory) | Always problematic |
| nested loop | For each outer row, probe inner table | Outer > 1000 rows is risky |
| hash join | Build hash table on one side, probe with other | OK for large unindexed joins |
| merge join | Sort both sides, then merge-consume | Good for pre-sorted data |
| materialize | CTE / subquery result written to temp storage | Repeated access degrades perf |
| loops | Number of times a plan node is executed | > 1 on inner index scan = nested loop |
| buffers (shared hit) | Pages found in buffer cache | Low = memory efficient |
| buffers (shared read) | Pages read from disk (cache miss) | High = need more shared_buffers / buffer pool |
| temp written | Data spilled to temp files | > 0 = insufficient memory for sort/hash |
| planning time | Time spent in query optimization | > 100ms for simple query |
| execution time | Actual runtime | Target depends on SLA |
| Database | Keywords to Watch |
|---|---|
| ---------- | ------------------ |
| MySQL | Using temporary, Using filesort, Using where (after index), Using index condition |
| PostgreSQL | Parallel Seq Scan, Materialize, Sort Method: external merge, Rows Removed by Filter |
| SQL Server | Table Scan, Index Scan (not seek), Key Lookup (clustered), RID Lookup, Spool, Sort |
| Oracle | TABLE ACCESS FULL (large table), SORT (AGGREGATE), NESTED LOOPS (high cardinality) |
Format every diagnosis as a Markdown table with three columns:
| Problem | Impact | Suggestion |
|---------|--------|------------|
| … | … | … |
Sort rows by severity (Critical → High → Medium → Low).
After the diagnosis table, provide a rewritten query with:
Always produce analysis in this structure:
| Problem | Impact | Suggestion |
|---|---|---|
| --------- | -------- | ------------ |
-- Original
SELECT * FROM orders WHERE YEAR(created_at) = 2024;
-- Optimized
SELECT * FROM orders
WHERE created_at >= '2024-01-01'
AND created_at < '2025-01-01';
Explanation: Replaced function-on-column YEAR(created_at) with range predicate, enabling index seek on created_at. Expected: seq scan → index range scan.
| Index | Type | Reason |
|---|---|---|
| ------- | ------ | -------- |
CREATE INDEX idx_orders_created ON orders(created_at) | B-tree | Supports the range filter in WHERE |
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at) | Composite (covering) | Covers filter + sort in one scan |
| Operator | Est. Rows | Actual Rows | Cost | Issue |
|---|---|---|---|---|
| ---------- | ----------- | ------------- | ------ | ------- |
| Seq Scan on orders | 100k | 500k | 12450 | Full table scan — missing index on WHERE column |
| Access Pattern | Index Type | Example DDL |
|---|---|---|
| ---------------- | ----------- | ------------- |
| Equality on one column | B-tree single-column | CREATE INDEX ON t(col) |
| Equality + range | B-tree composite, equality col first | CREATE INDEX ON t(a, b) for WHERE a=1 AND b>5 |
| Multiple equality filters | B-tree composite, most selective first | CREATE INDEX ON t(a, b) for WHERE a=1 AND b=2 |
| Sort order match | B-tree with matching ASC/DESC | CREATE INDEX ON t(col DESC) for ORDER BY col DESC |
| Prefix / case-insensitive search | Expression index or trigram GiST | CREATE INDEX ON t(LOWER(col)) |
| Full-text search | GIN (PG) / FULLTEXT (MySQL) | CREATE FULLTEXT INDEX ON t(col) |
| JSON / JSONB field access | GIN, or B-tree on expression | CREATE INDEX ON t((json_col->>'field')) |
| Geospatial queries | GiST (PG) / SPATIAL (MySQL) | CREATE INDEX ON t USING GIST(geo) |
| Covering query | B-tree with INCLUDE | CREATE INDEX ON t(a) INCLUDE (b, c) (PG) |
| Partial / filtered | B-tree with WHERE clause | CREATE INDEX ON t(col) WHERE active=true |
| Selectivity | Rows Matched | Recommended Action |
|---|---|---|
| ------------- | ------------- | ------------------- |
| High | > 10% | Index unlikely to help; consider scan + partition pruning |
| Moderate | 1–10% | Composite index may help; verify with EXPLAIN |
| Selective | < 1% | Index very effective; ensure leading column matches filter |
| Issue | Impact | Fix |
|---|---|---|
| ------- | -------- | ----- |
| Missing index on FK column | Nested loop join → seq scan on inner table | Add index on FK column |
| Redundant indexes | Extra write overhead, cache pressure | Drop unused indexes (check pg_stat_user_indexes / sys.dm_db_index_usage_stats) |
| Index fragmentation (B-tree page splits) | More IO per scan, larger index size | Rebuild / REINDEX during low load |
| Over-indexed table (many indexes on one table) | Slow DML (each index updated per row) | Consolidate: composite index can replace multiple single-col indexes |
| Stale index statistics | Optimizer chooses suboptimal plan | ANALYZE (PG) / UPDATE STATISTICS (SQL Server) / OPTIMIZE (MySQL) |
EXPLAIN ANALYZE SELECT ... \G -- MySQL 8.0.18+, actual runtime
Using index = covering index, no table access — goodUsing where = rows filtered after storage engine read — check selectivityUsing temporary = temp table created — avoid for large dataUsing filesort = sort outside index — add composite index covering ORDER BYUsing index condition = index condition pushdown (ICP) — good, MySQL 5.6+Impossible WHERE = WHERE always false — dead codeINNODB_BUFFER_POOL_SIZE should fit working setEXPLAIN (ANALYZE, BUFFERS, TIMING) SELECT ...;
rows vs actual rows mismatch (> 10x) → stale stats, run ANALYZEParallel Seq Scan = normal for tables > 1 GB with sufficient max_parallel_workers_per_gatherSort Method: external merge Disk → increase work_memMATERIALIZED / NOT MATERIALIZED hintNOT IN on nullable column → always returns 0 rows. Use NOT EXISTSpg_stat_user_tables.n_dead_tup > 20% of n_live_tup → vacuum behindpg_stat_user_indexes → check idx_scan to find unused indexespg_stat_statements → identify most time-consuming queriesSET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO
SELECT ...;
PAGEIOLATCH_SH wait type = disk IO bottleneck (insufficient buffer pool)LCK_M_* wait types = blocking / lock contentionCXPACKET wait = parallelism skew (one thread slower than rest)Table Scan vs Clustered Index Scan — both full scans, but CI scan may be narrower if clustered PK is narrowKey Lookup (clustered) / RID Lookup (heap) = non-covering index → expensive per-row lookups. Add INCLUDE columnssys.dm_db_missing_index_detailsOPTION (RECOMPILE) or OPTIMIZE FOR UNKNOWNNOLOCK hint = dirty reads. Prefer READ COMMITTED SNAPSHOT database setting-- After query runs, get actual plan:
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(FORMAT=>'ALLSTATS LAST +OUTLINE'));
TABLE ACCESS FULL — not always bad; Oracle multiblock reads are efficient for large tablesINDEX UNIQUE SCAN → ideally 1 row returnedINDEX RANGE SCAN → check number of rows scannedNESTED LOOPS effective when driving set is small (< 10k rows) and inner table has unique index accessHASH JOIN preferred for large unindexed joinsSORT (AGGREGATE) without GROUP BY → aggregates all rows, verify cardinality firstAUTOTRACE TRACEONLY STATISTICS| Anti-Pattern | Quick Fix | Expected Gain |
|---|---|---|
| ------------- | ----------- | --------------- |
WHERE YEAR(col) = 2024 | WHERE col >= '2024-01-01' AND col < '2025-01-01' | Index seek replaces seq scan |
WHERE LIKE '%keyword' | Add full-text index + MATCH ... AGAINST / @@ | Eliminate full scan |
OFFSET 10000 LIMIT 20 | Keyset: WHERE id > 10000 LIMIT 20 | Skip 10000 discarded row reads |
SELECT * in OLTP | Explicit column list only needed columns | Less IO, less network, less memory |
IN (SELECT ...) (large set) | EXISTS (SELECT 1 ...) | Short-circuits on first match |
ORDER BY RAND() | TABLESAMPLE / WHERE id IN (random selection) | Full scan replaced by direct access |
Large UPDATE/DELETE in one shot | Batch in chunks of 1000 rows | Reduce lock duration, log growth, blocking |
OR across different indexed cols | UNION ALL (each branch uses its own index) | Index seek per branch instead of full scan |
HAVING col = val (no aggregate) | Move condition to WHERE | Filter before GROUP BY, fewer rows to aggregate |
| Correlated subquery | Rewrite as JOIN or window function | Eliminate row-by-row execution |
NOT IN (subquery) | NOT EXISTS | Safe with NULLs, often faster |
| Multiple CTE references | Materialize explicitly or use temp table | Avoid repeated evaluation |
Query:
SELECT * FROM orders
WHERE YEAR(created_at) = 2024
ORDER BY total_amount DESC;
Diagnosis:
| Problem | Impact | Suggestion |
|---|---|---|
| --------- | -------- | ------------ |
YEAR(created_at) prevents index usage | Full table scan forced, index on created_at unusable | Replace with range predicate: created_at >= '2024-01-01' AND created_at < '2025-01-01' |
ORDER BY total_amount not indexed | Filesort on all matching rows (~500k) | Add composite index: (created_at, total_amount) |
SELECT * returns all columns | Extra IO on large text/blob columns | List only id, user_id, total_amount, status |
Rewritten:
SELECT id, user_id, total_amount, status
FROM orders
WHERE created_at >= '2024-01-01'
AND created_at < '2025-01-01'
ORDER BY total_amount DESC;
Index:
CREATE INDEX idx_orders_created_amount ON orders(created_at, total_amount DESC);
Query:
SELECT u.name, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-06-01'
GROUP BY u.id, u.name;
Diagnosis:
| Problem | Impact | Suggestion |
|---|---|---|
| --------- | -------- | ------------ |
orders.user_id not indexed | Nested loop join seq-scans orders for each matching user | Add index: CREATE INDEX idx_orders_user_id ON orders(user_id) |
GROUP BY u.id, u.name | Extra sort/hash on all columns in SELECT | In MySQL, GROUP BY on PK suffices if name is functionally dependent |
Rewritten:
SELECT u.name, o.order_count
FROM users u
LEFT JOIN (SELECT user_id, COUNT(*) AS order_count
FROM orders
GROUP BY user_id) o ON u.id = o.user_id
WHERE u.created_at > '2024-06-01';
Index:
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_users_created ON users(created_at);
共 1 个版本