SQL Optimizer

Analyze, diagnose, and optimize SQL queries for better database performance.

Scope

Input: Any SQL query (SELECT, INSERT, UPDATE, DELETE, DDL), with optional context (table schema, index list, execution plan, database type).
Output: Structured Markdown table with Problem, Impact, and Suggestion columns; plus rewritten SQL where applicable.
Database support: MySQL, PostgreSQL, SQL Server, Oracle, SQLite, MariaDB.

Workflow

Step 1: Gather Context

When a user provides a SQL query, first identify what information is already given vs. what's missing. Ask for or infer:

Database type — MySQL, PostgreSQL, SQL Server, Oracle, SQLite
Table schemas — column types, NULLability, constraints
Existing indexes — via SHOW INDEX, pg_indexes, sys.indexes, USER_INDEXES
Execution plan — EXPLAIN ANALYZE, EXPLAIN (ANALYZE, BUFFERS), SET STATISTICS IO ON, DBMS_XPLAN
Cardinality & data distribution — estimated row count, value distribution
Query performance goals — reduce latency, reduce IO, reduce temp-disk usage, reduce lock contention

If the query alone is given without context, use SQL pattern analysis to detect common anti-patterns.

Step 2: Detect Performance Problems

Scan the query for all known anti-patterns:

Category	Anti-pattern	Severity	Detection Rule
----------	-------------	----------	----------------
Filtering	`WHERE FUNC(col) = value` (function on column)	High	Regex: `WHERE\s+\w+\(` on column ref
Filtering	`WHERE col LIKE '%prefix'` (leading wildcard)	High	Regex: `LIKE\s+'%`
Filtering	`WHERE col IS NULL` / `IS NOT NULL` on indexed column	Medium	Most B-tree indexes don't store NULLs
Filtering	Implicit type conversion	High	`WHERE varchar_col = 123` — check operand types
Filtering	`WHERE OR` on different indexed columns	Medium	Only one index can be used per table ref
Join	Missing join index (FK column not indexed)	High	Check if join column has matching index
Join	Join on expression (not raw column)	High	Prevents index usage
Join	Cross join without WHERE condition	Critical	Accidental Cartesian product
Subquery	Correlated subquery executing row-by-row	High	Subquery references outer query alias
Subquery	`IN (SELECT ...)` returning large result set	Medium	Materializes full subquery result
Subquery	`NOT IN` on nullable column	Critical	Always returns empty result set
Sort	`ORDER BY` on non-indexed column (no LIMIT)	Medium	Full sort on all matching rows
Sort	`ORDER BY RAND()`	High	Table scan + sort all rows
Sort	`ORDER BY` with mixed ASC/DESC not matching index	Medium	Index can't satisfy sort order
Agg	`GROUP BY` on high-cardinality unindexed column	High	Hash aggregate on all rows
Agg	`HAVING col = val` without aggregate function	Medium	Should be WHERE filter
Scan	`SELECT *` in OLTP query	Low-Medium	Unnecessary column IO + network
Scan	`SELECT DISTINCT` on PK or unique column	Low	Redundant dedup
Page	`OFFSET` with large value	High	Re-scans and discards all prior rows
Page	`LIMIT` without `ORDER BY`	Low	Non-deterministic result set
CTE	CTE referenced multiple times causing re-evaluation	Medium	Not materialized in some databases
Set	`UNION` instead of `UNION ALL` when no duplicates	Medium	Extra sort for dedup
Set	`INTERSECT` / `EXCEPT` on large sets	Medium	Requires full sort on both sides
Lock	`FOR UPDATE` on read-replica / high-contention table	Medium	Blocks concurrent readers
Lock	`SERIALIZABLE` isolation where READ COMMITTED suffices	Low	Extra locking overhead
DML	Large batch `UPDATE`/`DELETE` without chunking	High	Long transaction, log blowup, lock escalation
DML	`INSERT INTO t SELECT *` without column list	Medium	Fragile to schema changes

Step 3: Analyze Execution Plan (if provided)

Parse and explain these key plan metrics:

Metric	What It Means	Warning Threshold
--------	---------------	-------------------
seq scan	Full table scan — reads every row sequentially	Any row > 10,000
index scan	Walks B-tree leaf nodes, then fetches rows from heap	Fetched rows > 10% of table
index-only scan	All needed columns present in index, no heap access	Always ideal
bitmap scan (PG)	Builds bitmap of matching page locations from index	OK for moderate selectivity
rows vs. actual rows	Planner estimate vs. actual rows processed	Estimate off by > 10x
filter	Rows read but discarded by WHERE clause	Filter discards > 50%
sort (memory)	Sort performed entirely in memory (`work_mem` / `sort_buffer`)	OK under configured memory
sort (disk)	Sort spilled to temp disk file (runs out of memory)	Always problematic
nested loop	For each outer row, probe inner table	Outer > 1000 rows is risky
hash join	Build hash table on one side, probe with other	OK for large unindexed joins
merge join	Sort both sides, then merge-consume	Good for pre-sorted data
materialize	CTE / subquery result written to temp storage	Repeated access degrades perf
loops	Number of times a plan node is executed	> 1 on inner index scan = nested loop
buffers (shared hit)	Pages found in buffer cache	Low = memory efficient
buffers (shared read)	Pages read from disk (cache miss)	High = need more shared_buffers / buffer pool
temp written	Data spilled to temp files	> 0 = insufficient memory for sort/hash
planning time	Time spent in query optimization	> 100ms for simple query
execution time	Actual runtime	Target depends on SLA

Execution Plan Keywords by Database

Database	Keywords to Watch
----------	------------------
MySQL	`Using temporary`, `Using filesort`, `Using where` (after index), `Using index condition`
PostgreSQL	`Parallel Seq Scan`, `Materialize`, `Sort Method: external merge`, `Rows Removed by Filter`
SQL Server	`Table Scan`, `Index Scan` (not seek), `Key Lookup` (clustered), `RID Lookup`, `Spool`, `Sort`
Oracle	`TABLE ACCESS FULL` (large table), `SORT (AGGREGATE)`, `NESTED LOOPS` (high cardinality)

Step 4: Output — Markdown Table

Format every diagnosis as a Markdown table with three columns:

| Problem | Impact | Suggestion |
|---------|--------|------------|
| … | … | … |

Sort rows by severity (Critical → High → Medium → Low).

Step 5: Rewrite SQL (when applicable)

After the diagnosis table, provide a rewritten query with:

The original query in a SQL code block
The optimized query with annotations
A brief explanation of each change and its expected impact

Output Template

Always produce analysis in this structure:

1. Diagnosis

Problem	Impact	Suggestion
---------	--------	------------

2. Rewritten Query

-- Original
SELECT * FROM orders WHERE YEAR(created_at) = 2024;

-- Optimized
SELECT * FROM orders
WHERE created_at >= '2024-01-01'
  AND created_at <  '2025-01-01';

Explanation: Replaced function-on-column YEAR(created_at) with range predicate, enabling index seek on created_at. Expected: seq scan → index range scan.

3. Index Recommendations

Index	Type	Reason
-------	------	--------
`CREATE INDEX idx_orders_created ON orders(created_at)`	B-tree	Supports the range filter in WHERE
`CREATE INDEX idx_orders_user_date ON orders(user_id, created_at)`	Composite (covering)	Covers filter + sort in one scan

4. Execution Plan Summary (if plan provided)

Operator	Est. Rows	Actual Rows	Cost	Issue
----------	-----------	-------------	------	-------
Seq Scan on orders	100k	500k	12450	Full table scan — missing index on WHERE column

Index Strategy Reference

Index Type Selection

Access Pattern	Index Type	Example DDL
----------------	-----------	-------------
Equality on one column	B-tree single-column	`CREATE INDEX ON t(col)`
Equality + range	B-tree composite, equality col first	`CREATE INDEX ON t(a, b)` for `WHERE a=1 AND b>5`
Multiple equality filters	B-tree composite, most selective first	`CREATE INDEX ON t(a, b)` for `WHERE a=1 AND b=2`
Sort order match	B-tree with matching ASC/DESC	`CREATE INDEX ON t(col DESC)` for `ORDER BY col DESC`
Prefix / case-insensitive search	Expression index or trigram GiST	`CREATE INDEX ON t(LOWER(col))`
Full-text search	GIN (PG) / FULLTEXT (MySQL)	`CREATE FULLTEXT INDEX ON t(col)`
JSON / JSONB field access	GIN, or B-tree on expression	`CREATE INDEX ON t((json_col->>'field'))`
Geospatial queries	GiST (PG) / SPATIAL (MySQL)	`CREATE INDEX ON t USING GIST(geo)`
Covering query	B-tree with INCLUDE	`CREATE INDEX ON t(a) INCLUDE (b, c)` (PG)
Partial / filtered	B-tree with WHERE clause	`CREATE INDEX ON t(col) WHERE active=true`

Cardinality & Selectivity

Selectivity	Rows Matched	Recommended Action
-------------	-------------	-------------------
High	> 10%	Index unlikely to help; consider scan + partition pruning
Moderate	1–10%	Composite index may help; verify with EXPLAIN
Selective	< 1%	Index very effective; ensure leading column matches filter

Index Maintenance

Issue	Impact	Fix
-------	--------	-----
Missing index on FK column	Nested loop join → seq scan on inner table	Add index on FK column
Redundant indexes	Extra write overhead, cache pressure	Drop unused indexes (check `pg_stat_user_indexes` / `sys.dm_db_index_usage_stats`)
Index fragmentation (B-tree page splits)	More IO per scan, larger index size	Rebuild / REINDEX during low load
Over-indexed table (many indexes on one table)	Slow DML (each index updated per row)	Consolidate: composite index can replace multiple single-col indexes
Stale index statistics	Optimizer chooses suboptimal plan	`ANALYZE` (PG) / `UPDATE STATISTICS` (SQL Server) / `OPTIMIZE` (MySQL)

Database-Specific Optimization

MySQL

EXPLAIN ANALYZE SELECT ... \G          -- MySQL 8.0.18+, actual runtime

Using index = covering index, no table access — good
Using where = rows filtered after storage engine read — check selectivity
Using temporary = temp table created — avoid for large data
Using filesort = sort outside index — add composite index covering ORDER BY
Using index condition = index condition pushdown (ICP) — good, MySQL 5.6+
Impossible WHERE = WHERE always false — dead code
Covering index: If all queried columns are in the index, MySQL avoids heap lookup
MRR (Multi-Range Read): Sorts by PK before lookup — reduces random IO
BKA (Batched Key Access): Batches join lookups — reduces round trips
Buffer Pool: INNODB_BUFFER_POOL_SIZE should fit working set

PostgreSQL

EXPLAIN (ANALYZE, BUFFERS, TIMING) SELECT ...;

rows vs actual rows mismatch (> 10x) → stale stats, run ANALYZE
Parallel Seq Scan = normal for tables > 1 GB with sufficient max_parallel_workers_per_gather
Sort Method: external merge Disk → increase work_mem
CTEs are optimization fences (PG < 12): not inlined, always materialized. PG 12+ inlines CTEs, PG 16+ allows MATERIALIZED / NOT MATERIALIZED hint
NOT IN on nullable column → always returns 0 rows. Use NOT EXISTS
Autovacuum health critical: bloat → more page reads per scan
pg_stat_user_tables.n_dead_tup > 20% of n_live_tup → vacuum behind
pg_stat_user_indexes → check idx_scan to find unused indexes
pg_stat_statements → identify most time-consuming queries

SQL Server

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO
SELECT ...;

PAGEIOLATCH_SH wait type = disk IO bottleneck (insufficient buffer pool)
LCK_M_* wait types = blocking / lock contention
CXPACKET wait = parallelism skew (one thread slower than rest)
Table Scan vs Clustered Index Scan — both full scans, but CI scan may be narrower if clustered PK is narrow
Key Lookup (clustered) / RID Lookup (heap) = non-covering index → expensive per-row lookups. Add INCLUDE columns
Missing index suggestions → green text in graphical plan, or sys.dm_db_missing_index_details
Parameter sniffing → plan cached for first parameter's distribution. Fix: OPTION (RECOMPILE) or OPTIMIZE FOR UNKNOWN
NOLOCK hint = dirty reads. Prefer READ COMMITTED SNAPSHOT database setting

Oracle

-- After query runs, get actual plan:
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(FORMAT=>'ALLSTATS LAST +OUTLINE'));

TABLE ACCESS FULL — not always bad; Oracle multiblock reads are efficient for large tables
INDEX UNIQUE SCAN → ideally 1 row returned
INDEX RANGE SCAN → check number of rows scanned
NESTED LOOPS effective when driving set is small (< 10k rows) and inner table has unique index access
HASH JOIN preferred for large unindexed joins
SORT (AGGREGATE) without GROUP BY → aggregates all rows, verify cardinality first
Buffer gets vs consistent gets in AUTOTRACE TRACEONLY STATISTICS
SQL Profile / SQL Plan Baseline — lock in good plan without code changes
Star transformation — automatic rewrite for star-schema joins

Common Anti-Pattern Quick-Fix

Anti-Pattern	Quick Fix	Expected Gain
-------------	-----------	---------------
`WHERE YEAR(col) = 2024`	`WHERE col >= '2024-01-01' AND col < '2025-01-01'`	Index seek replaces seq scan
`WHERE LIKE '%keyword'`	Add full-text index + `MATCH ... AGAINST` / `@@`	Eliminate full scan
`OFFSET 10000 LIMIT 20`	Keyset: `WHERE id > 10000 LIMIT 20`	Skip 10000 discarded row reads
`SELECT *` in OLTP	Explicit column list only needed columns	Less IO, less network, less memory
`IN (SELECT ...)` (large set)	`EXISTS (SELECT 1 ...)`	Short-circuits on first match
`ORDER BY RAND()`	`TABLESAMPLE` / `WHERE id IN (random selection)`	Full scan replaced by direct access
Large `UPDATE/DELETE` in one shot	Batch in chunks of 1000 rows	Reduce lock duration, log growth, blocking
`OR` across different indexed cols	`UNION ALL` (each branch uses its own index)	Index seek per branch instead of full scan
`HAVING col = val` (no aggregate)	Move condition to `WHERE`	Filter before GROUP BY, fewer rows to aggregate
Correlated subquery	Rewrite as JOIN or window function	Eliminate row-by-row execution
`NOT IN (subquery)`	`NOT EXISTS`	Safe with NULLs, often faster
Multiple CTE references	Materialize explicitly or use temp table	Avoid repeated evaluation

Examples

Example 1: Function on Column

Query:

SELECT * FROM orders
WHERE YEAR(created_at) = 2024
ORDER BY total_amount DESC;

Diagnosis:

Problem	Impact	Suggestion
---------	--------	------------
`YEAR(created_at)` prevents index usage	Full table scan forced, index on `created_at` unusable	Replace with range predicate: `created_at >= '2024-01-01' AND created_at < '2025-01-01'`
`ORDER BY total_amount` not indexed	Filesort on all matching rows (~500k)	Add composite index: `(created_at, total_amount)`
`SELECT *` returns all columns	Extra IO on large text/blob columns	List only `id, user_id, total_amount, status`

Rewritten:

SELECT id, user_id, total_amount, status
FROM orders
WHERE created_at >= '2024-01-01'
  AND created_at <  '2025-01-01'
ORDER BY total_amount DESC;

Index:

CREATE INDEX idx_orders_created_amount ON orders(created_at, total_amount DESC);

Example 2: Missing Join Index