Tidy Excel Data
Overview
Use this skill to turn messy spreadsheet data into a clean, auditable workbook. Favor a reversible workflow: inspect first, preserve raw data when useful, clean deterministically, and report every meaningful assumption.
Workflow
- Confirm the user's goal and output format when it is not obvious.
- Inspect workbook structure: sheet names, dimensions, merged cells, header rows, blank rows or columns, data types, duplicate candidates, and obvious ID/date/amount fields.
- Choose the safest cleaning level:
- Light cleanup: trim text, remove empty rows/columns, normalize headers, preserve all records.
- Standard cleanup: light cleanup plus duplicate removal and common type coercion.
- Custom cleanup: apply user-provided rules for columns, filters, joins, pivots, or sheet splitting.
- Use
scripts/tidy_excel.py for repeatable workbook cleanup when it fits the task. - For complex business rules, write a short task-specific script or notebook-like Python snippet, then save the cleaned workbook and a QA summary.
- Verify the output by reopening the generated workbook or reading key sheets back with pandas/openpyxl.
Quick Start
Run the bundled script for common cleanup:
python scripts/tidy_excel.py input.xlsx --output cleaned.xlsx
Useful options:
python scripts/tidy_excel.py input.xlsx --output cleaned.xlsx --dedupe --coerce-types
python scripts/tidy_excel.py input.xlsx --output cleaned.xlsx --sheet Orders --summary qa_summary.csv
python scripts/tidy_excel.py input.csv --output cleaned.xlsx --header-row 2
The script writes cleaned sheets to an Excel workbook and adds a _tidy_summary sheet with row counts, column counts, duplicate counts, missing-value counts, and inferred column types.
Cleaning Rules
Apply these defaults unless the user gives different instructions:
- Preserve the original file; write a new workbook instead of overwriting.
- Normalize headers to lowercase
snake_case; keep names unique with numeric suffixes. - Remove fully empty rows and columns.
- Trim leading/trailing whitespace in text cells and collapse repeated internal whitespace.
- Treat common empty markers (
n/a, na, null, none, -, empty strings) as missing values. - Do not delete partial rows unless the user asks for filtering.
- Deduplicate only when requested or when rows are exact duplicates and the user asked for "clean/organize".
- Keep leading-zero identifiers as text unless the user explicitly wants numeric conversion.
- Avoid destructive date or currency conversion when formats are ambiguous.
Data Quality Checks
Use references/quality-checks.md when deciding what to inspect or report. Include a short QA note in the final response covering:
- workbook and sheet names processed
- rows/columns before and after cleanup
- duplicate rows removed, if any
- columns with high missing rates
- assumptions or conversions applied
- output file path
Custom Tasks
For merges, pivots, category mapping, multi-row headers, or domain-specific rules, inspect the workbook first and adapt the workflow. Prefer pandas for tabular transformations and openpyxl when preserving workbook formatting, formulas, or sheet-level structure matters.