v0.5.0
A small honest toolkit for the work agents end up doing constantly: read a CSV someone sent you, work out what's in it, clean it up, and forward only the safe rows downstream. Built on Python 3 standard library only. No pandas, no numpy, no pip installs, no remote calls.
scripts/inspect.py — profile a .csv / .tsv / .jsonl file: row count, auto-detected column types (int, float, bool, date, datetime, string, empty), null counts per column, distinct value counts (capped), three sample values per column, file size, and detected encoding.scripts/validate.py — check the file against a small JSON schema (required columns, per-column type, min/max, enum, regex, unique). Exits 0/1 so it slots into CI.scripts/dedupe.py — remove duplicate rows by full-row match or by key columns. Optional --keep first|last, --case-insensitive, --trim, and a JSONL report of every removed row.scripts/diff.py — compare two files by key column(s) and classify every row as added / removed / changed / unchanged, with a per-column before/after diff for changed rows.scripts/convert.py — convert between CSV, TSV, JSON Lines, JSON array, and GitHub-flavored Markdown table.scripts/head.py (NEW in v0.2.0) — print the first N rows in csv / tsv / jsonl / md / aligned format, with optional column subset.scripts/tail.py (NEW in v0.2.0) — print the last N rows using a streaming ring buffer (works on multi-gigabyte files without loading them).scripts/sample.py (NEW in v0.2.0) — pick a uniformly random sample of N rows via reservoir sampling. Single-pass, O(N) memory, optional --seed for reproducibility, optional --preserve-order to keep original row order.scripts/merge.py (NEW in v0.3.0) — join two files on one or more key columns. Supports inner / left / right / outer joins, separate key names per side via --left-on and --right-on, and duplicate-column disambiguation via --suffix-left / --suffix-right. Streams the LEFT side, indexes the RIGHT side (peak memory ≈ size of right file).scripts/pivot.py (NEW in v0.3.0) — group-by aggregations and wide pivot tables. Aggregations: count, sum, avg/mean, min, max, first, last, nunique. Set --pivot-on COL to produce a wide cross-tab (e.g. region × product, sum of revenue). Numeric-aware --sort-by so --sort-by revenue_sum --desc orders correctly.scripts/filter.py (NEW in v0.4.0) — keep rows that match a safe predicate (amount > 100, status in pending,approved, email =~ @example\.com$, name is_not_empty). Supports ==, !=, <, <=, >, >=, =~ (regex), in, contains, is_empty / is_not_empty / is_number / is_not_number. and / or / parentheses / not. NO Python eval — a hand-rolled tokenizer + recursive-descent parser. Optional --invert, --limit, --columns.scripts/sort.py (NEW in v0.4.0) — stable, type-aware sort. Auto-detects which columns are numeric and sorts them numerically (so 1200 > 899 > 100 > 50, not "50" > "1200"). Per-column direction with --by amount:desc,region:asc. Optional --case-insensitive, --limit, --numeric (force numeric on all sort cols).scripts/concat.py (NEW in v0.4.0) — stack files vertically (UNION ALL). Default mode unions the headers of all inputs; --strict requires identical headers; --add-source COL tags each row with its source filename; --dedupe drops exact-duplicate rows across inputs. Streams one file at a time.scripts/transform.py (NEW in v0.5.0) — add, modify, rename, drop, cast, or keep columns. Derived columns via a safe expression language (no eval): --add 'profit = revenue - cost', --add 'full_name = upper(first) + " " + upper(last)', --add 'year = year(signup_date)', --add 'safe = coalesce(value, default)'. Built-in functions: upper, lower, strip, len, abs, round, int, float, str, replace, split, join, coalesce, year, month, day. Chainable with --cast COL:int|float|bool|string, --rename OLD=NEW, --drop COL[,...], --keep COL[,...].scripts/check_deps.sh — verify python3 is available.bash scripts/check_deps.sh
Only python3 is required. The skill uses csv, json, re, pathlib, argparse, datetime, collections — all stdlib.
# First 10 rows in a clean aligned table
python3 scripts/head.py mystery.csv
# Last 5 rows of a multi-GB log
python3 scripts/tail.py huge.csv -n 5
# A reproducible random sample for spot-checking
python3 scripts/sample.py customers.csv -n 20 --seed 42
# Preview only specific columns
python3 scripts/head.py customers.csv --columns id,email,status
# Emit a previewable Markdown table for an agent's reply
python3 scripts/head.py customers.csv -n 5 --as md
All three scripts accept -n N, --as csv|tsv|jsonl|md|aligned, --output file, and --columns col1,col2,.... sample.py additionally accepts --seed INT and --preserve-order. Default output format is aligned — a fixed-width text table sized to the actual data, which is what an agent usually wants to show inline. Default N is 10.
Streaming guarantees:
head.py reads at most N+1 rows from the file.tail.py keeps a bounded deque(maxlen=N) and emits only the last N rows.sample.py uses reservoir sampling (algorithm R): single pass, O(N) memory regardless of file size.On a 100,000-row / 1.6 MB CSV: head -n 3 runs in ~50 ms, tail -n 3 in ~180 ms, sample -n 5 in ~260 ms.
python3 scripts/inspect.py customers.csv
Output:
file: /path/customers.csv
size: 284 B (284 bytes)
encoding: utf-8
kind: csv
rows: 5
columns: 6
# name type nulls null% distinct sample
----------------------------------------------------------------------------------------------------
1 id int 0 0.00 5 '1', '2', '3'
2 email string 0 0.00 5 'alice@example.com', ...
3 name string 0 0.00 5 'Alice', 'Bob', 'Carol'
4 amount float 1 20.00 4 '42.50', '100.00', '7.25'
5 status string 0 0.00 3 'approved', 'pending', ...
6 signup_date date 0 0.00 5 '2025-01-15', ...
Pass --json for machine-readable output that pipes into other tools.
The script auto-detects the dialect (CSV vs TSV vs JSON Lines) and a sensible encoding (utf-8, utf-8-sig, cp1252, latin-1). Type inference takes up to 1000 non-empty values per column and picks the most specific type that fits all of them.
Write a schema.json:
{
"required_columns": ["id", "email", "amount", "status"],
"columns": {
"id": {"type": "int", "required": true, "unique": true, "min": 1},
"email": {"type": "string", "required": true, "regex": ".+@.+\\..+"},
"amount": {"type": "float", "min": 0, "max": 100000},
"status": {"type": "string", "enum": ["pending", "approved", "rejected"]},
"signup_date": {"type": "date"}
}
}
Then:
python3 scripts/validate.py customers.csv --schema schema.json
A clean file exits 0 with verdict: pass. A bad file exits 1 with a detailed error table:
row column kind detail
------------------------------------------------------------------------------------------------
2 email regex_mismatch value did not match regex | value='not-an-email'
2 amount bad_type value does not match type 'float' | value='abc'
3 amount below_min value -50.0 < min 0 | value='-50.00'
3 status not_in_enum value not in allowed set | value='unknown_status'
4 id duplicate_unique value already seen earlier in this column | value='1'
Pass --json for a structured report and --max-errors N to cap collection on huge files.
By full-row match (any two rows identical in every column):
python3 scripts/dedupe.py messy.csv clean.csv
By a key column (only one canonical row per id):
python3 scripts/dedupe.py messy.csv clean.csv --key id \
--removed-report removed.jsonl
--keep first (default) keeps the earlier-occurring row; --keep last keeps the later one — useful when later rows are corrections. --case-insensitive and --trim normalise key values before comparison so " alice@example.com" and "ALICE@example.com" collapse to one row.
The --removed-report writes one JSON object per removed row, with the original 1-based row index, the key tuple that was duplicated, and the full row, so the dedup decision is auditable.
python3 scripts/diff.py customers_old.csv customers_new.csv --key id
Output:
added: 1
removed: 1
changed: 1
--- ADDED (1) ---
+ 6
--- REMOVED (1) ---
- 4
--- CHANGED (1) ---
~ 2
amount: '100.00' -> '150.00'
status: 'pending' -> 'approved'
Multi-column keys are supported: --key customer_id,date. Exit codes are 0 if the files are identical on the key columns, 1 if they differ — so this also works as a CI guard ("fail the build if the snapshot file changed").
python3 scripts/convert.py data.csv data.jsonl # row -> JSON Lines
python3 scripts/convert.py data.jsonl data.csv # back
python3 scripts/convert.py data.csv data.json --pretty
python3 scripts/convert.py data.csv data.md # GitHub-flavored table
python3 scripts/convert.py data.tsv data.csv # delimiter change
Output format is picked from the extension. Allowed extensions: .csv, .tsv, .jsonl, .json, .md. The Markdown writer escapes | and \n in cell values so the table stays well-formed.
# Inner join: only users with at least one order
python3 scripts/merge.py users.csv orders.csv joined.csv \
--left-on id --right-on user_id
# Left join: keep every user, fill unmatched with empty strings
python3 scripts/merge.py users.csv orders.csv left.csv \
--left-on id --right-on user_id --how left
# Same key name on both sides: --on shorthand
python3 scripts/merge.py users.csv orders.csv out.csv --on user_id
# Outer join into JSON Lines, machine-readable summary on stdout
python3 scripts/merge.py a.csv b.csv full.jsonl --on key --how outer --json
Duplicate non-key columns are auto-renamed with --suffix-left / --suffix-right (defaults _x / _y).
# Sum revenue per region
python3 scripts/pivot.py sales.csv by_region.csv \
--group-by region --agg revenue:sum --sort-by revenue_sum --desc
# Multiple aggregations per group
python3 scripts/pivot.py sales.csv detail.csv \
--group-by region,product \
--agg "units:sum,revenue:sum,revenue:avg,product:nunique"
# Wide cross-tab: region × product matrix of revenue
python3 scripts/pivot.py sales.csv crosstab.csv \
--group-by region --pivot-on product --agg revenue:sum --fill 0
# Same wide pivot rendered as Markdown for a report
python3 scripts/pivot.py sales.csv crosstab.md \
--group-by region --pivot-on product --agg revenue:sum --fill "-"
Aggregation functions: count, sum, avg/mean, min, max, first, last, nunique. Output column names follow (e.g. revenue_sum). --sort-by is numeric-aware: numeric columns are ordered numerically, string columns lexicographically.
# Numeric comparison
python3 scripts/filter.py orders.csv big.csv --where "amount > 100"
# Combine boolean conditions
python3 scripts/filter.py orders.csv top.csv \
--where "status == approved and amount >= 50"
# Set membership (commas are part of the value, not separators)
python3 scripts/filter.py users.csv targeted.csv \
--where "country in IN,US,UK and signup_year >= 2024"
# Regex match
python3 scripts/filter.py users.csv company.csv \
--where 'email =~ @example\.com$'
# Null / type checks (no right-hand side)
python3 scripts/filter.py users.csv missing.csv --where "phone is_empty"
# Invert the predicate, write only specific columns, cap at N matches
python3 scripts/filter.py log.csv non_errors.csv \
--where "level == ERROR" --invert --columns ts,msg --limit 1000
The expression language is deliberately small and is parsed by a hand-rolled tokenizer + recursive-descent parser. There is no eval, no shell, no subprocess.
# Auto-numeric sort, descending
python3 scripts/sort.py sales.csv s.csv --by amount:desc
# Multi-key: country ascending, signup_date descending (stable)
python3 scripts/sort.py users.csv s.csv --by country:asc,signup_date:desc
# Top 10 by revenue
python3 scripts/sort.py sales.csv top10.csv --by revenue:desc --limit 10
# Case-insensitive string sort
python3 scripts/sort.py contacts.csv s.csv --by name --case-insensitive
Each --by column is treated numerically when every value parses as a number, otherwise string. --numeric forces numeric on all sort columns (non-numeric rows sort last).
# Stack monthly shards into one CSV (header union)
python3 scripts/concat.py all_quarter.csv jan.csv feb.csv mar.csv
# Strict mode: require every input to have an identical header
python3 scripts/concat.py all.csv jan.csv feb.csv mar.csv --strict
# Tag each row with the source filename (without extension)
python3 scripts/concat.py tagged.csv shard_*.csv --add-source origin --source-stem
# Stack + drop duplicate rows across files
python3 scripts/concat.py all.csv jan.csv feb.csv apr.csv --dedupe
# Add a derived column
python3 scripts/transform.py orders.csv with_profit.csv \
--add 'profit = revenue - cost'
# Multiple --add operations + cast + final column selection
python3 scripts/transform.py sales.csv clean.csv \
--add 'profit = revenue - cost' \
--add 'margin_pct = round(profit / revenue * 100, 1)' \
--add 'name = lower(strip(first_name)) + "_" + lower(strip(last_name))' \
--add 'signup_year = year(signup)' \
--cast revenue:float --cast cost:float \
--keep id,name,country,revenue,profit,margin_pct,signup_year
# Rename and drop
python3 scripts/transform.py users.csv clean.csv \
--rename 'signup=joined_date' --drop password_hash
# Boolean comparisons produce 0/1 columns
python3 scripts/transform.py orders.csv flagged.csv \
--add 'is_high_value = amount > 1000'
# Fallback for missing values
python3 scripts/transform.py users.csv filled.csv \
--add 'safe_email = coalesce(email, "unknown@example.com")'
The expression language is intentionally small: arithmetic (+ - * / %), string concat (+), comparisons (== != < <= > >= → yield 0/1), parentheses, identifiers (column references), string and number literals, and function calls. No eval, no subprocess, no shell. Empty cells that propagate into arithmetic leave the derived value empty for that row instead of crashing the pipeline.
| Code | Meaning |
|---|---|
| --- | --- |
| 0 | success / validation pass / files identical |
| 1 | validation fail / files differ / no rows in input |
| 2 | bad arguments / unsafe path / missing input / unsupported extension / schema malformed |
This 0/1/2 split is consistent across all five scripts, so they slot into shell pipelines cleanly:
python3 scripts/validate.py incoming.csv --schema schema.json \
&& python3 scripts/dedupe.py incoming.csv clean.csv --key id \
&& python3 scripts/inspect.py clean.csv
subprocess calls. No shell invocation.;, |, &, >, <, $, ` ``, backslash-newline, etc.).inspect.py profiles 10,000 rows in well under one second on a single core (single-pass streaming read).dedupe.py and diff.py, which build an in-memory dict keyed by row identity — fine for hundreds of thousands of rows on a typical laptop."1,234.56" is detected as string, not float. Re-export with a different number format if you need different inference.scripts/transform.py: derived columns + schema operations. Hand-rolled tokenizer + recursive-descent parser (no eval, no subprocess) supports arithmetic (+/-/* / / %), string concat (+), parentheses, function calls, and boolean comparisons that yield 0/1. Built-in functions: upper, lower, strip, len, abs, round, int, float, str, replace, split, join, coalesce, year, month, day. Six op kinds: --add, --set, --drop, --rename, --cast, --keep. Schema is computed symbolically before the streaming pass, so empty cells in arithmetic columns don't crash the whole pipeline — they just leave the derived value empty for that single row.scripts/filter.py: safe-predicate row filter. Hand-rolled tokenizer + recursive-descent parser, NO eval and no subprocess. Numeric and string compare, regex (=~), in COMMA,LIST, contains, is_empty / is_not_empty / is_number / is_not_number. Boolean and / or / not with parentheses. --invert, --limit, --columns.scripts/sort.py: type-aware stable sort with per-column direction (--by amount:desc,region:asc). Auto-detects numeric columns. Optional --case-insensitive, --limit, --numeric. 100k rows sorted in ~0.3 s.scripts/concat.py: vertical UNION ALL of multiple CSV / TSV / JSONL files. Header union by default, --strict for exact-match check, --add-source to tag rows, --dedupe to drop exact-duplicate rows. Streams one input at a time, memory does not grow with the number of inputs (unless --dedupe is set).scripts/merge.py: join two CSV / TSV / JSONL files on one or more key columns. Supports inner, left, right, outer joins; separate key names per side; duplicate-column suffixing; CSV / TSV / JSONL output. Single-pass over LEFT, peak memory ≈ size of RIGHT. Merged 50k × 200k rows in ~1.3 s.scripts/pivot.py: group-by aggregations and wide pivot tables. Functions: count, sum, avg, min, max, first, last, nunique. Wide mode produces region × product cross-tabs. Numeric-aware sort. Streamed 100k rows in ~0.7 s.Three new preview helpers (head.py, tail.py, sample.py):
head.py and tail.py give shell-style preview that is format-aware and never mangles quoting the way a naive head / tail would. They auto-detect dialect (csv/tsv/jsonl), let you pick the output format with --as, and can re-emit any subset of columns with --columns.sample.py runs reservoir sampling (algorithm R): a single streaming pass, O(N) memory regardless of file size. --seed INT makes the sample reproducible so it slots into test suites and CI; --preserve-order re-sorts the reservoir back into original row order.--as csv|tsv|jsonl|md|aligned, --output, and --columns flags, mirroring the convention already used by convert.py.aligned, a fixed-width text table that an agent can paste straight into a reply.Performance: on a 100,000-row / 1.6 MB CSV, head runs in ~50 ms, tail in ~180 ms, sample in ~260 ms.
No breaking changes: every v0.1.0 CLI flag, output format, and exit-code contract is preserved.
MIT. See LICENSE.
共 3 个版本