v0.4.0
A small, honest local toolkit for the work agents end up doing constantly: read some text someone sent you, find the structured bits, clean it up, redact the secrets, and forward it downstream. Built on the Python 3 standard library only. No pandas, no nltk, no pip installs, no remote calls.
This skill is the companion to clean-csv-toolkit: that one handles structured tabular data, this one handles unstructured text.
- scripts/extract.py: pull structured items out of any text file. Kinds: url, email, phone, ipv4, ipv6, hashtag, mention, hex-color, money, iso-date. Output to stdout (one-per-line or JSON), or to a .txt / .json / .jsonl file. Optional --unique, --sort, --with-line (prefix with the source line number).
- scripts/normalize.py: clean up messy text. Chainable transforms applied in command-line order: --trim, --collapse-spaces, --strip-blank, --to-unix, --to-crlf, --dehyphenate (rejoin OCR/PDF hyphenated line-breaks), --unsmart (smart quotes / em-dashes → ASCII), --strip-bom, --strip-zwsp (zero-width spaces and joiners), --tabs-to-spaces N, --spaces-to-tabs N, --lower / --upper / --title, --normalize-unicode NFC|NFD|NFKC|NFKD.
- scripts/redact.py: anonymize text by replacing PII-like patterns with placeholder tokens. Kinds: email, phone, ipv4, ipv6, url, credit-card (with Luhn validation to suppress false positives), ssn-us, uuid, hex-token (32+ hex chars, typical for tokens / hashes), aws-access-key (AKIA…), jwt (three base64url segments with the eyJ header). --keep-counts makes the same value always get the same placeholder; --preserve-length pads/truncates the placeholder to the original length.
- scripts/lines.py: line-oriented utilities. --op count | dedupe | sort | shuffle | head | tail. count, head, and tail stream. dedupe and sort are O(N) memory in the number of lines, but each line is small so 1 M lines is fine on a laptop. Options: --case-insensitive, --keep first|last, --numeric, --reverse, --seed for deterministic shuffles.
- scripts/wordcount.py: word / character / line / sentence statistics. Optional --top N for most-frequent words, --stopwords PATH, --min-length N, --ignore-case, --regex PATTERN (default [A-Za-z']+).
- scripts/diff_text.py: three-mode text diff using stdlib difflib. --mode unified (default), --mode side (custom two-column layout), --mode html (writes a full HTML file with red/green coloring). Options: --ignore-case, --ignore-whitespace, --context N.
- scripts/template.py (NEW in v0.2.0): substitute placeholders in a text file with values from a JSON object or inline --set key=value overrides. Mustache ({{name}}), dollar (${name}), or percent (%(name)s) syntax. Filters: upper, lower, title, strip, capitalize, reverse, len, escape-html, escape-json, urlencode. Default values: {{name ?Unknown}}. Strict mode (--strict) exits 1 if any placeholder is unresolved. No Jinja2, no eval.
- scripts/slug.py (NEW in v0.2.0): turn strings into URL-safe slugs. Single string mode (--text "Hello World") or batch mode (line-in-file -> line-out-file). Options: --separator, --max-length, --no-lower, --ascii (Unicode -> ASCII transliteration via NFKD), --keep-dots (useful for filenames), --dedupe.
- scripts/markdown.py (NEW in v0.2.0): strip Markdown to plain text, render a minimal HTML approximation, or extract structured items (headings, links, images, code blocks, list items) as JSON / JSONL / TSV. For text mode, --link-style anchor|url|both controls how links are rendered.
- scripts/replace.py (NEW in v0.3.0): find-and-replace with regex / literal / word-boundary modes, capture-group back-references (\1, \2), multiple --find/--replace pairs in a single pass, or a JSON --rules file with per-rule settings. --dry-run previews matches with line:col and context; --max N caps replacements per rule. Returns exit 1 when zero replacements happen so it slots into CI.
- scripts/htmlstrip.py (NEW in v0.4.0): strip HTML tags from scraped pages. Three modes: text (collapse to plain readable text, drop script / style content, preserve line breaks at block tags), html (sanitize: remove script, style, iframe, object, embed, form, input tags, all on* event-handler attributes, and inline style=; keep the rest intact), extract (pull links/images/headings/tables as JSON/JSONL/TSV). Built on Python stdlib html.parser. The single most-asked-for agent capability: turn scraped HTML into something useful in one command.
- scripts/check_deps.sh: verify python3 is available.

The streaming paths (extract, lines --op count|head|tail, and the wordcount chars/lines counters) read one line at a time.

python3 scripts/extract.py app.log --kind email --unique --sort
python3 scripts/extract.py app.log --kind email --output emails.txt --unique
python3 scripts/extract.py article.md --kind url --with-line
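To make the --unique, --sort, and --with-line options concrete, here is a minimal sketch of a line-at-a-time extraction pass. The regex, the `extract_emails` function, and its parameters are hypothetical illustrations, not the code inside scripts/extract.py.

```python
import re
from typing import Iterable, List

# Hypothetical pattern: a simple user@host.tld shape, not the toolkit's own regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(lines: Iterable[str], unique: bool = False,
                   sort: bool = False, with_line: bool = False) -> List[str]:
    """Scan lines one at a time and collect email-shaped matches."""
    seen = set()
    found = []
    for lineno, line in enumerate(lines, start=1):
        for match in EMAIL_RE.finditer(line):
            value = match.group(0)
            if unique:
                if value in seen:
                    continue  # skip values already emitted
                seen.add(value)
            # Prefix with the 1-based source line number when requested.
            found.append(f"{lineno}:{value}" if with_line else value)
    return sorted(found) if sort else found


sample = ["contact bob@example.com or amy@example.org", "cc: bob@example.com"]
print(extract_emails(sample, unique=True, sort=True))
# ['amy@example.org', 'bob@example.com']
```

Only the set of already-seen values is held across lines, which is why a pass like this can work on an open file handle without loading the whole file.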
python3 scripts/normalize.py scanned.txt clean.txt \
--strip-bom --to-unix --dehyphenate --collapse-spaces \
--unsmart --strip-blank --normalize-unicode NFC
The transforms run in the order you list them on the command line.
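Order matters because each transform sees the output of the previous one. The sketch below uses three hypothetical stand-ins for --to-unix, --dehyphenate, and --collapse-spaces (the real implementations may differ) to show the same input producing different results under two orderings.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stand-ins for three of the flags; not the toolkit's own code.
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    # Convert CRLF and lone CR line endings to LF.
    "to-unix": lambda s: s.replace("\r\n", "\n").replace("\r", "\n"),
    # Rejoin a word split as "exam-\nple" into "example".
    "dehyphenate": lambda s: re.sub(r"(\w)-\n(\w)", r"\1\2", s),
    # Squeeze runs of spaces and tabs into a single space.
    "collapse-spaces": lambda s: re.sub(r"[ \t]+", " ", s),
}


def apply_in_order(text: str, names: List[str]) -> str:
    """Apply the named transforms left to right, like flags on a command line."""
    for name in names:
        text = TRANSFORMS[name](text)
    return text


raw = "an exam-\r\nple   text"
print(repr(apply_in_order(raw, ["to-unix", "dehyphenate", "collapse-spaces"])))
# 'an example text'
print(repr(apply_in_order(raw, ["dehyphenate", "to-unix", "collapse-spaces"])))
# 'an exam-\nple text'
```

In the second ordering the dehyphenation step still sees a CRLF line ending, so its pattern never matches. Under these assumptions, putting --to-unix before --dehyphenate is the safer default.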
python3 scripts/redact.py transcript.txt safe.txt
# default kinds = all
# default placeholder = [REDACTED_{kind}_{i}]
# Only redact emails and phones, give the same email the same placeholder
python3 scripts/redact.py transcript.txt safe.txt \
--kinds email,phone --keep-counts
# Custom template
python3 scripts/redact.py log.txt safe.txt \
--token-template "<<{kind}#{i}>>"
# Pad placeholder to match original length (for fixed-width layouts)
python3 scripts/redact.py log.txt safe.txt --preserve-length
Credit-card matches are validated against the Luhn checksum, so most random 16-digit runs are rejected (a random digit string passes Luhn only about one time in ten) and the bulk of false positives are suppressed.
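For reference, the Luhn check itself is short. This is a generic sketch of the standard algorithm; the `luhn_valid` name is an assumption and the toolkit's internal function may be organized differently.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True (a well-known test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```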
# Quick file stats
python3 scripts/lines.py haystack.txt --op count
# Drop duplicates, case-insensitive
python3 scripts/lines.py users.txt --op dedupe --case-insensitive --output unique.txt
# Numeric sort (so "100" > "23" > "7")
python3 scripts/lines.py scores.txt --op sort --numeric --reverse
# Deterministic shuffle
python3 scripts/lines.py prompts.txt --op shuffle --seed 42
# Look at the head and tail of a multi-gig log
python3 scripts/lines.py huge.log --op head -n 20
python3 scripts/lines.py huge.log --op tail -n 20
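The interaction between --case-insensitive and --keep first|last is easiest to see in code. The following is a minimal sketch under the assumption that "keep" selects which occurrence of a duplicate survives while the surviving lines stay in their original relative order; `dedupe` is a hypothetical name, not the function in scripts/lines.py.

```python
from typing import Iterable, List


def dedupe(lines: Iterable[str], case_insensitive: bool = False,
           keep: str = "first") -> List[str]:
    """Drop duplicate lines, keeping either the first or the last occurrence."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    items = list(lines)
    if keep == "last":
        items.reverse()  # the last occurrence becomes the first one seen
    seen = set()
    kept = []
    for line in items:
        key = line.casefold() if case_insensitive else line
        if key not in seen:
            seen.add(key)
            kept.append(line)  # keep the original spelling, not the folded key
    if keep == "last":
        kept.reverse()  # restore the original relative order
    return kept


users = ["Alice", "bob", "alice", "Bob"]
print(dedupe(users, case_insensitive=True))               # ['Alice', 'bob']
print(dedupe(users, case_insensitive=True, keep="last"))  # ['alice', 'Bob']
```

The `seen` set grows with the number of distinct lines, which matches the O(N) memory note for dedupe.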
# Basic stats
python3 scripts/wordcount.py essay.txt
# Top words with stopwords filter
python3 scripts/wordcount.py essay.txt --top 20 --ignore-case --stopwords stop.txt
# Machine-readable output
python3 scripts/wordcount.py essay.txt --top 10 --json > stats.json
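A rough sketch of how --top, --ignore-case, --stopwords, and --min-length could combine, using the documented default word pattern. The `top_words` function and its parameter names are assumptions for illustration, not the script's actual interface.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

WORD_RE = re.compile(r"[A-Za-z']+")  # the documented default for --regex


def top_words(text: str, top: int, ignore_case: bool = False,
              stopwords: Iterable[str] = (),
              min_length: int = 1) -> List[Tuple[str, int]]:
    """Count words matching WORD_RE and return the `top` most frequent."""
    # Fold the stopword list too, so "The" and "the" are both filtered.
    stop = {w.lower() for w in stopwords} if ignore_case else set(stopwords)
    counts = Counter()
    for word in WORD_RE.findall(text):
        if ignore_case:
            word = word.lower()
        if len(word) >= min_length and word not in stop:
            counts[word] += 1
    return counts.most_common(top)


text = "The cat saw the other cat. The end."
print(top_words(text, top=1, ignore_case=True, stopwords=["the"]))
# [('cat', 2)]
```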
# Standard unified diff
python3 scripts/diff_text.py before.txt after.txt
# Side-by-side
python3 scripts/diff_text.py before.txt after.txt --mode side
# HTML report (colorized) for sharing
python3 scripts/diff_text.py before.txt after.txt --mode html --output diff.html
# Whitespace-insensitive compare
python3 scripts/diff_text.py before.txt after.txt --ignore-whitespace
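The unified mode maps directly onto stdlib `difflib.unified_diff`. The sketch below assumes that "ignore whitespace" means collapsing whitespace runs and trimming line ends before comparing; the `unified` helper is hypothetical and the script may normalize differently.

```python
import difflib
from typing import List


def unified(before: str, after: str, context: int = 3,
            ignore_whitespace: bool = False) -> List[str]:
    """Return unified-diff lines for two strings."""
    def prep(text: str) -> List[str]:
        lines = text.splitlines()
        if ignore_whitespace:
            # Assumption: collapse whitespace runs and trim both ends.
            lines = [" ".join(line.split()) for line in lines]
        return lines

    return list(difflib.unified_diff(
        prep(before), prep(after),
        fromfile="before.txt", tofile="after.txt",
        n=context, lineterm=""))


a = "alpha\nbeta   gamma\ndelta\n"
b = "alpha\nbeta gamma\nDELTA\n"
for line in unified(a, b, ignore_whitespace=True):
    print(line)
```

With whitespace ignored, only the delta / DELTA line is reported as changed. One consequence of normalizing first is that the diff shows the normalized lines rather than the original spacing.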
| Code | Meaning |
|---|---|
| 0 | success / one or more matches / files identical |
| 1 | zero matches / zero redactions / files differ / empty input |
| 2 | bad arguments / unsafe path / missing input / unknown kind / bad regex / unsupported output extension |
This 0 / 1 / 2 split is consistent across the scripts, so they slot into shell pipelines cleanly:
# Normalize, then redact, then count words in one shot
python3 scripts/normalize.py raw.txt clean.txt --to-unix --dehyphenate \
&& python3 scripts/redact.py clean.txt safe.txt \
&& python3 scripts/wordcount.py safe.txt --top 10
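One thing to keep in mind with `&&` chaining: per the table, exit 1 also covers "zero redactions", so a clean input would stop the chain at the redact step. A caller that wants to treat exit 1 as "nothing found" and only abort on exit 2 could use a small driver like the sketch below. The function names are hypothetical, the stand-in commands only exit with a fixed code, and whether a script still writes its output file on exit 1 is not stated here.

```python
import subprocess
import sys
from typing import List

OK, NOTHING_FOUND, BAD_INPUT = 0, 1, 2  # the exit codes from the table


def run_step(argv: List[str]) -> int:
    """Run one command without a shell and return its exit code."""
    return subprocess.run(argv, check=False).returncode


def run_pipeline(steps: List[List[str]], tolerate_empty: bool = True) -> int:
    """Run steps in order; stop at the first code that is not tolerated."""
    for argv in steps:
        code = run_step(argv)
        if code == OK:
            continue
        if code == NOTHING_FOUND and tolerate_empty:
            continue  # "zero matches" is not treated as a failure
        return code
    return OK


def fake(code: int) -> List[str]:
    """Stand-in command that exits with `code`, so the sketch runs anywhere."""
    return [sys.executable, "-c", f"import sys; sys.exit({code})"]


print(run_pipeline([fake(0), fake(1), fake(0)]))  # 0
print(run_pipeline([fake(0), fake(2), fake(0)]))  # 2
```

Passing `tolerate_empty=False` reproduces the stricter `&&` behavior, where any nonzero exit ends the run.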
- No pip install.
- No subprocess calls. No shell invocation.
- Unsafe paths are rejected (shell metacharacters such as ;, |, &, >, <, $, `, etc.). The same safe_path() helper that powers clean-csv-toolkit.
- Reads fall back to utf-8-sig, cp1252, latin-1 if needed. Writes are always UTF-8.
- shuffle --seed N is reproducible; extract and wordcount always emit results in the same order for a given input.
- lines.py --op dedupe processes 100,000 short lines (500 distinct) in ~0.06 s.
- lines.py --op sort processes 100,000 lines in ~0.10 s.
- extract.py scans the file in a single streaming pass; memory does not grow with file size.
- The email regex accepts user@host.tld shapes but does not validate that host.tld resolves. phone accepts three telltale formats (+, (XXX) XXX-XXXX, XXX-XXX-XXXX / XXX XXX XXXX) so it doesn't grab IPs, dates, or credit-card numbers, but it will miss exotic local formats.
- credit-card uses the Luhn checksum, but hex-token (and similar high-recall patterns) intentionally over-match; review the count before sharing redacted output publicly.
- diff_text.py --mode html produces the standard difflib.HtmlDiff markup, which embeds inline styles. The file is portable but the styling is not customizable.
- scripts/htmlstrip.py: HTML → plain text / sanitized HTML / structured extract. Built on stdlib html.parser. Three modes (text / html / extract), keeps links optionally, drops