Clean Text Toolkit

Local text cleanup and inspection toolkit. Extract structured items (URLs, emails, phones, IPs, dates, hashtags, money), redact PII (email/phone/credit-card-...
gopendrasharma89-tech
Uncategorized · clawhub · v0.4.0 · 4 versions · 100000 · Key: not required
Stars: 1, Downloads: 531, Installs: 0, Versions: 4
#agent #dedupe #diff #extract #html #jinja-free #latest #links #markdown #normalize #pii #redact #regex #replace #sanitize #scrape #sed #slug #sort #stdlib #strip #tables #template #text #wordcount

Overview

clean-text-toolkit

v0.4.0

A small, honest local toolkit for the work agents end up doing constantly: read some text someone sent you, find the structured bits, clean it up, redact the secrets, and forward it downstream. Built on Python 3 standard library only. No pandas, no nltk, no pip installs, no remote calls.

This skill is the companion to clean-csv-toolkit: that one handles structured tabular data, this one handles unstructured text.
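
To make the read, extract, clean, redact workflow described above concrete, here is a minimal standard-library sketch of the idea. It is not the toolkit's actual code: the regex patterns, the function names (extract, clean, redact) and the [EMAIL_n] placeholder format are all hypothetical and deliberately simplified.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for illustration only).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"')]+")


def extract(text):
    """Find the structured bits: emails and URLs, in order of appearance."""
    return {"email": EMAIL_RE.findall(text), "url": URL_RE.findall(text)}


def clean(text):
    """Trim each line, collapse runs of spaces/tabs, and drop blank lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def redact(text):
    """Replace each email with a numbered placeholder, reused for repeated values."""
    seen = {}

    def placeholder(match):
        value = match.group(0)
        if value not in seen:
            seen[value] = f"[EMAIL_{len(seen) + 1}]"
        return seen[value]

    return EMAIL_RE.sub(placeholder, text)


if __name__ == "__main__":
    raw = "  Contact   ann@example.com \n\n or see https://example.com/docs \n ann@example.com again "
    print(extract(raw))
    print(redact(clean(raw)))
```

Keeping each step as a plain function over strings means the steps can be chained in any order, which is the same shape the toolkit's separate scripts suggest.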

What this skill does

  • scripts/extract.py — pull structured items out of any text file. Kinds: url, email, phone, ipv4, ipv6, hashtag, mention, hex-color, money, iso-date. Output to stdout (one-per-line or JSON), or to a .txt / .json / .jsonl file. Optional --unique, --sort, --with-line (prefix with the source line number).
  • scripts/normalize.py — clean up messy text. Chainable transforms applied in command-line order: --trim, --collapse-spaces, --strip-blank, --to-unix, --to-crlf, --dehyphenate (rejoin OCR/PDF hyphenated line-breaks), --unsmart (smart quotes / em-dashes → ASCII), --strip-bom, --strip-zwsp (zero-width spaces and joiners), --tabs-to-spaces N, --spaces-to-tabs N, --lower / --upper / --title, --normalize-unicode NFC|NFD|NFKC|NFKD.
  • scripts/redact.py — anonymize text by replacing PII-like patterns with placeholder tokens. Kinds: email, phone, ipv4, ipv6, url, credit-card (with Luhn validation to suppress false positives), ssn-us, uuid, hex-token (32+ hex chars, typical for tokens / hashes), aws-access-key (AKIA…), jwt (three base64url segments with the eyJ header). --keep-counts makes the same value always get the same placeholder; --preserve-length pads/truncates the placeholder to the original length.
  • scripts/lines.py — line-oriented utilities. --op count | dedupe | sort | shuffle | head | tail. count, head, and tail stream their input. dedupe and sort hold every line in memory (O(N) in the number of lines), which is still fine for about 1 M typical short lines on a laptop. Options: --case-insensitive, --keep first|last, --numeric, --reverse, --seed for deterministic shuffles.
  • scripts/wordcount.py — word / character / line / sentence statistics. Optional --top N for most-frequent words, --stopwords PATH, --min-length N, --ignore-case, --regex PATTERN (default [A-Za-z']+).
  • scripts/diff_text.py — three-mode text diff using stdlib difflib. --mode unified (default), --mode side (custom two-column layout), --mode html (writes a full HTML file with red/green coloring). --ignore-case, --ignore-whitespace, --context N.
  • scripts/template.py (NEW in v0.2.0) — substitute placeholders in a text file with values from a JSON object or inline --set key=value overrides. Mustache ({{name}}), dollar (${name}), or percent (%(name)s) syntax. Filters: upper, lower, title, strip, capitalize, reverse, len, escape-html, escape-json, urlencode. Default values: {{name ?Unknown}}. Strict mode (--strict) exits 1 if any placeholder is unresolved. No Jinja2, no eval.
  • scripts/slug.py (NEW in v0.2.0) — turn strings into URL-safe slugs. Single string mode (--text "Hello World") or batch mode (line-in-file -> line-out-file). Options: --separator, --max-length, --no-lower, --ascii (Unicode -> ASCII transliteration via NFKD), --keep-dots (useful for filenames), --dedupe.
  • scripts/markdown.py (NEW in v0.2.0) — strip Markdown to plain text, render a minimal HTML approximation, or extract structured items (headings, links, images, code blocks, list items) as JSON / JSONL / TSV. In text mode, --link-style anchor|url|both controls how links are rendered.
  • scripts/replace.py (NEW in v0.3.0) — find-and-replace with regex / literal / word-boundary modes, capture-group back-references (\1, \2), multiple --find/--replace pairs in a single pass, or a JSON --rules file with per-rule settings. --dry-run previews matches with line:col and context; --max N caps replacements per rule. Returns exit 1 when zero replacements happen so it slots into CI.
  • scripts/htmlstrip.py (NEW in v0.4.0) — strip HTML tags from scraped pages. Three modes: text (collapse to plain readable text, drop