> Drop a folder of CSVs, Excels, and JSONs from 5 different teams; get back a single clean table, a deduplication report, and a data-quality scorecard. No manual schema mapping required.
Trigger keywords (中文): 清洗数据、数据清洗、合并数据、去重、缺失值、字段对齐、schema 合并、数据质量、数据预处理、ETL
Trigger keywords (EN): clean data, data cleaning, deduplicate, missing values, schema reconcile, ETL, data quality, profile dataset
Supported sources:
| Format | Description |
|---|---|
| CSV / TSV | Auto-detect encoding (UTF-8/GBK/BIG5), delimiter, quote char, header row |
| Excel (.xlsx/.xls/.xlsm) | Multi-sheet, merged cells, formula values |
| JSON / JSONL / NDJSON | Nested structures auto-flattened |
| Parquet / Feather | Native columnar reading |
| SQL dumps (.sql) | MySQL / PostgreSQL INSERT extraction |
| Log files | Pattern-detected structured lines |
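
The CSV/TSV row above mentions auto-detecting encoding and delimiter. A minimal sketch of one way this could work, using only the standard library: try candidate encodings in order, then let `csv.Sniffer` guess the delimiter. The function name `sniff_csv`, the encoding order, and the delimiter set are assumptions for illustration, not the pipeline's actual logic. Note that some Big5 byte sequences also decode as GBK without error, so a real detector would need a stronger heuristic.

```python
import csv

# Candidate encodings, tried in order (the order is an assumption of this sketch).
CANDIDATE_ENCODINGS = ("utf-8", "gbk", "big5")


def sniff_csv(raw: bytes, sample_chars: int = 4096):
    """Guess (encoding, delimiter) from the raw bytes of a CSV/TSV file."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding; try the next one
        # csv.Sniffer raises csv.Error if it cannot settle on a delimiter.
        dialect = csv.Sniffer().sniff(text[:sample_chars], delimiters=",\t;|")
        return encoding, dialect.delimiter
    raise ValueError("none of the candidate encodings could decode the input")
```

For example, `sniff_csv(open(path, "rb").read())` returns a pair that can be passed on to `open(path, encoding=...)` and `csv.reader(..., delimiter=...)`.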
Do NOT use when:
```bash
python3 scripts/profile.py --input <file-or-dir> --out profile.json
```
For each source produces:
scripts/normalize_types.py standardizes:
- Dates (2024-03-15, 2024/3/15, 15 Mar 2024, 民国113年3月15日, Excel serial) → ISO 8601
- Y/N/是/否/0/1/true/false/T/F/✓/✗ → boolean

Per-column strategy (configurable in `templates/missing_strategy.json`):
- `drop_row` - drop rows where this column is null
- `mean|median|mode` - statistical imputation (with imputation flag column)
- `constant:` - fill with literal
- `forward_fill` - for time-series
- `interpolate` - linear/spline for numeric series
- `keep_null` - preserve as null (default for unknown)

Critical rule: every imputed value gets a sidecar boolean column so downstream analysis can distinguish original vs. imputed data.
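
To make the sidecar-flag rule concrete, here is a minimal sketch of median imputation on a list of row dictionaries. The function name and the `__imputed` suffix are hypothetical; the real flag-column naming is not stated in this document.

```python
from statistics import median


def impute_median_with_flag(rows, column):
    """Fill nulls in `column` with the median and add a boolean sidecar column.

    Returns new row dicts; the input rows are left untouched.
    """
    observed = [r[column] for r in rows if r.get(column) is not None]
    if not observed:
        raise ValueError(f"column {column!r} has no observed values to impute from")
    fill_value = median(observed)
    flag_column = f"{column}__imputed"  # hypothetical naming convention

    result = []
    for row in rows:
        was_missing = row.get(column) is None
        new_row = dict(row)
        new_row[column] = fill_value if was_missing else row[column]
        new_row[flag_column] = was_missing  # True only for filled-in values
        result.append(new_row)
    return result
```

Downstream code can then filter on the flag column to exclude or separately analyze imputed values.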
scripts/reconcile_schema.py aligns columns across sources using:
- a user-supplied mapping file (`--mapping mapping.yaml`)

Outputs a `crosswalk.json` documenting every column mapping for audit.
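
As an illustration of the explicit-mapping path only, the sketch below matches source column names against target names and their aliases, case-insensitively, and returns a crosswalk dictionary. `build_crosswalk` is a hypothetical helper, and `reconcile_schema.py` may use additional signals that this sketch does not cover.

```python
def build_crosswalk(source_columns, target_schema):
    """Map each source column to a target column via case-insensitive alias match.

    `target_schema` has the shape {target_name: {"aliases": [...]}}.
    Unmatched source columns map to None so they can be reviewed by hand.
    """
    lookup = {}
    for target, spec in target_schema.items():
        # The target name itself counts as an alias.
        for name in [target, *spec.get("aliases", [])]:
            lookup[name.strip().lower()] = target

    return {col: lookup.get(col.strip().lower()) for col in source_columns}
```

Serializing the returned dictionary with `json.dump` would give an auditable record of which source column fed which target column.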
scripts/dedup.py uses configurable blocking + record linkage:
Reports merge groups for human review before commit.
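
The idea of blocking followed by pairwise comparison can be sketched with the standard library alone. This is not the `recordlinkage` API and not the logic in `dedup.py`: it swaps in `difflib.SequenceMatcher` as a stand-in similarity measure, and the function name, parameters, and the 0.85 threshold are assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def find_merge_candidates(records, block_key, compare_key, threshold=0.85):
    """Return (i, j, score) for record pairs that share a block and look alike.

    Blocking: only records with the same `block_key` value are compared,
    which avoids comparing every record against every other record.
    """
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[record[block_key]].append(index)

    candidates = []
    for indices in blocks.values():
        for i, j in combinations(indices, 2):
            score = SequenceMatcher(
                None, records[i][compare_key], records[j][compare_key]
            ).ratio()
            if score >= threshold:
                candidates.append((i, j, round(score, 3)))
    return candidates
```

The returned pairs are only candidates; in line with the review step described above, they would be written out for a human to confirm before any merge.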
Per CLEANER_PII_POLICY:
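- `keep` - leave as-is (use only with explicit user authorization)
- `mask` - partial mask (王三, 1385678, 4400*1234)
- `drop` - remove column entirely

Auto-detection of common PII: names, national ID numbers, mobile numbers, email addresses, postal addresses, bank card numbers, IP addresses, license plate numbers.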
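
A minimal sketch of how the three policies might be applied to one column. The masking shapes (first 3 and last 4 digits of an 11-digit mobile number, first character of an email local part) and all function names are assumptions for illustration, not the tool's documented behavior.

```python
import re


def mask_phone(value: str) -> str:
    """Keep the first 3 and last 4 digits of an 11-digit number; star the rest."""
    digits = re.sub(r"\D", "", value)
    if len(digits) != 11:
        return "*" * len(value)  # unknown shape: mask everything rather than leak
    return digits[:3] + "****" + digits[-4:]


def mask_email(value: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, sep, domain = value.partition("@")
    if not sep or not local:
        return "*" * len(value)
    return local[0] + "***@" + domain


def apply_policy(rows, column, policy, masker):
    """Apply keep / mask / drop to `column`, returning new row dicts."""
    if policy == "keep":
        return [dict(r) for r in rows]
    if policy == "drop":
        return [{k: v for k, v in r.items() if k != column} for r in rows]
    if policy == "mask":
        return [{**r, column: masker(r[column])} for r in rows]
    raise ValueError(f"unknown PII policy: {policy!r}")
```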
```bash
python3 scripts/quality_report.py --input cleaned.parquet --out dq_report.md
```
Six dimensions (per DAMA-DMBOK):
Each scored 0-100 with drill-down detail.
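
As a rough illustration of how a 0-100 dimension score could be computed, the sketch below scores completeness and uniqueness, two dimensions commonly used in data-quality work. The function names and formulas are assumptions and may differ from what `quality_report.py` does.

```python
def completeness_score(rows, columns):
    """Percentage of cells in `columns` that are neither None nor empty string."""
    total = len(rows) * len(columns)
    if total == 0:
        return 100.0  # nothing to check, so nothing is missing
    filled = sum(1 for r in rows for c in columns if r.get(c) not in (None, ""))
    return round(100.0 * filled / total, 1)


def uniqueness_score(rows, key_columns):
    """Percentage of rows whose key tuple is distinct within the dataset."""
    if not rows:
        return 100.0
    distinct_keys = {tuple(r.get(c) for c in key_columns) for r in rows}
    return round(100.0 * len(distinct_keys) / len(rows), 1)
```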
```
output/
├── cleaned.parquet          # main clean dataset (or .csv if requested)
├── crosswalk.json           # source → target schema mapping
├── dedup_groups.json        # merged record groups for review
├── dq_report.md             # human-readable data quality report
├── dq_report.json           # machine-readable DQ metrics
├── audit/
│   ├── per_source_profile.json
│   ├── imputation_log.csv
│   └── pii_actions.log
└── provenance.csv           # row-level lineage: which source each row came from
```
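> No data is dropped silently: every deletion, merge, and imputation is logged to `audit/`. Imputed values carry a flag column so they cannot pass for original values. Privacy fields are masked by default. Original files are never modified. Low-confidence fuzzy-dedup merges require human review. No data is uploaded to any external service.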
```bash
python3 scripts/run_pipeline.py \
  --input sales_q1.csv \
  --output-dir ./cleaned_q1/ \
  --pii-policy mask

python3 scripts/run_pipeline.py \
  --input ./customer_sources/ \
  --output-dir ./unified_customers/ \
  --dedup-keys name,phone \
  --priority-source crm_export.csv

python3 scripts/run_pipeline.py \
  --input ./multi_team_data/ \
  --mapping mapping.yaml \
  --output-dir ./unified/
```
mapping.yaml:

```yaml
target_schema:
  customer_id: { aliases: [客户ID, cust_id, ClientID, 编号] }
  phone: { aliases: [手机, 联系电话, Mobile, tel] }
  signup_date: { aliases: [注册日期, 开户日期, CreatedAt], type: date }
```
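
The `type: date` hint on `signup_date` implies that values from different sources end up in one date representation. A minimal sketch of such a coercion to ISO 8601, covering the input shapes named in the type-normalization step (ISO, slash-separated, `15 Mar 2024`, 民国 calendar years, Excel serials). The function name is hypothetical, the Excel branch assumes the 1900 date system, and `%b` parsing assumes an English locale.

```python
import re
from datetime import date, datetime, timedelta

EXCEL_EPOCH = date(1899, 12, 30)  # day 0 of Excel's 1900 date system
MINGUO_PATTERN = re.compile(r"^民国(\d+)年(\d+)月(\d+)日$")
TEXT_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d %b %Y")


def to_iso_date(value):
    """Convert a date given in one of several shapes to 'YYYY-MM-DD'."""
    # Numbers are treated as Excel serial day counts (fractional part dropped).
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return (EXCEL_EPOCH + timedelta(days=int(value))).isoformat()

    text = str(value).strip()

    # Minguo (ROC) calendar: year 1 corresponds to 1912, so add 1911.
    match = MINGUO_PATTERN.match(text)
    if match:
        roc_year, month, day = (int(part) for part in match.groups())
        return date(roc_year + 1911, month, day).isoformat()

    for fmt in TEXT_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"unrecognized date value: {value!r}")
```

Raising on unrecognized input, rather than guessing, keeps ambiguous values such as `03/04/2024` out of the cleaned table until a format is chosen explicitly.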
```bash
python3 scripts/profile.py --input ./suspicious_dataset/ --out dq_audit.md --read-only
```

```bash
cd tests && python3 -m pytest -v
```
Fixtures include:
pandas, pyarrow, recordlinkage library docs

data ETL data-cleaning dedup schema-reconcile data-quality 数据清洗 多源整合 去重 数据质量