Tool repository: https://github.com/zixixr/paperconan
Use this skill when the user:
- provides data tables (.xlsx / .csv / .tsv) and asks for a data integrity / sanity check

Input is .xlsx / .csv / .tsv only (not .xls, PDFs, or images).

Install:
pip install paperconan # base install (xlsx / csv / tsv)
pip install "paperconan[all]" # + supplementary PDF / Word table extraction
# Dev install from a clone: pip install -e /path/to/paperconan
Verify with paperconan --help (or paperconan --version).
A complete worked example — synthetic data + the report it produces + a guided
walkthrough of every finding — lives in the repo's
examples/ directory. Read it to see the output
shape before running on real data.
Single command, takes a directory of data tables (.xlsx / .csv / .tsv):
paperconan <input-dir>
# Default output: <input-dir>/audit/scan.json + <input-dir>/audit/report.html
Common variants:
paperconan <input-dir> --out /tmp/audit-X # custom output dir
paperconan <input-dir> --md # also write REPORT.md
paperconan <input-dir> --no-html # only scan.json (CI / scripted use)
Exit code is 0 even when findings are present — findings are not errors.
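For scripted or CI use, the variants above can be wrapped in a few lines of Python. This is a minimal sketch, not part of paperconan: the helper names `build_command` and `run_audit` are hypothetical, and it assumes `paperconan` is on PATH and that `scan.json` is plain JSON written to the output directory (default `<input-dir>/audit`, as described above).

```python
import json
import subprocess
from pathlib import Path


def build_command(input_dir, out=None, md=False, html=True):
    """Build the paperconan argv using only the flags documented above."""
    cmd = ["paperconan", str(input_dir)]
    if out is not None:
        cmd += ["--out", str(out)]
    if md:
        cmd.append("--md")
    if not html:
        cmd.append("--no-html")
    return cmd


def run_audit(input_dir, out=None):
    """Run a JSON-only audit and return the parsed scan.json (hypothetical helper)."""
    out_dir = Path(out) if out is not None else Path(input_dir) / "audit"
    # check=True only catches real failures: the exit code is 0 even with findings.
    subprocess.run(build_command(input_dir, out=out, html=False), check=True)
    return json.loads((out_dir / "scan.json").read_text(encoding="utf-8"))


print(build_command("data", out="/tmp/audit-X", html=False))
```

Because findings do not change the exit code, a CI job that should fail on findings has to inspect the parsed `scan.json` itself rather than rely on the process status.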
If the user gives a paper (DOI or title) instead of a local directory:
paperconan fetch "<DOI or title>" # list candidate datasets + match signals
paperconan fetch "<DOI>" --download <cand_id> --out data/ # download chosen candidate's tabular files
paperconan data/ # then audit as usual
Workflow:
1. Run paperconan fetch "<DOI or title>". Each candidate has match_signals (doi_in_related, title_overlap, author_overlap).
2. Prefer a candidate with doi_in_related: true; otherwise weigh title/author overlap. If unsure, show the user the candidates and ask. Repository full-text search (especially figshare/zenodo) often returns unrelated deposits, so fetch --auto refuses to download a candidate with no DOI match / weak title overlap (it falls back to journal guidance), and fetch --download of such a candidate requires --force. A candidate flagged ⚠ no DOI/title match in the listing is probably not this paper's data.
3. Run paperconan on the output.
4. If the output contains no .xlsx/.csv/.tsv, say so and name the other file types.

fetch now prints a journal-guidance block derived from the DOI (publisher + doi.org article link + where that publisher puts source data, e.g. Nature's ...MOESM). Relay it; never imply "checked = clean".
for open-access papers it serves supplementary material as one zip, and fetch
downloads it and extracts the tabular members automatically. Dryad is
discovery/listing only — its download API needs authentication, so report Dryad hits
to the user and point them to the Dryad dataset page to download the files manually.
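The candidate-selection rule in the workflow can be sketched as a small function. This is an illustration under stated assumptions, not paperconan's own logic: it assumes the candidate listing has been parsed into dicts with a `match_signals` dict, that `title_overlap` is a number, and that an `id` key holds the candidate id; the 0.5 threshold is arbitrary.

```python
def pick_candidate(candidates, min_title_overlap=0.5):
    """Return a candidate that looks safe to download, or None to ask the user.

    Assumes each candidate is a dict with a "match_signals" dict; the numeric
    title_overlap scale and the 0.5 threshold are illustrative assumptions.
    """
    # A DOI link back to the paper is the strongest signal.
    for cand in candidates:
        if cand.get("match_signals", {}).get("doi_in_related") is True:
            return cand
    # Otherwise accept only a single candidate with clear title overlap.
    strong = [
        c for c in candidates
        if (c.get("match_signals", {}).get("title_overlap") or 0) >= min_title_overlap
    ]
    return strong[0] if len(strong) == 1 else None


candidates = [  # hypothetical listing, not real fetch output
    {"id": "c1", "match_signals": {"doi_in_related": False, "title_overlap": 0.2, "author_overlap": 0.0}},
    {"id": "c2", "match_signals": {"doi_in_related": True, "title_overlap": 0.9, "author_overlap": 0.5}},
]
chosen = pick_candidate(candidates)
print(chosen["id"] if chosen else "ask the user")
```

Returning `None` rather than a best guess mirrors the guidance above: when no candidate is clearly this paper's data, show the candidates to the user and ask.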
Three artifacts may exist in the output dir:
| File | Audience | Purpose |
|---|---|---|
| scan.json | you (the agent) | full structured findings; parse this when analyzing |
| report.html | the user | self-contained interactive HTML report; tell the user to open it in a browser |
| REPORT.md | optional | markdown report; only present with --md |
{
  "tool": "paperconan",
  "tool_version": "0.4.0",  // for provenance when the report is archived / shared
  "scanned_at": "2026-05-29T02:08:53+00:00",
  "input_dir": "...",
  "paper": {"doi": "10.1038/...", "title": "..."},  // provenance, or null (see below)
  "n_files": 3,
  "n_blocks_with_findings": 8,
  "relations_blocks": [
    {
      "file": "ED_Fig8b.xlsx",
      "sheet": "Sheet1",
      "block": {"rows": "6-15", "cols": "1-30", "header": [...]},
      "relations": [...],                // cross-column relations
      "progressions": [...],             // arithmetic progressions
      "equal_pairs": [...],              // pairs of columns with many equal rows
      "within_col": [...],               // within-column anomalies
      "identical_after_rounding": [...]  // cells matching after rounding
    }
  ],
  // per-sheet last-digit χ². Each: {label, n, chi2, p, p_adj, fdr_significant, counts, top}
  // Filter on fdr_significant (BH-FDR q ≤ 0.05), NOT raw p: dozens of sheets are tested.
  "digit_distribution": [...],
  // per-sheet two-decimal ending counts. Each: {label, n, n_unique, top}
  "decimal_endings": [...],
  // bit-identical / value-overlap across sheets (same file OR cross-file). See fields below.
  "cross_sheet_findings": [...]
}
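A short sketch of reading the structure above once it has been parsed. It assumes the on-disk `scan.json` is plain JSON (the `//` comments above are annotations); the helper names and the sample values, including the label format, are hypothetical.

```python
def significant_digit_sheets(scan):
    """Return digit_distribution entries that survive BH-FDR correction."""
    return [d for d in scan.get("digit_distribution", []) if d.get("fdr_significant")]


def blocks_by_file(scan):
    """Count relations_blocks entries per file."""
    counts = {}
    for block in scan.get("relations_blocks", []):
        counts[block["file"]] = counts.get(block["file"], 0) + 1
    return counts


scan = {  # hypothetical values in the documented shape
    "paper": None,
    "relations_blocks": [{"file": "ED_Fig8b.xlsx", "sheet": "Sheet1"}],
    "digit_distribution": [
        {"label": "a.xlsx/Sheet1", "n": 120, "p": 0.03, "p_adj": 0.21, "fdr_significant": False},
        {"label": "b.xlsx/Sheet2", "n": 300, "p": 0.0001, "p_adj": 0.004, "fdr_significant": True},
    ],
}
print([d["label"] for d in significant_digit_sheets(scan)])
print(blocks_by_file(scan))
```

In the sample, the first sheet has a raw p below 0.05 but is not FDR-significant, so filtering on `fdr_significant` drops it; filtering on raw `p` would have reported it.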
paper provenance is populated from a paperconan_source.json sidecar that
paperconan fetch --download/--auto writes alongside the data, or from
paperconan . It is null when neither is present
(a bare directory audit) — never read null as "no paper".
Each finding carries:
- kind: detector name (see references/detectors.md)
- severity: "high" | "medium" | "low"
- rule: human-readable rule string, e.g. col[27] ≡ col[28] in 9/10 rows
- n: sample size for the rule
- evidence: block snippet {headers, rows, highlight_cols, ...}; used by report.html, but you can also surface a few highlighted values if useful
- likely_benign (optional): a common innocent explanation for this kind; surface it to the user alongside the finding so a signal is never reported as a verdict
- dense_block (optional, column-relation / equal-pair findings): true means this finding comes from a sheet that floods with pairwise column relations (a dense / correlated matrix: correlation tables, normalized replicate panels). Such findings are auto-demoted to low severity because identical/linear columns there are expected by construction, not a duplication red flag; don't treat them as a high-severity signal.

cross_sheet_findings entries also carry:
- same_file: whether the two sheets live in one workbook or span two files
- figure_a / figure_b / same_figure: parsed figure identity (e.g. main:5, ext:6). When same_figure is true the overlap is a combined-vs-individual re-plot of one display item; it is downgraded to low and carries a context note. Cross-figure / cross-file overlaps keep high/medium and are the ones worth checking against the legend.
- delta: how the two near-duplicate tables differ, as {pattern, modified_cells, shared_values, only_in_a, only_in_b}. pattern is one of:
  - perfect_dup: identical value multiset (clean re-plot)
  - superset: one side strictly contains the other (e.g. an extra replicate column, n=5 vs n=6)
  - value_tweaked: cells changed in place (copy-then-tweak fingerprint; most worth investigating)
  - value_divergent: both sides hold values the other lacks

cross_sheet_position_identical is the single most-investigated paperconan signal: same position, same value, across "independent" sheets.

... high and medium together.

Point the user to report.html. That file has the actual table snippets with the suspicious cells highlighted, which is much easier for the user than re-reading xlsx.

paperconan output is a statistical anomaly, NOT a determination of misconduct. Do not:
Do:
Full response template lives in references/interpretation.md.
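As a rough illustration of how the finding fields (severity, delta.pattern, likely_benign) could be turned into an ordered summary, here is a minimal sketch. The helper names, the sort order within a severity level, and the sample findings are all assumptions made for this example; the actual response template is the one in references/interpretation.md.

```python
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def triage(findings):
    """Sort by severity, putting value_tweaked deltas first within a level."""
    def key(f):
        pattern = (f.get("delta") or {}).get("pattern")
        return (SEVERITY_ORDER.get(f.get("severity"), 3),
                0 if pattern == "value_tweaked" else 1)
    return sorted(findings, key=key)


def describe(f):
    """One-line summary that keeps the benign explanation next to the signal."""
    line = f"[{f['severity']}] {f['kind']}: {f['rule']}"
    if f.get("likely_benign"):
        line += f" (possible benign cause: {f['likely_benign']})"
    return line


findings = [  # hypothetical findings; rule strings are placeholders
    {"kind": "cross_sheet_position_identical", "severity": "low",
     "rule": "hypothetical rule A", "same_figure": True,
     "delta": {"pattern": "perfect_dup"}},
    {"kind": "cross_sheet_position_identical", "severity": "high",
     "rule": "hypothetical rule B", "delta": {"pattern": "superset"},
     "likely_benign": "extra replicate column"},
    {"kind": "cross_sheet_position_identical", "severity": "high",
     "rule": "hypothetical rule C", "delta": {"pattern": "value_tweaked"}},
]
for f in triage(findings):
    print(describe(f))
```

Keeping `likely_benign` on the same line as the rule is one way to make sure a signal is never presented to the user as a verdict.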