
Hfpclawer Citation Audit

Verify academic paper citations using a three-tier fallback pipeline: local FTS5 database → Semantic Scholar API → OpenAlex API. Supports single citation che...
Author: diamond2nv
Source: clawhub, v0.5.0 (1 version). API key: not required


Verify whether a cited academic paper actually exists, using a three-tier pipeline that degrades gracefully when local data or remote APIs are unavailable.

> Who this is for: Researchers, reviewers, and literature-survey authors who need to check whether a citation refers to a real paper.

Overview

The audit engine tries three sources in order, stopping at the first confirmation:

                    ┌──────────────────────────┐
 User:              │  hfpclawer audit verify   │
 "Is this paper     │  "Fourier Neural Operator"│
 real?"             └─────────────┬────────────┘
                                  │
                    ┌─────────────┼─────────────┐
                    ▼             ▼              ▼
              ┌─────────┐  ┌──────────┐  ┌──────────┐
              │ L1:     │  │ L2:      │  │ L3:      │
              │ Local   │→ │ Semantic │→ │ OpenAlex │
              │ FTS5 DB │  │ Scholar  │  │          │
              │ (1ms)   │  │ (200ms)  │  │ (200ms)  │
              └─────────┘  └──────────┘  └──────────┘

Each source independently reports one of four statuses:

  • VERIFIED — the paper exists in this source
  • SUSPECTED — possible match (similar title, but not exact)
  • NOT_FOUND — no match found
  • ERROR — source unavailable (no local DB / API rate-limited)
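
The fallback order described above can be sketched in a few lines of Python. This is only an illustration of the idea, not hfpclawer's actual code: the `audit` function, the `Source` callable type, and the rule for combining statuses when no source confirms the paper are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical source signature: takes a title, returns one of the four statuses.
Source = Callable[[str], str]


def audit(title: str, sources: List[Tuple[str, Source]]) -> Dict[str, object]:
    """Try each source in order and stop at the first VERIFIED result."""
    per_source: Dict[str, str] = {}
    for name, lookup in sources:
        try:
            status = lookup(title)
        except Exception:
            # An unavailable source is recorded as ERROR instead of crashing.
            status = "ERROR"
        per_source[name] = status
        if status == "VERIFIED":
            return {"status": "VERIFIED", "per_source": per_source}
    # Assumed aggregation rule: report the strongest signal that was seen.
    for fallback in ("SUSPECTED", "NOT_FOUND"):
        if fallback in per_source.values():
            return {"status": fallback, "per_source": per_source}
    return {"status": "ERROR", "per_source": per_source}


# Stub sources standing in for the local DB, Semantic Scholar, and OpenAlex.
def local_db(title: str) -> str:
    raise FileNotFoundError("arxiv_meta.db missing")


def semantic_scholar(title: str) -> str:
    return "NOT_FOUND"


def openalex(title: str) -> str:
    return "VERIFIED"


result = audit(
    "Fourier Neural Operator",
    [("local", local_db), ("s2", semantic_scholar), ("openalex", openalex)],
)
print(result["status"], result["per_source"])
```

With the stubs above, the missing local DB is recorded as ERROR, Semantic Scholar reports NOT_FOUND, and the chain stops as soon as OpenAlex confirms the paper.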

When to Use

  • A user cites a paper you cannot find — verify its existence
  • Your team is writing a survey / literature review — batch audit the reference list
  • You downloaded an LLM-generated paper and need to fact-check its citations
  • You want to know whether a paper is a known arXiv preprint or a non-existent hallucination

Prerequisites

  • pip install "hfpclawer>=0.5.0" (quote the requirement so the shell does not treat >= as an output redirect)
  • No API keys needed for basic use (S2 + OpenAlex use anonymous tier)
  • Optional: Set S2_API_KEY env var for 10x faster Semantic Scholar lookups
  • Optional: Set OPENALEX_POLITE_EMAIL env var for 10x faster OpenAlex lookups
  • Optional: Clone arxiv-metadata-service repo for L1 local FTS5 (see references/local-db-setup.md)

Quick Start

1. Verify a Single Citation (most common)

# Auto mode: tries local DB first, then Semantic Scholar, then OpenAlex
hfpclawer audit verify "Fourier Neural Operator for Parametric Partial Differential Equations"

# Short title works too — includes substring fallback
hfpclawer audit verify "Fourier Neural Operator"

# Exact arXiv ID
hfpclawer audit verify --arxiv-id 2010.08895

2. Use a Specific Source

# Local FTS5 only (needs arxiv_meta.db)
hfpclawer audit verify "Attention Is All You Need" --source local

# Semantic Scholar only
hfpclawer audit verify "Attention Is All You Need" --source s2

# OpenAlex only
hfpclawer audit verify "Attention Is All You Need" --source openalex

3. Check a Reference List from File

# Save citations in a text file, one per paragraph
cat > refs.txt << 'EOF'
The FNO paper (arXiv:2010.08895) shows promising results.
PINNs were introduced by Raissi et al. (2019) "Physics-informed neural networks".
EOF

hfpclawer audit --refs refs.txt

Output Format

Each result shows:

  • [OK] VERIFIED — paper confirmed; includes title, authors, source
  • [?] SUSPECTED — possible but uncertain; shows top matches
  • [NF] NOT_FOUND — no evidence of this paper
  • [ERR] ERROR — source unavailable (DB not found, rate limited)
Example output:

[OK] VERIFIED
  Title: Fourier Neural Operator
  Authors: Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, ...
  Sources: openalex: VERIFIED

How Statuses Are Determined

| Status | Local DB | Semantic Scholar | OpenAlex |
|:------:|:--------:|:----------------:|:--------:|
| VERIFIED | FTS5 match with title similarity ≥ 0.70 | Title search ≥ 0.70 | Title search ≥ 0.70 |
| SUSPECTED | FTS5 match with score 0.40-0.69 | | |
| NOT_FOUND | No FTS5 results | No ≥ 0.70 match | No ≥ 0.70 match |
| ERROR | DB not found / corrupt | 429/5xx / network | 429/5xx / network |

Title matching: Title similarity uses difflib.SequenceMatcher on normalized (lowercase, punctuation-stripped) titles. Short titles that are substrings of longer titles also pass the 0.70 threshold.
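
As a rough illustration of the matching rule, the sketch below normalizes two titles, scores them with `difflib.SequenceMatcher`, and maps the score onto the thresholds from the table. The helper names (`normalize`, `title_similarity`, `classify`) and the way the substring rule lifts the score to 0.70 are assumptions for this example; the package's real implementation may differ.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(cleaned.split())


def title_similarity(query: str, candidate: str) -> float:
    """Return a similarity score in [0, 1] for two titles."""
    q, c = normalize(query), normalize(candidate)
    if not q or not c:
        return 0.0
    score = SequenceMatcher(None, q, c).ratio()
    # Assumed substring rule: a short title contained in a longer one
    # is lifted to the 0.70 threshold.
    if q in c or c in q:
        score = max(score, 0.70)
    return score


def classify(score: float) -> str:
    """Map a score onto the thresholds listed in the status table."""
    if score >= 0.70:
        return "VERIFIED"
    if score >= 0.40:
        return "SUSPECTED"
    return "NOT_FOUND"


short = "Fourier Neural Operator"
full = "Fourier Neural Operator for Parametric Partial Differential Equations"
print(classify(title_similarity(short, full)))
```

A plain substring test is permissive: a one-word query would match many unrelated titles, which is one reason to prefer queries with several significant words.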

Batch Modes

From a Text File

hfpclawer audit --refs references.txt

The parser detects:

  • arXiv:XXXX.XXXXX identifiers
  • "Title" (Author, Year) patterns
  • Author (Year) "Title" patterns
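
The three patterns can be approximated with regular expressions. The sketch below is a guess at how such a parser might look, not the parser shipped with hfpclawer; the pattern names and the `extract_citations` function are hypothetical, and real reference lists will contain formats these expressions miss.

```python
import re
from typing import Dict, List

# arXiv:XXXX.XXXXX identifiers (4 or 5 digits after the dot).
ARXIV_ID = re.compile(r"arXiv:\s*(\d{4}\.\d{4,5})", re.IGNORECASE)
# "Title" (Author, Year)
TITLE_FIRST = re.compile(r'"([^"]+)"\s*\(([^,()]+),\s*(\d{4})\)')
# Author (Year) "Title", with an optional "et al." after the surname.
AUTHOR_FIRST = re.compile(r'([A-Z][\w-]+(?: et al\.)?)\s*\((\d{4})\)\s*"([^"]+)"')


def extract_citations(text: str) -> List[Dict[str, str]]:
    """Collect every citation-like fragment found in the text."""
    found: List[Dict[str, str]] = []
    for m in ARXIV_ID.finditer(text):
        found.append({"arxiv_id": m.group(1)})
    for m in TITLE_FIRST.finditer(text):
        found.append(
            {"title": m.group(1), "author": m.group(2).strip(), "year": m.group(3)}
        )
    for m in AUTHOR_FIRST.finditer(text):
        found.append({"author": m.group(1), "year": m.group(2), "title": m.group(3)})
    return found


refs = (
    "The FNO paper (arXiv:2010.08895) shows promising results.\n\n"
    'PINNs were introduced by Raissi et al. (2019) "Physics-informed neural networks".'
)
for citation in extract_citations(refs):
    print(citation)
```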

Via Python API

from hfpclawer.citation_audit import check_citation

result = check_citation(
    "Fourier Neural Operator",
    authors_hint="Li",
    year_hint=2020,
    source="auto",       # or "local" / "s2" / "openalex"
)
print(result["status"])  # VERIFIED | SUSPECTED | NOT_FOUND | ERROR
print(result.get("authors", "N/A"))
print(result.get("per_source", {}))  # Per-source breakdown
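
To audit many titles from Python, one option is a small loop that tallies statuses. The sketch below takes the checker as a parameter so it runs without the package installed; `audit_titles` and `fake_checker` are hypothetical names, and the only assumption about the real `check_citation` is that it returns a dict with a "status" key, as in the example above.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

# Any callable that maps a title to a result dict containing "status".
Checker = Callable[[str], Dict[str, object]]


def audit_titles(
    titles: Iterable[str], checker: Checker
) -> Tuple[List[Tuple[str, str]], Counter]:
    """Run the checker on every title and count the resulting statuses."""
    rows: List[Tuple[str, str]] = []
    for title in titles:
        try:
            status = str(checker(title).get("status", "ERROR"))
        except Exception:
            # Keep going if one lookup fails; record it as ERROR.
            status = "ERROR"
        rows.append((title, status))
    return rows, Counter(status for _, status in rows)


# Stub checker for demonstration only; it knows exactly one paper.
def fake_checker(title: str) -> Dict[str, object]:
    known = {"Attention Is All You Need"}
    return {"status": "VERIFIED" if title in known else "NOT_FOUND"}


rows, summary = audit_titles(
    ["Attention Is All You Need", "A Paper That Does Not Exist"], fake_checker
)
for title, status in rows:
    print(f"{status:<10} {title}")
print(dict(summary))
```

With the package installed, the stub could be swapped for `lambda t: check_citation(t, source="auto")`.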

Common Pitfalls

  1. Short or two-word queries may fail L1 because FTS5's porter stemmer requires actual content words. Use at least 3-4 significant words for local DB queries.

  2. Semantic Scholar rate-limits aggressively without an API key (~1 req/s, ~100 req/day anonymous). Set S2_API_KEY for production use.

  3. The OpenAlex polite pool is free and gives 10 req/s; set OPENALEX_POLITE_EMAIL to your institutional email.

  4. No L1 without arxiv-metadata-service: the local FTS5 DB requires a git clone of the separate arxiv-metadata-service repo. Without it, the auto chain starts at L2 (slower but still works).
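
When auditing a long list against the anonymous tiers, spacing requests on the client side helps stay under the limits mentioned above. The `Throttle` class below is a hypothetical helper, not part of hfpclawer, and the 10 req/s figure used when `S2_API_KEY` is set is an assumption derived from the "10x faster" note in Prerequisites.

```python
import os
import time


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous permitted call

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


# Assumed rates: ~1 req/s anonymous, 10 req/s when an API key is configured.
s2_rate = 10.0 if os.environ.get("S2_API_KEY") else 1.0
s2_throttle = Throttle(s2_rate)
print(f"Semantic Scholar spacing: {s2_throttle.min_interval:.2f}s between requests")
```

Calling `s2_throttle.wait()` before each Semantic Scholar lookup keeps a batch run under the chosen rate.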

Verification Checklist

  • [ ] Single citation works: hfpclawer audit verify "Known Paper Title"
  • [ ] arXiv ID works: hfpclawer audit verify --arxiv-id 2010.08895
  • [ ] Non-existent paper returns NOT_FOUND
  • [ ] Network errors return ERROR (not crash)
  • [ ] Batch mode processes multiple citations from file
  • [ ] hfpclawer audit verify --help shows source options

Version History

1 version in total

  • v0.5.0 (current)
    2026-05-21 15:05, security status: safe

Security Scan

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
