
Reference Harvester | Auto PDF & Citation Finder

Extracts references from a PDF research paper, converts them to BibTeX (.bib), downloads as many referenced papers as possible (arXiv + API search), and generates a Word document listing unfound references with official download URLs. Use when the user asks to extract references, collect cited papers, download bibliography PDFs, harvest references, or build a reference library from an academic paper.
ZhangChenxi

Overview

Reference Harvester

Automated pipeline for extracting, converting, and downloading all references from a research paper PDF.

Workflow Overview

Task Progress:
- [ ] Step 1: Extract full text from PDF
- [ ] Step 2: Parse references into individual entries
- [ ] Step 3: Enrich metadata via WebFetch (authors, journal, DOI)
- [ ] Step 4: Generate BibTeX (.bib) file
- [ ] Step 5: Download arXiv papers
- [ ] Step 6: Multi-source OA download (PMC, bioRxiv, MDPI, Nature, Unpaywall...)
- [ ] Step 7: Generate Word doc for unfound references
- [ ] Step 8: Verify results

Prerequisites

Install Python and Node.js dependencies before running scripts:

pip install pdfplumber
npm install docx

Step 1: Extract Full Text

Run: python scripts/extract_text.py

This uses pdfplumber to extract text from every page of the PDF. The output is saved as a plain text file for subsequent parsing.
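A minimal sketch of this step, assuming pdfplumber's open() and extract_text() calls. The function names and the placement of the "--- PAGE N ---" separators are illustrative and not taken from extract_text.py.

```python
def join_pages(page_texts):
    """Join per-page text, prefixing each page with a '--- PAGE N ---' marker."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- PAGE {number} ---")
        parts.append(text or "")  # extract_text() can return None for empty pages
    return "\n".join(parts)


def extract_text(pdf_path):
    """Extract text from every page of a PDF (hypothetical helper)."""
    import pdfplumber  # imported lazily so join_pages works without it

    with pdfplumber.open(pdf_path) as pdf:
        return join_pages(page.extract_text() for page in pdf.pages)
```

Writing the returned string to a .txt file gives the plain-text input that the parsing step expects.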

Step 2: Parse References

Run: python scripts/parse_references.py

Locates the "References" section and parses individual entries. Supports multiple formats via auto-detection:

Format A — Numbered references (e.g., 1. Title, URL):

  • Detected when the first non-header line after the section marker matches ^\d+\.\s.
  • Each entry starts at a line matching ^\d+\.\s+.

Format B — Author-first references (traditional academic style):

  • A new reference starts when the current line matches an author-name pattern (Firstname L...) AND the previously accumulated text looks complete (ends with a year or URL).
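  • A second pass splits any accidentally merged references by detecting a "year. AuthorName" pattern mid-string, i.e., a year and a period followed directly by what looks like the first author of the next entry.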
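The Format A rules can be sketched as follows. This is an illustration under the stated regex only; detect_format and split_numbered are hypothetical names, and the author-first heuristics of Format B are not reproduced here.

```python
import re

NUMBERED = re.compile(r"^\d+\.\s+")


def detect_format(lines):
    """Return 'numbered' if the first non-empty line looks like '1. ...'."""
    for line in lines:
        stripped = line.strip()
        if stripped:
            return "numbered" if NUMBERED.match(stripped) else "author_first"
    return "author_first"


def split_numbered(lines):
    """Group lines into entries; a new entry starts at each 'N. ' line."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if NUMBERED.match(line) or not entries:
            entries.append(NUMBERED.sub("", line))
        else:
            entries[-1] += " " + line  # continuation of the previous entry
    return entries
```

A wrapped line that happens to begin with a number and a period (for example a year ending a sentence) would be treated as a new entry, which is one reason a second clean-up pass is useful.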

Critical (broken-URL repair): PDF extraction often inserts spaces inside URLs (e.g., https:// doi.org/...). The parser applies fix_broken_urls() to all extracted text, collapsing "https:// doi.org/..." into "https://doi.org/..." before any further processing. This is essential for correct DOI and URL extraction.
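One possible shape for fix_broken_urls(), shown as a sketch rather than the script's actual body. The first substitution is the common case of a space after the scheme; the www. and doi.org/ rules are added assumptions covering the other breakages this document mentions.

```python
import re


def fix_broken_urls(text):
    """Remove whitespace that PDF extraction inserted inside URLs (sketch)."""
    text = re.sub(r"(https?://)\s+", r"\1", text)          # 'https:// doi.org'
    text = re.sub(r"(https?://www\.)\s+", r"\1", text)     # 'https://www. nature.com'
    text = re.sub(r"(doi\.org/)\s+(?=10\.)", r"\1", text)  # 'doi.org/ 10.1016/...'
    return text
```

Spaces inside later path segments are not handled here, because they cannot be told apart from the end of the URL without more context.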

Output: a JSON array of objects with fields: index, title, url, doi, arxiv_id, raw.
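To illustrate the output shape, the sketch below builds one such object from a raw entry. The regular expressions and the title heuristic (everything before the first URL) are assumptions made for this example, not the parser's real rules.

```python
import re

URL_RE = re.compile(r"https?://[^\s,;]+")
DOI_RE = re.compile(r"\b(10\.\d{4,9}/[^\s,;]+)")
ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", re.IGNORECASE)


def _first(pattern, text, group=0):
    """Return the first match (trailing periods removed) or None."""
    match = pattern.search(text)
    return match.group(group).rstrip(".") if match else None


def to_record(index, raw):
    """Build one output object with the fields index, title, url, doi, arxiv_id, raw."""
    return {
        "index": index,
        "title": raw.split("http", 1)[0].strip(" ,."),  # text before the first URL
        "url": _first(URL_RE, raw),
        "doi": _first(DOI_RE, raw, 1),
        "arxiv_id": _first(ARXIV_RE, raw, 1),
        "raw": raw,
    }
```

Fields that cannot be found are set to None, so later steps can test for a missing DOI or arXiv ID directly.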

The parser also handles:

  • Stripping footnote back-references (↩, ↩2, etc.)
  • Skipping page markers (--- PAGE N ---), standalone page numbers, and common PDF headers
  • Recognizing Chinese/multilingual reference section headers (参考文献, 参考资料, References, Bibliography)
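The clean-up rules above could look roughly like this. It is a sketch: reference_lines is a hypothetical helper, and choosing the last matching header in the document is an assumption.

```python
import re

SECTION_HEADERS = {"参考文献", "参考资料", "references", "bibliography"}
PAGE_MARKER = re.compile(r"^--- PAGE \d+ ---$")
PAGE_NUMBER = re.compile(r"^\d{1,4}$")
BACKREF = re.compile(r"\s*↩\d*")


def reference_lines(text):
    """Return cleaned lines after the last reference-section header, or []."""
    lines = [line.strip() for line in text.splitlines()]
    starts = [i for i, line in enumerate(lines) if line.lower() in SECTION_HEADERS]
    if not starts:
        return []
    cleaned = []
    for line in lines[starts[-1] + 1:]:
        if not line or PAGE_MARKER.match(line) or PAGE_NUMBER.match(line):
            continue  # blank line, page marker, or standalone page number
        cleaned.append(BACKREF.sub("", line))
    return cleaned
```

Note that a year printed alone on its own line would also be dropped by the page-number rule, so unusual layouts may need a stricter pattern.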

Step 3: Enrich Metadata via WebFetch

This is a critical new step. After parsing, the agent must enrich each reference with complete metadata by visiting its URL/DOI link. This step is necessary because:

  • PDF-extracted text often has truncated titles (ending with ...)
  • Author names are frequently absent in numbered/URL-only reference formats
  • Journal names, volume, pages, and correct DOIs are needed for proper BibTeX

Recommended approach: Use the Task tool to launch 2-4 parallel subagents, each enriching a batch of references via WebFetch. For each reference, visit the URL and extract: full_title, authors, journal, year, doi, volume, pages.

Save the enriched metadata as JSON files (e.g., metadata_1_16.json, metadata_17_32.json, etc.) for use in BibTeX generation.
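A small sketch of how the references might be split into batches with matching filenames. The batch size of 16 is inferred from the example filenames, and make_batches is a hypothetical helper.

```python
def make_batches(references, batch_size=16):
    """Split references into batches and name each output file by index range."""
    batches = []
    for start in range(0, len(references), batch_size):
        chunk = references[start:start + batch_size]
        first, last = chunk[0]["index"], chunk[-1]["index"]
        batches.append((f"metadata_{first}_{last}.json", chunk))
    return batches
```

Each (filename, chunk) pair can then be handed to one subagent, which writes its enriched results to that file.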

Important: When DOIs from PDF text are broken (contain spaces), reconstruct them before fetching. For example, doi.org/ 10.1016/j.bios.2025.117428 should become doi.org/10.1016/j.bios.2025.117428.

Step 4: Generate BibTeX

Run: python scripts/generate_bib.py

Now accepts enriched metadata JSON (from Step 3) with complete fields: full_title, authors, journal, year, doi, volume, pages. Generates proper BibTeX entries with:

  • Entry type: article for journal papers, unpublished for preprints (SSRN, bioRxiv, arXiv)
  • Citation key: lastname + year + _index (e.g., li2021_1)
  • All available fields: author, title, journal, year, doi, volume, pages, eprint
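The entry rules above can be sketched as follows. This assumes authors arrive as a comma-separated string of "First Last" names and that preprints are recognizable from the venue name; both are assumptions, and the functions are illustrative rather than the code in generate_bib.py.

```python
import re

PREPRINT_HINTS = ("ssrn", "biorxiv", "arxiv")


def citation_key(authors, year, index):
    """Build lastname + year + _index, e.g. 'li2021_1'."""
    first_author = (authors or "").split(",")[0].strip()
    words = first_author.split() or ["unknown"]
    lastname = re.sub(r"[^a-z]", "", words[-1].lower()) or "unknown"
    return f"{lastname}{year}_{index}"


def bib_entry(meta, index):
    """Render one BibTeX entry from an enriched-metadata dict."""
    venue = meta.get("journal") or ""
    is_preprint = any(hint in venue.lower() for hint in PREPRINT_HINTS)
    names = [a.strip() for a in (meta.get("authors") or "").split(",") if a.strip()]
    fields = {
        "author": " and ".join(names),
        "title": meta.get("full_title"),
        # 'unpublished' has no journal field in standard BibTeX, so use note
        "note" if is_preprint else "journal": venue,
        "year": meta.get("year"),
        "doi": meta.get("doi"),
        "volume": meta.get("volume"),
        "pages": meta.get("pages"),
    }
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items() if v)
    entry_type = "unpublished" if is_preprint else "article"
    key = citation_key(meta.get("authors"), meta.get("year", ""), index)
    return f"@{entry_type}{{{key},\n{body}\n}}"
```

Empty fields are skipped, so an entry built from partial metadata still yields valid BibTeX.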

Fallback mode: If enriched metadata is unavailable, the script still works with the raw parsed references JSON (Step 2 output) using text-parsing heuristics, but quality will be lower.

Step 5: Download arXiv Papers

Run: python scripts/download_arxiv.py

Downloads PDFs from https://arxiv.org/pdf/{id} for every reference with an arXiv ID. Features:

  • Filename format: {index:02d}_{Author}_{Year}_{Title}.pdf
  • Skips already-downloaded files
  • 3-second rate limit between requests
  • PDF header validation: Checks that downloaded files start with %PDF- magic bytes; deletes invalid files (HTML error pages, redirects)
  • Minimum size check: >5KB (previously 1KB — too permissive)
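For illustration, the URL and filename conventions above might be implemented like this. The sanitizing rule and the title length cap are assumptions; only the overall pattern comes from the list.

```python
import re


def arxiv_pdf_url(arxiv_id):
    """Return the PDF URL for an arXiv ID such as '1706.03762'."""
    return f"https://arxiv.org/pdf/{arxiv_id}"


def pdf_filename(index, author, year, title, max_title_len=60):
    """Build '{index:02d}_{Author}_{Year}_{Title}.pdf' from filesystem-safe parts."""
    def safe(text):
        # collapse anything that is not a letter or digit into one underscore
        return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_")

    return f"{index:02d}_{safe(author)}_{year}_{safe(title)[:max_title_len]}.pdf"
```

Checking whether this filename already exists before requesting the URL gives the skip-already-downloaded behavior.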

Step 6: Multi-Source OA Download

Run: python scripts/search_non_arxiv.py

For references without arXiv IDs, attempts download from multiple Open Access sources in priority order:

  1. Direct PDF URL: If the reference URL ends in .pdf, download directly
  2. bioRxiv / medRxiv: For DOIs starting with 10.1101/, construct https://www.biorxiv.org/content/{doi}v1.full.pdf (also try medRxiv)
  3. PMC (PubMed Central): For references with PMC IDs, try:
    • https://europepmc.org/backend/ptpmcrender.fcgi?accid={pmc_id}&blobtype=pdf (most reliable)
    • https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/pdf/
  4. MDPI (all open access): For DOIs matching 10.3390/, construct https://mdpi-res.com/d_attachment/{journal}/{journal}-{vol}-{article_num}/article_deploy/{journal}-{vol}-{article_num}.pdf; fallback: https://www.mdpi.com/{ISSN}/{vol}/{issue}/{article}/pdf
  5. Nature / Springer OA: For Nature-family journals (nature.com), try https://www.nature.com/articles/{article_id}.pdf; for Springer/BMC, try https://link.springer.com/content/pdf/{doi}.pdf
  6. PLoS: For 10.1371/ DOIs, use https://journals.plos.org/plosone/article/file?id={doi}&type=printable
  7. RSC (Royal Society of Chemistry): For 10.1039/ DOIs, try https://pubs.rsc.org/en/content/articlepdf/{year}/{journal_abbr}/{article_id}
  8. Wiley OA: Try https://onlinelibrary.wiley.com/doi/pdfdirect/{doi}
  9. Elsevier OA: For ScienceDirect URLs with PII, try https://www.sciencedirect.com/science/article/pii/{pii}/pdfft
  10. arXiv title search (fallback): Search arXiv API by title words
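The priority order can be expressed as a generator of candidate URLs. The sketch below covers only the sources whose URL needs nothing beyond a DOI, URL, or PMC ID. The pmc_id field is hypothetical, and the medRxiv pattern is assumed to mirror the bioRxiv one.

```python
def candidate_urls(ref):
    """Yield (source, url) pairs to try, highest priority first (partial sketch)."""
    url = ref.get("url") or ""
    doi = ref.get("doi") or ""
    pmc_id = ref.get("pmc_id")  # hypothetical field, e.g. 'PMC1234567'

    if url.lower().endswith(".pdf"):
        yield "direct", url
    if doi.startswith("10.1101/"):
        yield "biorxiv", f"https://www.biorxiv.org/content/{doi}v1.full.pdf"
        # assumed pattern, by analogy with bioRxiv
        yield "medrxiv", f"https://www.medrxiv.org/content/{doi}v1.full.pdf"
    if pmc_id:
        yield "europepmc", (
            "https://europepmc.org/backend/ptpmcrender.fcgi"
            f"?accid={pmc_id}&blobtype=pdf"
        )
        yield "pmc", f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/pdf/"
    if doi.startswith("10.1371/"):
        yield "plos", (
            f"https://journals.plos.org/plosone/article/file?id={doi}&type=printable"
        )
    # MDPI, Nature, RSC, Wiley and Elsevier need extra metadata and are omitted
```

A download loop would try each pair in turn and stop at the first response that passes PDF validation.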

PDF validation: ALL downloads are validated by checking the %PDF- magic bytes in the first 5 bytes. Files that are HTML pages, redirects, or error pages are automatically deleted. Minimum valid size: 5KB.
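A minimal sketch of this check, using only the standard library. The helper names are illustrative.

```python
import os

MIN_PDF_BYTES = 5 * 1024  # files of 5 KB or less are rejected


def is_valid_pdf(path):
    """True if the file is larger than 5 KB and starts with the %PDF- magic bytes."""
    if os.path.getsize(path) <= MIN_PDF_BYTES:
        return False
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


def keep_or_delete(path):
    """Delete the file when validation fails; return whether it was kept."""
    if is_valid_pdf(path):
        return True
    os.remove(path)
    return False
```

Calling keep_or_delete() right after each download means an HTML error page never survives under a .pdf name.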

Rate limiting: 2-second delay between attempts to the same host; 3-second delay for arXiv.
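Per-host delays like these can be tracked with a small helper. This is a sketch under the stated delays; the class name and the injectable clock and sleep parameters (which make it testable without waiting) are assumptions.

```python
import time
from urllib.parse import urlparse


class HostRateLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, default_delay=2.0, overrides=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.default_delay = default_delay
        self.overrides = overrides or {"arxiv.org": 3.0}
        self._clock, self._sleep = clock, sleep
        self._last = {}  # host -> time of the most recent request

    def _delay_for(self, host):
        for suffix, delay in self.overrides.items():
            if host == suffix or host.endswith("." + suffix):
                return delay
        return self.default_delay

    def wait(self, url):
        """Sleep if needed, record the request time, and return seconds slept."""
        host = urlparse(url).netloc.lower()
        delay = self._delay_for(host)
        now = self._clock()
        pause = max(0.0, self._last.get(host, now - delay) + delay - now)
        if pause:
            self._sleep(pause)
        self._last[host] = now + pause
        return pause
```

Call wait(url) immediately before each request; the first request to a host is never delayed.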

Outputs a JSON file of still-missing references for the Word document step.

Step 7: Generate Word Document for Missing References

Run: node scripts/create_missing_doc.js

Important: Before running, the agent must prepare the missing references JSON with the correct format:

[
  {
    "authors": "First Author, Second Author, ...",
    "title": "Full Paper Title",
    "venue": "Journal Name (Year), Volume: Pages",
    "url": "https://doi.org/10.xxxx/...",
    "note": "DOI Link"
  }
]

The agent should mark open-access papers that failed to download (e.g., due to 403 anti-bot blocks) with "note": "Open Access - DOI" so the user knows they can likely access them manually.
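One way to prepare that JSON from the enriched metadata, shown as a sketch. The input field names follow the Step 3 metadata, and both helpers are hypothetical.

```python
import json


def missing_entry(meta, open_access=False):
    """Map enriched metadata to the row format expected by the Word-doc step."""
    venue = f"{meta.get('journal', '')} ({meta.get('year', '')})"
    if meta.get("volume") and meta.get("pages"):
        venue += f", {meta['volume']}: {meta['pages']}"
    doi = meta.get("doi")
    return {
        "authors": meta.get("authors", ""),
        "title": meta.get("full_title", ""),
        "venue": venue,
        "url": f"https://doi.org/{doi}" if doi else meta.get("url", ""),
        "note": "Open Access - DOI" if open_access else "DOI Link",
    }


def write_missing(path, entries):
    """Write the list of rows as UTF-8 JSON for create_missing_doc.js to read."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(entries, handle, ensure_ascii=False, indent=2)
```

Passing open_access=True for papers that were blocked with a 403 produces the "Open Access - DOI" note described above.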

The script creates a professional Word document in landscape orientation with a table listing: #, Authors, Title, Venue, and Download Page (with clickable hyperlinks). Uses alternating row colors and header styling.

Step 8: Verify Results

After completion, verify:

  1. .bib file exists and has correct entry count matching total references
  2. Count PDFs in the download folder; verify each is a valid PDF (%PDF- header)
  3. .docx file exists for unfound references
  4. Check for duplicate references (same DOI appearing under different indices)
  5. Report success rate (downloaded / total references)
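Checks 4 and 5 are easy to automate. The sketch below assumes each reference is a dict with index and doi fields, as in the Step 2 output; the function names are illustrative.

```python
from collections import defaultdict


def find_duplicate_dois(references):
    """Map each DOI cited more than once to the list of indices citing it."""
    seen = defaultdict(list)
    for ref in references:
        doi = (ref.get("doi") or "").strip().lower()  # DOIs are case-insensitive
        if doi:
            seen[doi].append(ref["index"])
    return {doi: indices for doi, indices in seen.items() if len(indices) > 1}


def success_rate(downloaded, total):
    """Return downloaded / total, or 0.0 when there are no references."""
    return downloaded / total if total else 0.0
```

References without a DOI are ignored by the duplicate check, so they still need a manual look.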

Use a Task subagent for verification to keep the main context clean.

Key Patterns & Pitfalls

Broken URLs from PDF extraction: This is the #1 issue. PDF text extraction frequently splits URLs across lines, inserting spaces (e.g., https:// doi.org/ or https://www. nature.com/). ALWAYS apply URL repair before extracting DOIs or downloading. The regex re.sub(r'(https?://)\s+', r'\1', text) handles the most common case, but also check for spaces after www. and within path segments.

Anti-bot protections (HTTP 403): Many publishers (MDPI, Elsevier, Wiley, ACS) aggressively block automated downloads even for open-access papers. Common mitigations:

  • Use a realistic User-Agent header
  • For MDPI, try the mdpi-res.com CDN domain instead of www.mdpi.com
  • For Elsevier, the /pdfft endpoint sometimes works for OA articles
  • Accept that some OA papers will need manual download; mark them clearly in the Word doc
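The first mitigation can be sketched with the standard library. The User-Agent string below is only an example value, and the scripts may use a different HTTP client.

```python
import urllib.request

# Example browser-like value; any realistic User-Agent string would do.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url):
    """Create a request that sends a browser-like User-Agent and prefers PDFs."""
    headers = {"User-Agent": USER_AGENT, "Accept": "application/pdf,*/*"}
    return urllib.request.Request(url, headers=headers)


def fetch(url, timeout=30):
    """Download the response body; raises urllib.error.HTTPError on 403 and similar."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

A 403 surfaces as an HTTPError, which the caller can catch to move on to the next source or to flag the paper for manual download.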

PMC PDF download: The URL format https://pmc.ncbi.nlm.nih.gov/articles/PMC.../pdf/ often returns HTML instead of PDF. Use EuropePMC's backend URL instead: https://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC...&blobtype=pdf.

PDF validation is mandatory: NEVER trust file size alone. Always check that the first 5 bytes of a downloaded file are %PDF-. Many failed downloads produce valid-looking HTML files (e.g., 1.8KB login pages, 403 error pages). Delete any file that fails this check.

Encoding: On Windows consoles that use the GBK code page, printing characters that GBK cannot encode raises UnicodeEncodeError. Use .encode('ascii', 'replace').decode() for console output.
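Wrapped in a helper (the name console_safe is illustrative), the workaround looks like this:

```python
def console_safe(text):
    """Replace every non-ASCII character with '?' so print() cannot fail."""
    return text.encode("ascii", "replace").decode()
```

Apply it only to console messages; filenames and JSON output should keep their original characters.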

Reference parsing: The auto-detection approach (numbered vs. author-first) handles most academic papers. For unusual formats (e.g., Vancouver style [1], or inline-citation styles), the agent should inspect the extracted text and adjust patterns.

arXiv rate limiting: Always include a 3-second delay between downloads. arXiv will block rapid requests.

SSL issues: Some corporate/educational networks require disabling SSL verification. The scripts handle this by default.

Duplicate references: Some papers cite the same work under different indices (e.g., a preprint and its published version). The agent should detect duplicates by DOI in the verification step.

Customization

The agent should adapt paths, filenames, and the missing-reference URL research to each specific paper. The scripts accept command-line arguments rather than hardcoded paths. For papers with unusual reference formats, the agent may need to adjust the parsing heuristics in parse_references.py.

Version History

1 version in total

  • v1.0.0 Initial release (current)
    2026-04-04 22:17 · Security scan: safe
