Reference Harvester

Automated pipeline for extracting, converting, and downloading all references from a research paper PDF.

Workflow Overview

Task Progress:
- [ ] Step 1: Extract full text from PDF
- [ ] Step 2: Parse references into individual entries
- [ ] Step 3: Enrich metadata via WebFetch (authors, journal, DOI)
- [ ] Step 4: Generate BibTeX (.bib) file
- [ ] Step 5: Download arXiv papers
- [ ] Step 6: Multi-source OA download (PMC, bioRxiv, MDPI, Nature, Unpaywall...)
- [ ] Step 7: Generate Word doc for unfound references
- [ ] Step 8: Verify results

Prerequisites

Install Python and Node.js dependencies before running scripts:

pip install pdfplumber
npm install docx

Step 1: Extract Full Text

Run: python scripts/extract_text.py

This uses pdfplumber to extract text from every page of the PDF. The output is saved as a plain text file for subsequent parsing.
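
A minimal sketch of what this step might look like, assuming the script writes a "--- PAGE N ---" marker before each page (the marker format the parser later skips). The function names join_pages and extract_text are hypothetical, not taken from extract_text.py.

```python
from pathlib import Path


def join_pages(page_texts):
    """Join per-page text, prefixing each page with a '--- PAGE N ---' marker."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- PAGE {number} ---")
        parts.append(text or "")  # extract_text() can return None for empty pages
    return "\n".join(parts)


def extract_text(pdf_path, out_path):
    """Extract text from every page of pdf_path and save it to out_path."""
    import pdfplumber  # imported lazily so join_pages works without it installed

    with pdfplumber.open(pdf_path) as pdf:
        texts = [page.extract_text() for page in pdf.pages]
    Path(out_path).write_text(join_pages(texts), encoding="utf-8")
```

Keeping the page markers in the output makes it easier to trace a badly parsed reference back to the page it came from.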

Step 2: Parse References

Run: python scripts/parse_references.py

Locates the "References" section and parses individual entries. Supports multiple formats via auto-detection:

Format A — Numbered references (e.g., 1. Title, URL):

Detected when the first non-header line after the section marker matches ^\d+\.\s.
Each entry starts at a line matching ^\d+\.\s+.
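
A minimal sketch of Format A detection and splitting, assuming the reference section has already been isolated as a list of lines. The names is_numbered_format and split_numbered are hypothetical and not taken from parse_references.py.

```python
import re

NUMBERED_START = re.compile(r"^\d+\.\s+")


def is_numbered_format(lines):
    """Return True if the first non-blank line looks like '1. ...'."""
    for line in lines:
        if line.strip():
            return bool(NUMBERED_START.match(line.strip()))
    return False


def split_numbered(lines):
    """Group lines into entries; a new entry begins at each numbered line."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if NUMBERED_START.match(line):
            entries.append(NUMBERED_START.sub("", line))
        elif entries:
            entries[-1] += " " + line  # continuation of the previous entry
    return entries
```

One limitation of this sketch: a wrapped continuation line that happens to begin with a number and a period (for example a year) would be treated as a new entry.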

Format B — Author-first references (traditional academic style):

A new reference starts when the current line matches an author-name pattern (Firstname L...) AND the previously accumulated text looks complete (ends with a year or URL).
A second pass splits any accidentally merged references by detecting "year. AuthorName" patterns mid-string.
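
The second pass could look roughly like the following. The regular expression is an assumption (a four-digit year, a period, then a capitalized word followed by another capital letter), and split_merged is a hypothetical name; the actual heuristics in parse_references.py may differ.

```python
import re

# Split point: whitespace that follows "YYYY." and precedes something
# shaped like "Lastname X" (assumed author-name pattern).
MERGE_BOUNDARY = re.compile(r"(?<=\d{4}\.)\s+(?=[A-Z][a-z]+\s[A-Z])")


def split_merged(entry):
    """Split one accumulated entry wherever a new reference seems to start."""
    return [part.strip() for part in MERGE_BOUNDARY.split(entry) if part.strip()]
```

This pattern will also split when a title beginning with two capitalized words directly follows a year, so the result is best treated as a candidate split to be checked against the raw text.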

Critical (broken-URL repair): PDF extraction often inserts spaces inside URLs (e.g., https:// doi.org/...). The parser applies fix_broken_urls() to all extracted text, removing the whitespace that follows https:// before any further processing. This is essential for correct DOI and URL extraction.
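
A minimal sketch of such a repair function. The first substitution is the regex quoted under Key Patterns & Pitfalls; the www. and doi.org/ substitutions are assumed extensions, and the real fix_broken_urls() may cover more or fewer cases.

```python
import re


def fix_broken_urls(text):
    """Remove whitespace that PDF extraction inserted inside URLs."""
    text = re.sub(r"(https?://)\s+", r"\1", text)       # "https:// doi.org"
    text = re.sub(r"(https?://www\.)\s+", r"\1", text)  # "https://www. nature.com"
    text = re.sub(r"(doi\.org/)\s+", r"\1", text)       # "doi.org/ 10.1016/..."
    return text
```

Spaces inside later path segments cannot be repaired this safely, because the text after a URL is often the start of the next sentence.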

Output: a JSON array of objects with fields: index, title, url, doi, arxiv_id, raw.

The parser also handles:

Stripping footnote back-references (↩, ↩2, etc.)
Skipping page markers (--- PAGE N ---), standalone page numbers, and common PDF headers
Recognizing Chinese/multilingual reference section headers (参考文献, 参考资料, References, Bibliography)
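
The cleanup rules above can be sketched as follows. The patterns are assumptions based on the descriptions in this list (for instance, a standalone page number is taken to be a line of one to four digits), and the names are hypothetical.

```python
import re

PAGE_MARKER = re.compile(r"^--- PAGE \d+ ---$")
PAGE_NUMBER = re.compile(r"^\d{1,4}$")  # assumed shape of a standalone page number
BACK_REF = re.compile(r"↩\d*")          # footnote back-references such as ↩ or ↩2
SECTION_HEADERS = ("参考文献", "参考资料", "references", "bibliography")


def is_section_header(line):
    """Return True if the line is one of the known reference-section headers."""
    return line.strip().lower() in SECTION_HEADERS


def clean_lines(lines):
    """Drop page markers and page numbers; strip footnote back-references."""
    cleaned = []
    for line in lines:
        line = BACK_REF.sub("", line).strip()
        if not line or PAGE_MARKER.match(line) or PAGE_NUMBER.match(line):
            continue
        cleaned.append(line)
    return cleaned
```

Note that a year wrapped onto its own line would also be dropped by the page-number rule in this sketch.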

Step 3: Enrich Metadata via WebFetch

This is a critical step. After parsing, the agent must enrich each reference with complete metadata by visiting its URL/DOI link. This step is necessary because:

PDF-extracted text often has truncated titles (ending with ...)
Author names are frequently absent in numbered/URL-only reference formats
Journal names, volume, pages, and correct DOIs are needed for proper BibTeX

Recommended approach: Use the Task tool to launch 2-4 parallel subagents, each enriching a batch of references via WebFetch. For each reference, visit the URL and extract: full_title, authors, journal, year, doi, volume, pages.

Save the enriched metadata as JSON files (e.g., metadata_1_16.json, metadata_17_32.json, etc.) for use in BibTeX generation.
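
A sketch of how references could be split into batches and how the batch filenames could be derived, assuming a batch size of 16 (inferred from the example filenames) and that each reference carries the index field from Step 2. The function names are hypothetical.

```python
import json
from pathlib import Path


def make_batches(references, batch_size=16):
    """Split the reference list into consecutive batches for parallel subagents."""
    return [references[i:i + batch_size] for i in range(0, len(references), batch_size)]


def batch_filename(batch):
    """Name a batch after its first and last reference index."""
    return f"metadata_{batch[0]['index']}_{batch[-1]['index']}.json"


def save_batch(batch, out_dir):
    """Write one enriched batch as UTF-8 JSON and return the file path."""
    path = Path(out_dir) / batch_filename(batch)
    path.write_text(json.dumps(batch, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```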

Important: When DOIs from PDF text are broken (contain spaces), reconstruct them before fetching. For example, doi.org/ 10.1016/j.bios.2025.117428 should become doi.org/10.1016/j.bios.2025.117428.

Step 4: Generate BibTeX

Run: python scripts/generate_bib.py

Accepts enriched metadata JSON (from Step 3) with complete fields: full_title, authors, journal, year, doi, volume, pages. Generates proper BibTeX entries with:

Entry type: article for journal papers, unpublished for preprints (SSRN, bioRxiv, arXiv)
Citation key: lastname + year + _index (e.g., li2021_1)
All available fields: author, title, journal, year, doi, volume, pages, eprint
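
A minimal sketch of entry generation under stated assumptions: authors is a comma-separated string of "First Last" names, preprints are recognized by the journal name containing arXiv, bioRxiv, or SSRN, and arxiv_id has been merged in from the Step 2 output. The function names are hypothetical and generate_bib.py may work differently.

```python
import re

PREPRINT_HINTS = ("arxiv", "biorxiv", "ssrn")
FIELD_MAP = (  # BibTeX field name, key in the metadata dict
    ("title", "full_title"), ("journal", "journal"), ("year", "year"),
    ("doi", "doi"), ("volume", "volume"), ("pages", "pages"),
    ("eprint", "arxiv_id"),
)


def citation_key(authors, year, index):
    """Build lastname + year + _index, e.g. li2021_1."""
    first_author = authors.split(",")[0].strip() if authors else "unknown"
    lastname = first_author.split()[-1] if first_author else "unknown"
    lastname = re.sub(r"[^A-Za-z]", "", lastname).lower() or "unknown"
    return f"{lastname}{year}_{index}"


def bibtex_entry(meta, index):
    """Render one metadata dict as a BibTeX entry, skipping empty fields."""
    authors = meta.get("authors") or ""
    journal = meta.get("journal") or ""
    is_preprint = any(hint in journal.lower() for hint in PREPRINT_HINTS)
    entry_type = "unpublished" if is_preprint else "article"
    key = citation_key(authors, meta.get("year", ""), index)
    names = [a.strip() for a in authors.split(",") if a.strip()]
    fields = [("author", " and ".join(names))]
    fields += [(bib, meta.get(src)) for bib, src in FIELD_MAP]
    lines = [f"@{entry_type}{{{key},"]
    lines += [f"  {name} = {{{value}}}," for name, value in fields if value]
    lines.append("}")
    return "\n".join(lines)
```

The sketch does not escape special LaTeX characters in titles; a production script would need to handle characters such as & and %.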

Fallback mode: If enriched metadata is unavailable, the script still works with the raw parsed references JSON (Step 2 output) using text-parsing heuristics, but quality will be lower.

Step 5: Download arXiv Papers

Run: python scripts/download_arxiv.py

Downloads PDFs from https://arxiv.org/pdf/{id} for every reference with an arXiv ID. Features:

Filename format: {index:02d}_{Author}_{Year}_{Title}.pdf
Skips already-downloaded files
3-second rate limit between requests
PDF header validation: Checks that downloaded files start with %PDF- magic bytes; deletes invalid files (HTML error pages, redirects)
Minimum size check: >5KB (an earlier 1KB threshold proved too permissive)
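
The filename format and download URL can be sketched as below. The sanitization rules (dropping punctuation, replacing whitespace with underscores, truncating to 60 characters) are assumptions, and the function names are hypothetical.

```python
import re


def safe_part(text, max_len=60):
    """Make a string safe for use inside a filename."""
    text = re.sub(r"[^\w\s-]", "", text)        # drop punctuation such as : / ?
    text = re.sub(r"\s+", "_", text.strip())    # whitespace becomes underscores
    return text[:max_len]


def arxiv_filename(index, author, year, title):
    """Build {index:02d}_{Author}_{Year}_{Title}.pdf."""
    return f"{index:02d}_{safe_part(author)}_{year}_{safe_part(title)}.pdf"


def arxiv_pdf_url(arxiv_id):
    """Return the PDF URL for an arXiv identifier."""
    return f"https://arxiv.org/pdf/{arxiv_id}"
```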

Step 6: Multi-Source OA Download

Run: python scripts/search_non_arxiv.py

For references without arXiv IDs, attempts download from multiple Open Access sources in priority order:

Direct PDF URL: If the reference URL ends in .pdf, download directly
bioRxiv / medRxiv: For DOIs starting with 10.1101/, construct https://www.biorxiv.org/content/{doi}v1.full.pdf (also try medRxiv)
PMC (PubMed Central): For references with PMC IDs, try https://europepmc.org/backend/ptpmcrender.fcgi?accid={pmc_id}&blobtype=pdf first (most reliable), then https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/pdf/
MDPI (all open access): For DOIs matching 10.3390/, construct https://mdpi-res.com/d_attachment/{journal}/{journal}-{vol}-{article_num}/article_deploy/{journal}-{vol}-{article_num}.pdf; fallback: https://www.mdpi.com/{ISSN}/{vol}/{issue}/{article}/pdf
Nature / Springer OA: For Nature-family journals (nature.com), try https://www.nature.com/articles/{article_id}.pdf; for Springer/BMC, try https://link.springer.com/content/pdf/{doi}.pdf
PLoS: For 10.1371/ DOIs, use https://journals.plos.org/plosone/article/file?id={doi}&type=printable
RSC (Royal Society of Chemistry): For 10.1039/ DOIs, try https://pubs.rsc.org/en/content/articlepdf/{year}/{journal_abbr}/{article_id}
Wiley OA: Try https://onlinelibrary.wiley.com/doi/pdfdirect/{doi}
Elsevier OA: For ScienceDirect URLs with PII, try https://www.sciencedirect.com/science/article/pii/{pii}/pdfft
arXiv title search (fallback): Search arXiv API by title words
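
A sketch of how candidate URLs could be assembled in priority order for the sources that depend only on the URL, DOI prefix, or PMC ID. The pmc_id field is hypothetical (it is not part of the Step 2 output), the medRxiv URL is assumed to mirror the bioRxiv pattern, and the remaining sources are omitted because they need publisher-specific identifiers.

```python
def candidate_pdf_urls(ref):
    """Return PDF URLs to try, in priority order, for one reference dict."""
    url = ref.get("url") or ""
    doi = ref.get("doi") or ""
    pmc_id = ref.get("pmc_id") or ""  # hypothetical field, e.g. "PMC1234567"
    candidates = []

    if url.lower().endswith(".pdf"):
        candidates.append(url)
    if doi.startswith("10.1101/"):
        candidates.append(f"https://www.biorxiv.org/content/{doi}v1.full.pdf")
        # Assumed: medRxiv uses the same URL layout as bioRxiv.
        candidates.append(f"https://www.medrxiv.org/content/{doi}v1.full.pdf")
    if pmc_id:
        candidates.append(
            f"https://europepmc.org/backend/ptpmcrender.fcgi?accid={pmc_id}&blobtype=pdf"
        )
        candidates.append(f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/pdf/")
    if doi.startswith("10.1371/"):
        candidates.append(
            f"https://journals.plos.org/plosone/article/file?id={doi}&type=printable"
        )
    return candidates
```

Each candidate would then be fetched in turn, stopping at the first response that passes PDF validation.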

PDF validation: ALL downloads are validated by checking the %PDF- magic bytes in the first 5 bytes. Files that are HTML pages, redirects, or error pages are automatically deleted. Minimum valid size: 5KB.
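
A minimal sketch of this validation, assuming "5KB" means 5 × 1024 bytes. The function names are hypothetical.

```python
import os

PDF_MAGIC = b"%PDF-"
MIN_PDF_BYTES = 5 * 1024  # assumed interpretation of the 5KB minimum


def is_valid_pdf(path):
    """Check both the minimum size and the %PDF- magic bytes."""
    if os.path.getsize(path) <= MIN_PDF_BYTES:
        return False
    with open(path, "rb") as fh:
        return fh.read(5) == PDF_MAGIC


def keep_or_delete(path):
    """Delete the file if it is not a valid PDF; return True if it was kept."""
    if is_valid_pdf(path):
        return True
    os.remove(path)
    return False
```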

Rate limiting: 2-second delay between attempts to the same host; 3-second delay for arXiv.
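
One way to implement per-host delays is a small limiter that remembers the last request time for each host. This is a sketch under the delays stated above; HostRateLimiter is a hypothetical class, and the clock and sleep parameters exist only to make it testable.

```python
import time
from urllib.parse import urlparse


class HostRateLimiter:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, default_delay=2.0, arxiv_delay=3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.default_delay = default_delay
        self.arxiv_delay = arxiv_delay
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # host -> time of the previous request

    def delay_for(self, host):
        is_arxiv = host == "arxiv.org" or host.endswith(".arxiv.org")
        return self.arxiv_delay if is_arxiv else self.default_delay

    def wait(self, url):
        """Sleep if needed before requesting url; return the seconds slept."""
        host = urlparse(url).netloc.lower()
        waited = 0.0
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.delay_for(host) - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self.last_request[host] = self.clock()
        return waited
```

Calling limiter.wait(url) immediately before each download keeps the delay logic out of the download code itself.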

Outputs a JSON file of still-missing references for the Word document step.

Step 7: Generate Word Document for Missing References

Run: node scripts/create_missing_doc.js

Important: Before running, the agent must prepare the missing references JSON with the correct format:

[
  {
    "authors": "First Author, Second Author, ...",
    "title": "Full Paper Title",
    "venue": "Journal Name (Year), Volume: Pages",
    "url": "https://doi.org/10.xxxx/...",
    "note": "DOI Link"
  }
]

The agent should mark open-access papers that failed to download (e.g., due to 403 anti-bot blocks) with "note": "Open Access - DOI" so the user knows they can likely access them manually.
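
A sketch of how one entry in that JSON could be built from the enriched metadata of Step 3. The function name missing_entry and the open_access flag are hypothetical; the venue string follows the "Journal Name (Year), Volume: Pages" layout shown above.

```python
def missing_entry(meta, open_access=False):
    """Convert one enriched-metadata dict into the missing-references format."""
    venue = f"{meta.get('journal', '')} ({meta.get('year', '')})"
    if meta.get("volume"):
        venue += f", {meta['volume']}"
        if meta.get("pages"):
            venue += f": {meta['pages']}"
    doi = meta.get("doi") or ""
    return {
        "authors": meta.get("authors", ""),
        "title": meta.get("full_title", ""),
        "venue": venue,
        # Prefer the DOI resolver link; fall back to the original URL.
        "url": f"https://doi.org/{doi}" if doi else meta.get("url", ""),
        "note": "Open Access - DOI" if open_access else "DOI Link",
    }
```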

The script creates a professional Word document in landscape orientation with a table listing: #, Authors, Title, Venue, and Download Page (with clickable hyperlinks). Uses alternating row colors and header styling.

Step 8: Verify Results

After completion, verify:

.bib file exists and has correct entry count matching total references
Count PDFs in the download folder; verify each is a valid PDF (%PDF- header)
.docx file exists for unfound references
Check for duplicate references (same DOI appearing under different indices)
Report success rate (downloaded / total references)
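
The duplicate check and success rate can be sketched as follows, assuming each reference is a dict with the index and doi fields from Step 2. DOIs are compared case-insensitively; the function names are hypothetical.

```python
from collections import defaultdict


def find_duplicate_dois(references):
    """Return {doi: [indices]} for every DOI cited under more than one index."""
    by_doi = defaultdict(list)
    for ref in references:
        doi = (ref.get("doi") or "").strip().lower()
        if doi:
            by_doi[doi].append(ref["index"])
    return {doi: indices for doi, indices in by_doi.items() if len(indices) > 1}


def success_rate(downloaded, total):
    """Fraction of references downloaded; 0.0 when there are no references."""
    return downloaded / total if total else 0.0
```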

Use a Task subagent for verification to keep the main context clean.

Key Patterns & Pitfalls

Broken URLs from PDF extraction: This is the #1 issue. PDF text extraction frequently splits URLs across lines, inserting spaces (e.g., https:// doi.org/ or https://www. nature.com/). ALWAYS apply URL repair before extracting DOIs or downloading. The regex re.sub(r'(https?://)\s+', r'\1', text) handles the most common case, but also check for spaces after www. and within path segments.

Anti-bot protections (HTTP 403): Many publishers (MDPI, Elsevier, Wiley, ACS) aggressively block automated downloads even for open-access papers. Common mitigations:

Use a realistic User-Agent header
For MDPI, try the mdpi-res.com CDN domain instead of www.mdpi.com
For Elsevier, the /pdfft endpoint sometimes works for OA articles
Accept that some OA papers will need manual download; mark them clearly in the Word doc

PMC PDF download: The URL format https://pmc.ncbi.nlm.nih.gov/articles/PMC.../pdf/ often returns HTML instead of PDF. Use EuropePMC's backend URL instead: https://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC...&blobtype=pdf.

PDF validation is mandatory: NEVER trust file size alone. Always check that the first 5 bytes of a downloaded file are %PDF-. Many failed downloads produce valid-looking HTML files (e.g., 1.8KB login pages, 403 error pages). Delete any file that fails this check.

Encoding: On Windows with GBK console, printing Unicode characters fails. Use .encode('ascii', 'replace').decode() for console output.
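
A small helper wrapping that expression; console_safe is a hypothetical name. Each character that cannot be represented in ASCII is replaced with a question mark.

```python
def console_safe(text):
    """Return text with non-ASCII characters replaced so print() cannot fail."""
    return text.encode("ascii", "replace").decode()


print(console_safe("Downloaded: 参考文献.pdf"))
```

Only console output should go through this helper; files should still be written as UTF-8 so that titles and author names are preserved.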

Reference parsing: The auto-detection approach (numbered vs. author-first) handles most academic papers. For unusual formats (e.g., Vancouver style [1], or inline-citation styles), the agent should inspect the extracted text and adjust patterns.

arXiv rate limiting: Always include a 3-second delay between downloads. arXiv will block rapid requests.

SSL issues: Some corporate/educational networks require disabling SSL verification. The scripts handle this by default.

Duplicate references: Some papers cite the same work under different indices (e.g., a preprint and its published version). The agent should detect duplicates by DOI in the verification step.

Customization

The agent should adapt paths, filenames, and the missing-reference URL research to each specific paper. The scripts accept command-line arguments rather than hardcoded paths. For papers with unusual reference formats, the agent may need to adjust the parsing heuristics in parse_references.py.
