Automated pipeline for extracting, converting, and downloading all references from a research paper PDF.
Task Progress:
- [ ] Step 1: Extract full text from PDF
- [ ] Step 2: Parse references into individual entries
- [ ] Step 3: Enrich metadata via WebFetch (authors, journal, DOI)
- [ ] Step 4: Generate BibTeX (.bib) file
- [ ] Step 5: Download arXiv papers
- [ ] Step 6: Multi-source OA download (PMC, bioRxiv, MDPI, Nature, Unpaywall...)
- [ ] Step 7: Generate Word doc for unfound references
- [ ] Step 8: Verify results
Install Python and Node.js dependencies before running scripts:
pip install pdfplumber
npm install docx
Run: python scripts/extract_text.py
This uses pdfplumber to extract text from every page of the PDF. The output is saved as a plain text file for subsequent parsing.
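A minimal sketch of what this extraction step might look like, assuming the script tags each page with a `--- PAGE N ---` marker. The function names `join_pages` and `extract_pdf_text` are hypothetical and are not taken from `extract_text.py`.

```python
from pathlib import Path


def join_pages(page_texts):
    """Join per-page text, tagging each page with a '--- PAGE N ---' marker."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- PAGE {number} ---")
        parts.append(text or "")  # extract_text() may yield nothing for an empty page
    return "\n".join(parts)


def extract_pdf_text(pdf_path, out_path):
    """Extract text from every page of pdf_path and save it as plain text."""
    import pdfplumber  # imported lazily so join_pages can be used without it

    with pdfplumber.open(pdf_path) as pdf:
        text = join_pages(page.extract_text() for page in pdf.pages)
    Path(out_path).write_text(text, encoding="utf-8")
    return text
```

Keeping the page marker explicit lets the later parsing step strip it reliably instead of guessing where pages end.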
Run: python scripts/parse_references.py
Locates the "References" section and parses individual entries. Supports multiple formats via auto-detection:
Format A: Numbered references (e.g., 1. Title, URL):
- Entry-start patterns: ^\d+\.\s and ^\d+\.\s+
Format B: Author-first references (traditional academic style):
- A new entry starts when a line begins with an author-name pattern (Firstname L...) AND the previously accumulated text looks complete (ends with a year or URL).
- Entries are also split on "year. AuthorName" patterns mid-string.
Critical (broken-URL repair): PDF extraction often inserts spaces inside URLs (e.g., https:// doi.org/...). The parser applies fix_broken_urls() to all extracted text, collapsing "https:// " (the scheme followed by whitespace) into "https://" before any further processing. This is essential for correct DOI and URL extraction.
Output: a JSON array of objects with fields: index, title, url, doi, arxiv_id, raw.
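As an illustration of the numbered-reference path, the sketch below splits text on the `^\d+\.\s+` pattern and fills the six output fields. The DOI and arXiv regexes and the function name `parse_numbered` are assumptions made for this example; the real `parse_references.py` may use different heuristics.

```python
import re

ENTRY_START = re.compile(r"^\d+\.\s+", re.MULTILINE)
URL_RE = re.compile(r"https?://\S+")
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>]+')  # assumed DOI shape
ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", re.IGNORECASE)


def parse_numbered(text):
    """Split a numbered reference list into dicts with the documented fields."""
    starts = [m.start() for m in ENTRY_START.finditer(text)]
    entries = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        raw = " ".join(text[start:end].split())  # collapse line wraps
        body = ENTRY_START.sub("", raw, count=1)  # drop the leading "N. "
        url = URL_RE.search(body)
        doi = DOI_RE.search(body)
        arxiv = ARXIV_RE.search(body)
        title = body[: url.start()] if url else body
        entries.append({
            "index": i + 1,  # position in the list, not the printed number
            "title": title.strip(" ,."),
            "url": url.group(0).rstrip(".,;") if url else None,
            "doi": doi.group(0).rstrip(".,;") if doi else None,
            "arxiv_id": arxiv.group(1) if arxiv else None,
            "raw": raw,
        })
    return entries
```

Because any wrapped line that happens to begin with digits and a period also matches the entry-start pattern, the result should be spot-checked against the extracted text.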
The parser also handles:
- Footnote back-reference markers (↩, ↩2, etc.)
- Page separators (--- PAGE N ---), standalone page numbers, and common PDF headers
- Reference-section headings in several languages (参考文献, 参考资料, References, Bibliography)
Metadata enrichment (Step 3) is a critical new step. After parsing, the agent must enrich each reference with complete metadata by visiting its URL/DOI link. This step is necessary because:
- [...] ...)
Recommended approach: Use the Task tool to launch 2-4 parallel subagents, each enriching a batch of references via WebFetch. For each reference, visit the URL and extract: full_title, authors, journal, year, doi, volume, pages.
Save the enriched metadata as JSON files (e.g., metadata_1_16.json, metadata_17_32.json, etc.) for use in BibTeX generation.
Important: When DOIs from PDF text are broken (contain spaces), reconstruct them before fetching. For example, doi.org/ 10.1016/j.bios.2025.117428 should become doi.org/10.1016/j.bios.2025.117428.
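One way to plan the batches and their output filenames is sketched below. The helper `plan_batches` is hypothetical; it only assumes contiguous index ranges and the `metadata_{first}_{last}.json` naming shown above.

```python
import math


def plan_batches(total_refs, workers=4):
    """Split reference indices 1..total_refs into contiguous batches.

    Returns (first_index, last_index, output_filename) tuples, one per subagent.
    """
    if total_refs <= 0:
        return []
    size = math.ceil(total_refs / workers)
    batches = []
    for first in range(1, total_refs + 1, size):
        last = min(first + size - 1, total_refs)
        batches.append((first, last, f"metadata_{first}_{last}.json"))
    return batches
```

For example, 32 references split across 2 workers yields the two files named in the text.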
Run: python scripts/generate_bib.py
Accepts the enriched metadata JSON from Step 3 with complete fields: full_title, authors, journal, year, doi, volume, pages. Generates proper BibTeX entries with:
- Entry types: article for journal papers, unpublished for preprints (SSRN, bioRxiv, arXiv)
- Citation keys: lastname + year + _index (e.g., li2021_1)
Fallback mode: If enriched metadata is unavailable, the script still works with the raw parsed references JSON (Step 2 output) using text-parsing heuristics, but quality will be lower.
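The entry-type and citation-key rules can be sketched as follows. This is an illustrative approximation, not the code in `generate_bib.py`: it assumes `authors` is a comma-separated string of "First Last" names and that a preprint can be recognized from the `journal` field; both function names are hypothetical.

```python
import re

PREPRINT_HINTS = ("ssrn", "biorxiv", "arxiv")


def citation_key(authors, year, index):
    """Build a key of the form lastname + year + _index, e.g. li2021_1."""
    first_author = authors.split(",")[0].strip()
    lastname = first_author.split()[-1] if first_author else "unknown"
    lastname = re.sub(r"[^a-z]", "", lastname.lower()) or "unknown"
    return f"{lastname}{year}_{index}"


def bibtex_entry(meta, index):
    """Render one enriched-metadata dict as a BibTeX entry string."""
    venue = meta.get("journal") or ""
    is_preprint = any(hint in venue.lower() for hint in PREPRINT_HINTS)
    entry_type = "unpublished" if is_preprint else "article"
    authors = meta.get("authors") or ""
    key = citation_key(authors, meta.get("year", ""), index)
    fields = [
        ("title", meta.get("full_title")),
        # BibTeX separates authors with " and "
        ("author", " and ".join(a.strip() for a in authors.split(",") if a.strip())),
        ("note" if is_preprint else "journal", venue),
        ("year", meta.get("year")),
        ("volume", meta.get("volume")),
        ("pages", meta.get("pages")),
        ("doi", meta.get("doi")),
    ]
    lines = [f"@{entry_type}{{{key},"]
    lines += [f"  {name} = {{{value}}}," for name, value in fields if value]
    lines.append("}")
    return "\n".join(lines)
```

Empty fields are skipped rather than written as blank values, so a partially enriched record still produces a valid entry.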
Run: python scripts/download_arxiv.py
Downloads PDFs from https://arxiv.org/pdf/{id} for every reference with an arXiv ID. Features:
- Filenames follow {index:02d}_{Author}_{Year}_{Title}.pdf
- Validates the %PDF- magic bytes; deletes invalid files (HTML error pages, redirects)
Run: python scripts/search_non_arxiv.py
For references without arXiv IDs, attempts download from multiple Open Access sources in priority order:
- Direct PDF link: if the URL ends with .pdf, download directly
- bioRxiv/medRxiv: if the DOI starts with 10.1101/, construct https://www.biorxiv.org/content/{doi}v1.full.pdf (also try medRxiv)
- PMC: https://europepmc.org/backend/ptpmcrender.fcgi?accid={pmc_id}&blobtype=pdf (most reliable); fallback: https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/pdf/
- MDPI: if the DOI starts with 10.3390/, construct https://mdpi-res.com/d_attachment/{journal}/{journal}-{vol}-{article_num}/article_deploy/{journal}-{vol}-{article_num}.pdf; fallback: https://www.mdpi.com/{ISSN}/{vol}/{issue}/{article}/pdf
- Nature/Springer: for Nature (nature.com), try https://www.nature.com/articles/{article_id}.pdf; for Springer/BMC, try https://link.springer.com/content/pdf/{doi}.pdf
- PLOS: for 10.1371/ DOIs, use https://journals.plos.org/plosone/article/file?id={doi}&type=printable
- RSC: for 10.1039/ DOIs, try https://pubs.rsc.org/en/content/articlepdf/{year}/{journal_abbr}/{article_id}
- Wiley: https://onlinelibrary.wiley.com/doi/pdfdirect/{doi}
- Elsevier (ScienceDirect): https://www.sciencedirect.com/science/article/pii/{pii}/pdfft
PDF validation: ALL downloads are validated by checking for the %PDF- magic bytes in the first 5 bytes. Files that are HTML pages, redirects, or error pages are automatically deleted. Minimum valid size: 5KB.
Rate limiting: 2-second delay between attempts to the same host; 3-second delay for arXiv.
Outputs a JSON file of still-missing references for the Word document step.
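The priority-ordered lookup can be pictured as a function that turns one reference into a list of URLs to try. The sketch below covers only a subset of the sources; the `pmc_id` field, the medRxiv URL shape, and the name `candidate_urls` are assumptions for illustration and may differ from `search_non_arxiv.py`.

```python
def candidate_urls(ref):
    """Return download URLs to try, in priority order, for one reference dict."""
    url = ref.get("url") or ""
    doi = ref.get("doi") or ""
    pmc_id = ref.get("pmc_id")  # hypothetical field, e.g. "PMC1234567"
    urls = []
    if url.lower().endswith(".pdf"):
        urls.append(url)  # direct PDF link
    if doi.startswith("10.1101/"):
        urls.append(f"https://www.biorxiv.org/content/{doi}v1.full.pdf")
        # medRxiv URL shape is assumed to mirror bioRxiv
        urls.append(f"https://www.medrxiv.org/content/{doi}v1.full.pdf")
    if pmc_id:
        urls.append(
            f"https://europepmc.org/backend/ptpmcrender.fcgi?accid={pmc_id}&blobtype=pdf"
        )
        urls.append(f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/pdf/")
    if doi.startswith("10.1371/"):
        urls.append(
            f"https://journals.plos.org/plosone/article/file?id={doi}&type=printable"
        )
    return urls
```

A caller would try each URL in turn, validate the result, and record the reference as missing only when the list is exhausted.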
Run: node scripts/create_missing_doc.js
Important: Before running, the agent must prepare the missing references JSON with the correct format:
[
{
"authors": "First Author, Second Author, ...",
"title": "Full Paper Title",
"venue": "Journal Name (Year), Volume: Pages",
"url": "https://doi.org/10.xxxx/...",
"note": "DOI Link"
}
]
The agent should mark open-access papers that failed to download (e.g., due to 403 anti-bot blocks) with "note": "Open Access - DOI" so the user knows they can likely access them manually.
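A small helper can build entries in this format from the enriched metadata. The sketch below assumes the Step 3 field names (full_title, authors, journal, year, volume, pages, doi); the function name `missing_entry` and the `open_access` flag are hypothetical.

```python
def missing_entry(meta, open_access=False):
    """Convert one enriched-metadata dict into a missing-reference record."""
    venue = f"{meta.get('journal', '')} ({meta.get('year', '')})"
    detail = ": ".join(str(p) for p in (meta.get("volume"), meta.get("pages")) if p)
    if detail:
        venue += f", {detail}"
    doi = meta.get("doi")
    return {
        "authors": meta.get("authors", ""),
        "title": meta.get("full_title", ""),
        "venue": venue,
        "url": f"https://doi.org/{doi}" if doi else meta.get("url", ""),
        # flag OA papers that failed to download so the user can fetch them manually
        "note": "Open Access - DOI" if open_access else "DOI Link",
    }
```

The resulting list of records can be written with `json.dump` and passed to the Word-document script.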
The script creates a professional Word document in landscape orientation with a table listing: #, Authors, Title, Venue, and Download Page (with clickable hyperlinks). Uses alternating row colors and header styling.
After completion, verify:
- The .bib file exists and its entry count matches the total number of references
- All downloaded PDFs are valid (%PDF- header)
- The .docx file exists for unfound references
Use a Task subagent for verification to keep the main context clean.
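The .bib and .docx checks could be automated along these lines. This is a rough sketch: `count_bib_entries` and `verify_outputs` are hypothetical names, and the entry count relies on every BibTeX entry starting at the beginning of a line with `@type{`.

```python
import re
from pathlib import Path


def count_bib_entries(bib_text):
    """Count BibTeX entries by their '@type{' openers at line starts."""
    return len(re.findall(r"^@\w+\s*\{", bib_text, flags=re.MULTILINE))


def verify_outputs(bib_path, docx_path, total_refs):
    """Report whether the .bib entry count matches and the .docx exists."""
    bib = Path(bib_path)
    found = count_bib_entries(bib.read_text(encoding="utf-8")) if bib.exists() else 0
    return {
        "bib_entries": found,
        "bib_count_ok": found == total_refs,
        "docx_exists": Path(docx_path).exists(),
    }
```

A mismatch in `bib_count_ok` usually points back to the parsing step rather than to BibTeX generation.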
Broken URLs from PDF extraction: This is the #1 issue. PDF text extraction frequently splits URLs across lines, inserting spaces (e.g., https:// doi.org/ or https://www. nature.com/). ALWAYS apply URL repair before extracting DOIs or downloading. The regex re.sub(r'(https?://)\s+', r'\1', text) handles the most common case, but also check for spaces after www. and within path segments.
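A possible shape for the repair function, extending the regex above to the `www.` and `doi.org/` cases. This is a sketch and not necessarily the body of the real `fix_broken_urls()`; spaces inside arbitrary path segments are not handled because they cannot be told apart from the end of the URL.

```python
import re


def fix_broken_urls(text):
    """Remove whitespace that PDF extraction inserted inside URLs."""
    text = re.sub(r"(https?://)\s+", r"\1", text)        # "https:// doi.org"
    text = re.sub(r"(https?://www\.)\s+", r"\1", text)   # "https://www. nature.com"
    text = re.sub(r"(doi\.org/)\s+", r"\1", text)        # "doi.org/ 10.1016/..."
    return text
```

Running the repair once on the whole extracted text, before any DOI or URL matching, keeps every later step working on clean links.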
Anti-bot protections (HTTP 403): Many publishers (MDPI, Elsevier, Wiley, ACS) aggressively block automated downloads even for open-access papers. Common mitigations:
- MDPI: use the mdpi-res.com CDN domain instead of www.mdpi.com
- Elsevier: the /pdfft endpoint sometimes works for OA articles
PMC PDF download: The URL format https://pmc.ncbi.nlm.nih.gov/articles/PMC.../pdf/ often returns HTML instead of PDF. Use EuropePMC's backend URL instead: https://europepmc.org/backend/ptpmcrender.fcgi?accid=PMC...&blobtype=pdf.
PDF validation is mandatory: NEVER trust file size alone. Always check that the first 5 bytes of a downloaded file are %PDF-. Many failed downloads produce valid-looking HTML files (e.g., 1.8KB login pages, 403 error pages). Delete any file that fails this check.
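The check itself is short. The sketch below combines the magic-byte test with the 5KB minimum size; the names `is_valid_pdf` and `keep_or_delete` are hypothetical.

```python
import os

MIN_PDF_BYTES = 5 * 1024  # minimum valid size: 5KB


def is_valid_pdf(path, min_bytes=MIN_PDF_BYTES):
    """True if the file is large enough and starts with the %PDF- magic bytes."""
    if os.path.getsize(path) < min_bytes:
        return False
    with open(path, "rb") as fh:
        return fh.read(5) == b"%PDF-"


def keep_or_delete(path):
    """Keep a valid PDF; delete anything else (HTML error pages, redirects)."""
    if is_valid_pdf(path):
        return True
    os.remove(path)
    return False
```

Calling `keep_or_delete` immediately after each download means a failed attempt never leaves a misleading file behind.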
Encoding: On a Windows console that uses the GBK code page, printing characters outside GBK raises UnicodeEncodeError. Use .encode('ascii', 'replace').decode() for console output.
Reference parsing: The auto-detection approach (numbered vs. author-first) handles most academic papers. For unusual formats (e.g., Vancouver style [1], or inline-citation styles), the agent should inspect the extracted text and adjust patterns.
arXiv rate limiting: Always include a 3-second delay between downloads. arXiv will block rapid requests.
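A per-host throttle is one way to enforce these delays. The sketch below assumes a 2-second default and a 3-second delay for arxiv.org; the class name `HostThrottle` and its injectable clock and sleep functions are illustrative choices, not part of the scripts.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Sleep as needed so requests to one host are spaced by a minimum delay."""

    def __init__(self, default_delay=2.0, per_host=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.default_delay = default_delay
        self.per_host = per_host or {"arxiv.org": 3.0}
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the previous request

    def wait(self, url):
        """Block until the host may be contacted; return the seconds slept."""
        host = urlparse(url).netloc.lower()
        delay = self.per_host.get(host, self.default_delay)
        last = self._last.get(host)
        pause = 0.0
        if last is not None:
            pause = max(0.0, delay - (self._clock() - last))
            if pause:
                self._sleep(pause)
        self._last[host] = self._clock()
        return pause
```

Call `throttle.wait(url)` right before each request; hosts that have not been contacted recently are not delayed at all.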
SSL issues: Some corporate/educational networks require disabling SSL verification. The scripts handle this by default.
Duplicate references: Some papers cite the same work under different indices (e.g., a preprint and its published version). The agent should detect duplicates by DOI in the verification step.
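DOI-based detection might look like the sketch below, which groups reference indices by normalized DOI. The function names are hypothetical, and the input is assumed to be the parsed references with `index` and `doi` fields.

```python
from collections import defaultdict


def normalize_doi(doi):
    """Lowercase a DOI and strip common URL or 'doi:' prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi.rstrip(".")


def find_duplicates(refs):
    """Map each DOI that appears more than once to the indices citing it."""
    by_doi = defaultdict(list)
    for ref in refs:
        if ref.get("doi"):
            by_doi[normalize_doi(ref["doi"])].append(ref["index"])
    return {doi: idx for doi, idx in by_doi.items() if len(idx) > 1}
```

Note that a preprint and its published version normally carry different DOIs, so matching on DOI alone only catches references that share one; catching the preprint case would need an additional comparison, for example on titles.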
The agent should adapt paths, filenames, and the missing-reference URL research to each specific paper. The scripts accept command-line arguments rather than hardcoded paths. For papers with unusual reference formats, the agent may need to adjust the parsing heuristics in parse_references.py.