Agent Survey Corpus

Download a small corpus of open-access arXiv survey/review PDFs about LLM agents and extract text for style learning. **Trigger**: agent survey corpus, ref c...
Source: willoscar
Uncategorized · clawhub · v1.0.0 · 1 version · 99783.1 · Key: not required

Overview

Agent Survey Corpus (arXiv PDFs → text extracts)

Goal: create a small, local reference library so you can learn from real agent surveys when refining:

  • C2 outline structure (paper-like sectioning)
  • C4 tables/claims organization
  • C5 writing style and density

This is intentionally not part of the pipeline; it is an optional, repo-level toolkit.

Inputs

  • ref/agent-surveys/arxiv_ids.txt
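
As an illustration of the input format (one arXiv id per line), the file could be read roughly as follows. This is a minimal sketch, not the script's actual parser: the function name `read_arxiv_ids`, the `#` comment handling, and the de-duplication are assumptions, and the ids in the sample are placeholders.

```python
def read_arxiv_ids(text: str) -> list[str]:
    """Return unique arXiv ids, one per line, in first-seen order."""
    ids: list[str] = []
    seen: set[str] = set()
    for raw in text.splitlines():
        # Assumption: anything after '#' is a comment; blank lines are skipped.
        line = raw.split("#", 1)[0].strip()
        if line and line not in seen:
            seen.add(line)
            ids.append(line)
    return ids


# Placeholder ids, not real papers.
sample = "0000.00001\n\n# a comment line\n0000.00002\n0000.00001\n"
print(read_arxiv_ids(sample))
```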

Outputs

  • ref/agent-surveys/pdfs/
  • ref/agent-surveys/text/
  • ref/agent-surveys/STYLE_REPORT.md (tracked; auto-generated summary)

Workflow

1) Edit ref/agent-surveys/arxiv_ids.txt (one arXiv id per line).

2) Run the downloader to fetch PDFs and extract the first N pages to text.

3) Skim the extracted text under ref/agent-surveys/text/:

  • Look at section counts (H2), subsection granularity (H3), and how each survey transitions between chapters.
  • Identify repeated rhetorical patterns you want the pipeline writer to imitate.
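
The skim in step 3 can be partly automated. The sketch below is a heuristic, not part of the toolkit: it assumes headings in the extracted text appear as short lines that start with a section number (for example "3 Planning" or "3.1 Task Decomposition"), and it treats depth 1 as roughly H2 and depth 2 as roughly H3. The name `heading_depths` and the regex are illustrative; numbered list items or lines that begin with a year may be counted by mistake.

```python
import re
from collections import Counter

# Heuristic: section number, optional trailing dot, then a capitalized title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+([A-Z][^\n]{0,80})$")


def heading_depths(text: str) -> Counter:
    """Count candidate headings by numbering depth (1 = '2', 2 = '2.1', ...)."""
    depths: Counter = Counter()
    for line in text.splitlines():
        match = HEADING.match(line.strip())
        if match:
            depths[match.group(1).count(".") + 1] += 1
    return depths


sample = "1 Introduction\nBody text.\n2 Background\n2.1 Agents\n2.2 Tools\n"
print(dict(heading_depths(sample)))
```

Running this over each file under ref/agent-surveys/text/ gives a rough per-survey view of section counts and subsection granularity; transitions and rhetorical patterns still need a manual read.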

Script

Quick Start

  • python scripts/run.py --help
  • python scripts/run.py --workspace . --max-pages 20
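
For orientation, the per-id work described in the workflow (fetch a PDF, extract the first N pages to text) might look something like the sketch below. It is not the contents of scripts/run.py. The output file names, the helper names, and the use of the third-party `pypdf` package are all assumptions; the only details taken from this document are the pdfs/ and text/ directories and the page limit.

```python
from pathlib import Path
from urllib.request import urlopen


def pdf_url(arxiv_id: str) -> str:
    """Build the arXiv PDF URL for an id."""
    return f"https://arxiv.org/pdf/{arxiv_id}"


def output_paths(workspace: Path, arxiv_id: str) -> tuple[Path, Path]:
    """Return (pdf_path, text_path); the file naming here is hypothetical."""
    base = workspace / "ref" / "agent-surveys"
    name = arxiv_id.replace("/", "_")  # old-style ids contain a slash
    return base / "pdfs" / f"{name}.pdf", base / "text" / f"{name}.txt"


def fetch_and_extract(workspace: Path, arxiv_id: str, max_pages: int = 20) -> Path:
    """Download one PDF and write the text of its first max_pages pages."""
    from pypdf import PdfReader  # third-party; imported lazily (assumed dependency)

    pdf_path, text_path = output_paths(workspace, arxiv_id)
    pdf_path.parent.mkdir(parents=True, exist_ok=True)
    text_path.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(pdf_url(arxiv_id), timeout=60) as resp:
        pdf_path.write_bytes(resp.read())
    chunks = []
    for index, page in enumerate(PdfReader(str(pdf_path)).pages):
        if index >= max_pages:
            break
        chunks.append(page.extract_text() or "")
    text_path.write_text("\n\n".join(chunks), encoding="utf-8")
    return text_path


# Placeholder id; nothing is downloaded by this line.
print(pdf_url("0000.00001"))
```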

All Options

  • --workspace (use . to write into repo root)
  • --inputs (default: ref/agent-surveys/arxiv_ids.txt)
  • --max-pages (default: 20)
  • --sleep (default: 1.0)
  • --overwrite (re-download + re-extract)
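
The options above map naturally onto an `argparse` parser. The following is a sketch of how such a parser could be declared, using the documented flag names and defaults; the help strings, the types, and making --workspace required are assumptions, and the real script may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented flags; behaviour of the real script may differ."""
    parser = argparse.ArgumentParser(description="Fetch arXiv survey PDFs and extract text.")
    # Assumption: --workspace is required, since no default is documented.
    parser.add_argument("--workspace", required=True, help="workspace root ('.' for repo root)")
    parser.add_argument("--inputs", default="ref/agent-surveys/arxiv_ids.txt",
                        help="file with one arXiv id per line")
    parser.add_argument("--max-pages", type=int, default=20,
                        help="number of leading pages to extract")
    parser.add_argument("--sleep", type=float, default=1.0,
                        help="seconds to wait between downloads")
    parser.add_argument("--overwrite", action="store_true",
                        help="re-download and re-extract existing files")
    return parser


args = build_parser().parse_args(["--workspace", ".", "--max-pages", "30"])
print(args.workspace, args.inputs, args.max_pages, args.sleep, args.overwrite)
```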

Examples

  • Download/extract into repo root ref/:
  • python scripts/run.py --workspace . --max-pages 20
  • Download/extract into a specific folder (treated as workspace root):
  • python scripts/run.py --workspace /tmp/surveys --max-pages 30

Troubleshooting

  • Download fails / timeout: rerun with a larger --sleep, or try fewer ids.
  • Text extract is empty: the PDF may be scanned; try another survey or increase --max-pages.
  • Files showing up in git status: PDFs and text extracts should be ignored via .gitignore (ref/**/pdfs/, ref/**/text/); check that these patterns are present.
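
To spot empty or near-empty extracts (the likely sign of a scanned PDF) without opening every file, a small check along these lines may help. It is a sketch under assumptions: the `.txt` extension, the function name `find_thin_extracts`, and the 200-character threshold are illustrative choices, not something the toolkit defines.

```python
from pathlib import Path


def find_thin_extracts(text_dir: Path, min_chars: int = 200) -> list[Path]:
    """Return extract files whose stripped content is shorter than min_chars."""
    thin: list[Path] = []
    for path in sorted(text_dir.glob("*.txt")):  # assumption: extracts are .txt files
        content = path.read_text(encoding="utf-8", errors="replace")
        if len(content.strip()) < min_chars:
            thin.append(path)
    return thin


# Usage: list suspicious extracts in the default output folder, if it exists.
text_dir = Path("ref/agent-surveys/text")
if text_dir.is_dir():
    for path in find_thin_extracts(text_dir):
        print(f"possibly scanned or failed extract: {path}")
```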

Version History

1 version in total

  • v1.0.0 (current)
    2026-03-30 22:57 · Safe · Safe
