
Arxiv Search

Retrieve paper metadata from arXiv using keyword queries and save results as JSONL (`papers/papers_raw.jsonl`). **Trigger**: arXiv, arxiv, paper search, meta...
Source: willoscar
clawhub v1.0.0 · Key: not required

Overview

arXiv Search (metadata-first)

Collect an initial paper set with enough metadata to support downstream ranking, taxonomy building, and citation generation.

When online, prefer rich arXiv metadata (categories, arxiv_id, pdf_url, published/updated, etc.). When offline, accept an export and convert it cleanly.

Load Order

Always read:

  • references/domain_pack_overview.md — how domain packs drive topic-specific behavior

Domain packs (loaded by topic match):

  • assets/domain_packs/llm_agents.json — pinned IDs, query rewrite rules for LLM agent topics

Script Boundary

Use scripts/run.py only for:

  • arXiv API retrieval and XML parsing
  • offline export conversion (CSV/JSON/JSONL normalization)
  • metadata enrichment via id_list backfill

Do not treat run.py as the place for:

  • hardcoded topic detection or query rewriting (use domain packs)
  • domain-specific pinned paper lists (externalize to assets/domain_packs/)

Input

  • queries.md (keywords, excludes, time window)

Outputs

  • papers/papers_raw.jsonl (JSONL; 1 paper per line)
  • Each record includes at least: title, authors, year, url, abstract
  • In online mode (arXiv API), records also include additional metadata: arxiv_id, pdf_url, categories, primary_category, published, updated, doi, journal_ref, comment
  • Convenience index (optional, but generated by the script):
    • papers/papers_raw.csv
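
The record shape listed above can be illustrated with a minimal sketch that maps one arXiv Atom entry to a JSONL line. This is not the code in scripts/run.py: the helper names (`parse_feed`, `_text`) and the sample feed are hypothetical, and the way `arxiv_id` is derived from the entry URL is an assumption. Only the two XML namespaces are taken from the arXiv API's Atom output.

```python
import json
import xml.etree.ElementTree as ET

# Namespaces used by arXiv's Atom feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hypothetical sample entry with placeholder values (not a real paper).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2024-01-15T00:00:00Z</published>
    <updated>2024-02-01T00:00:00Z</updated>
    <title>An Example
      Title</title>
    <summary>Placeholder abstract.</summary>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
    <link title="pdf" href="http://arxiv.org/pdf/0000.00000v1" rel="related"/>
    <arxiv:primary_category term="cs.CL"/>
    <category term="cs.CL"/>
    <category term="cs.AI"/>
  </entry>
</feed>"""


def _text(node, path):
    """Return whitespace-collapsed text of a child element, or ''."""
    child = node.find(path, NS)
    if child is None or child.text is None:
        return ""
    return " ".join(child.text.split())


def parse_feed(xml_text):
    """Convert an Atom feed string into a list of record dicts."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.findall("atom:entry", NS):
        url = _text(entry, "atom:id")
        published = _text(entry, "atom:published")
        pdf_url = ""
        for link in entry.findall("atom:link", NS):
            if link.get("title") == "pdf":
                pdf_url = link.get("href", "")
        primary = entry.find("arxiv:primary_category", NS)
        records.append({
            "title": _text(entry, "atom:title"),
            "authors": [_text(a, "atom:name")
                        for a in entry.findall("atom:author", NS)],
            "year": int(published[:4]) if published[:4].isdigit() else None,
            "url": url,
            "abstract": _text(entry, "atom:summary"),
            # Assumption: the ID is whatever follows "/abs/" in the URL.
            "arxiv_id": url.rsplit("/abs/", 1)[-1],
            "pdf_url": pdf_url,
            "categories": [c.get("term")
                           for c in entry.findall("atom:category", NS)],
            "primary_category": primary.get("term") if primary is not None else "",
            "published": published,
            "updated": _text(entry, "atom:updated"),
        })
    return records


records = parse_feed(SAMPLE_FEED)
jsonl_line = json.dumps(records[0], ensure_ascii=False)  # one paper per line
```

Collapsing whitespace in the title and abstract matters because Atom text often wraps across lines, which would otherwise leak newlines into the JSONL fields.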

Decision: online vs offline

  • If you have network access: run arXiv API retrieval.
  • If not: import an export the user provides (CSV/JSON/JSONL) and normalize fields.
  • Hybrid: if you import offline but still have network later, you can enrich missing fields (abstract/authors/categories) via arXiv id_list using --enrich-metadata or queries.md enrich_metadata: true.
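
To make the two online request styles concrete, here is a rough sketch of how a keyword search and an id_list lookup could be turned into arXiv API URLs. The function name `build_query_url`, the ID pattern, and the choice of the `all:` field prefix are assumptions for illustration; run.py may build its requests differently. The pattern only recognizes new-style IDs such as 2509.02547.

```python
import re
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

# Assumption: only new-style IDs (YYMM.NNNNN, optional version) are detected.
ARXIV_ID_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def build_query_url(query, max_results=200, start=0):
    """Return an arXiv API URL for a keyword query or a bare arXiv ID."""
    query = query.strip()
    if ARXIV_ID_RE.match(query):
        # Direct lookup: the same parameter can serve metadata backfill.
        params = {"id_list": query}
    else:
        # Phrase search across all fields.
        params = {"search_query": f'all:"{query}"'}
    params.update({"start": start, "max_results": max_results})
    return f"{ARXIV_API}?{urlencode(params)}"


id_url = build_query_url("2509.02547", max_results=1)
search_url = build_query_url("LLM agent", max_results=300)
```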

Workflow (heuristic)

  1. Read queries.md and expand into concrete query strings.
  2. Retrieve results (online) or import an export (offline).
  3. Normalize every record to include at least:
    • title, authors (array), year, url, abstract
  4. Keep the set broad at this stage; dedupe/ranking comes next.
  5. Apply time window and max_results if specified.
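
Steps 3 and 5 can be sketched for the offline CSV case as follows. The column names, the `;` author separator, and the decision to keep records whose year is unknown are all assumptions made for this example, as are the function names `normalize_record` and `apply_limits`; the real script may behave differently.

```python
import csv
import io

# Hypothetical export; real exports may use other columns and separators.
SAMPLE_CSV = """title,authors,year,url,abstract
Paper A,"X. One; Y. Two",2021,http://example.org/a,First
Paper B,Z. Three,2023,http://example.org/b,Second
Paper C,,not-a-year,http://example.org/c,Third
"""


def normalize_record(row):
    """Coerce one imported row to the minimum record shape."""
    authors = row.get("authors") or []
    if isinstance(authors, str):
        # Assumption: authors in a flat export are separated by ';'.
        authors = [a.strip() for a in authors.split(";") if a.strip()]
    try:
        year = int(row.get("year"))
    except (TypeError, ValueError):
        year = None
    return {
        "title": " ".join(str(row.get("title") or "").split()),
        "authors": list(authors),
        "year": year,
        "url": str(row.get("url") or "").strip(),
        "abstract": str(row.get("abstract") or "").strip(),
    }


def apply_limits(records, year_from=None, year_to=None, max_results=None):
    """Filter by time window, then cap the number of records."""
    kept = []
    for rec in records:
        year = rec["year"]
        # Assumption: unknown years are kept so the set stays broad.
        if year is not None:
            if year_from is not None and year < year_from:
                continue
            if year_to is not None and year > year_to:
                continue
        kept.append(rec)
    return kept if max_results is None else kept[:max_results]


rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
records = [normalize_record(row) for row in rows]
windowed = apply_limits(records, year_from=2022, year_to=2025)
```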

Quality checklist

  • [ ] papers/papers_raw.jsonl exists.
  • [ ] Each line is valid JSON and contains title, authors, year, url.
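
The second checklist item can be checked mechanically. The sketch below is a hypothetical helper (`check_jsonl` is not part of run.py) that reports, per line, invalid JSON or missing required keys; it treats "contains" as "the key is present", which is an assumption.

```python
import json

REQUIRED = ("title", "authors", "year", "url")


def check_jsonl(lines):
    """Return a list of (line_number, problem) tuples; empty means OK."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((lineno, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED if key not in record]
        if missing:
            problems.append((lineno, "missing: " + ", ".join(missing)))
    return problems


sample_lines = [
    '{"title": "A", "authors": ["X"], "year": 2024, "url": "http://example.org/a"}',
    '{"title": "B", "authors": ["Y"]}',
    '{"title": "C", ',
]
problems = check_jsonl(sample_lines)
```

To run it against the real output, pass an open file handle, for example `check_jsonl(open("papers/papers_raw.jsonl", encoding="utf-8"))`.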

Side effects

  • Allowed: create/overwrite papers/papers_raw.jsonl; append notes to STATUS.md.
  • Not allowed: write prose sections in output/ before writing is approved.

Script

Quick Start

  • python scripts/run.py --help
  • Online: python scripts/run.py --workspace <workspace> --query "<keywords>" --max-results 200
  • Offline import: python scripts/run.py --workspace <workspace> --input <path>

All Options

  • --query <text>: repeatable; multiple queries are unioned
  • --exclude <text>: repeatable; excludes applied after retrieval
  • --max-results <n>: cap total retrieved
  • --input <path>: offline mode (CSV/JSON/JSONL)
  • --enrich-metadata: best-effort enrich via arXiv id_list (needs network)
  • queries.md also supports: keywords, exclude, time window, max_results, enrich_metadata
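
The phrases "multiple queries are unioned" and "excludes applied after retrieval" could work roughly as sketched below. Both functions are hypothetical: keying the union on `arxiv_id` (falling back to `url`) and matching excludes as case-insensitive substrings of title plus abstract are assumptions, not documented behavior of run.py.

```python
def union_results(result_lists):
    """Merge per-query result lists, keeping the first copy of each paper."""
    seen = set()
    merged = []
    for results in result_lists:
        for rec in results:
            # Assumption: identity is the arXiv ID, else the URL.
            key = rec.get("arxiv_id") or rec.get("url")
            if key in seen:
                continue
            seen.add(key)
            merged.append(rec)
    return merged


def apply_excludes(records, excludes):
    """Drop records whose title or abstract mentions an excluded term."""
    terms = [term.lower() for term in excludes]
    kept = []
    for rec in records:
        haystack = f"{rec.get('title', '')} {rec.get('abstract', '')}".lower()
        if any(term in haystack for term in terms):
            continue
        kept.append(rec)
    return kept


# Placeholder records standing in for two query result sets.
query_a = [
    {"arxiv_id": "0000.00001", "title": "Agents with Tools", "abstract": ""},
    {"arxiv_id": "0000.00002", "title": "A Survey of Agents", "abstract": ""},
]
query_b = [
    {"arxiv_id": "0000.00001", "title": "Agents with Tools", "abstract": ""},
    {"arxiv_id": "0000.00003", "title": "Tool Use at Scale", "abstract": ""},
]
merged = union_results([query_a, query_b])
filtered = apply_excludes(merged, ["survey"])
```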

Examples

  • Online (multi-query + excludes):
    • python scripts/run.py --workspace <workspace> --query "LLM agent" --query "tool use" --exclude "survey" --max-results 300
  • Fetch a single paper by arXiv ID (direct id_list fetch):
    • python scripts/run.py --workspace <workspace> --query 2509.02547 --max-results 1
  • Offline auto-detect (no flags):
    • Place papers/import.csv (or .json/.jsonl) under the workspace, then run: python scripts/run.py --workspace <workspace>
  • Offline import + time window (via queries.md):
    • In queries.md, set - time window: { from: 2022, to: 2025 }, then run the offline import as usual

Troubleshooting

Common Issues

Issue: papers/papers_raw.jsonl is empty

Symptom:

  • Script exits with “No results returned …” or output file is empty.

Causes:

  • Network is blocked (online mode).
  • Queries are too narrow or queries.md is empty.

Solutions:

  • Use offline import: place papers/import.csv|json|jsonl in the workspace or pass --input.
  • Broaden keywords and reduce excludes in queries.md.
  • Run with explicit --query to sanity-check the parser.

Issue: Offline import records are missing fields

Symptom:

  • Downstream steps fail because records lack authors/year/abstract/url.

Causes:

  • Export columns don’t match expected fields; upstream export is incomplete.

Solutions:

  • Ensure the export contains at least title, authors, year, url, abstract.
  • If you later have network, use --enrich-metadata to backfill missing fields (best effort).
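
"Best effort" backfill can be read as: fill only the fields that are empty, and never overwrite what the import already provided. A minimal sketch under that assumption, with a hypothetical `backfill` helper and placeholder data:

```python
EMPTY = (None, "", [])


def backfill(record, fetched):
    """Return a copy of record with empty fields filled from fetched."""
    merged = dict(record)
    for key, value in fetched.items():
        # Only fill gaps; existing non-empty values win.
        if merged.get(key) in EMPTY and value not in EMPTY:
            merged[key] = value
    return merged


# Placeholder data: an incomplete import and the metadata fetched later.
imported = {"title": "Paper B", "authors": [], "year": 2023,
            "url": "http://example.org/b", "abstract": ""}
fetched = {"title": "Paper B (fetched title)", "authors": ["Z. Three"],
           "abstract": "Second", "categories": ["cs.CL"]}
enriched = backfill(imported, fetched)
```

Keeping the imported title rather than the fetched one is a design choice of this sketch; a script could equally prefer the arXiv values.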

Recovery Checklist

  • [ ] Confirm queries.md has non-empty keywords (or pass --query).
  • [ ] If offline: confirm workspace has papers/import.* and rerun.
  • [ ] Spot-check 3–5 JSONL lines: valid JSON + required fields.

Version history

1 version in total

  • v1.0.0 (current)
    2026-03-30 04:22 Safe

Security check

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
