Collect an initial paper set with enough metadata to support downstream ranking, taxonomy building, and citation generation.
When online, prefer rich arXiv metadata (categories, arxiv_id, pdf_url, published/updated, etc.). When offline, accept an export and convert it cleanly.
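As a rough illustration of converting an offline export cleanly, the sketch below normalizes CSV rows into JSONL records with the title, authors (array), year, url, and abstract fields. It is not the logic in scripts/run.py. The function names, the semicolon author separator, and the export's column names are assumptions made for this example.

```python
import csv
import io
import json


def normalize_row(row, author_sep=";"):
    """Convert one CSV export row (a dict of strings) into a normalized record.

    Assumption: the export uses columns named title, authors, year, url,
    abstract, and separates author names with `author_sep`.
    """
    raw_authors = row.get("authors") or ""
    authors = [a.strip() for a in raw_authors.split(author_sep) if a.strip()]
    year_text = (row.get("year") or "").strip()
    return {
        # Collapse runs of whitespace left over from spreadsheet exports.
        "title": " ".join((row.get("title") or "").split()),
        "authors": authors,
        "year": int(year_text) if year_text.isdigit() else None,
        "url": (row.get("url") or "").strip(),
        "abstract": " ".join((row.get("abstract") or "").split()),
    }


def csv_to_jsonl(csv_text):
    """Return JSONL text with one normalized paper record per line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [json.dumps(normalize_row(r), ensure_ascii=False) for r in reader]
    return "\n".join(lines)
```

Keeping authors as an array and year as an integer (or null when unparseable) makes missing values easy to detect later instead of hiding them in free-form strings.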
Always read:
- references/domain_pack_overview.md: how domain packs drive topic-specific behavior

Domain packs (loaded by topic match):
- assets/domain_packs/llm_agents.json: pinned IDs, query rewrite rules for LLM agent topics

Use scripts/run.py only for:
- id_list backfill

Do not treat run.py as the place for:
- assets/domain_packs/)

- queries.md (keywords, excludes, time window)
- papers/papers_raw.jsonl (JSONL; 1 paper per line)
- title, authors, year, url, abstract
- arxiv_id, pdf_url, categories, primary_category, published, updated, doi, journal_ref, comment
- papers/papers_raw.csv
- Enrich via arXiv id_list using --enrich-metadata or queries.md enrich_metadata: true.

- Read queries.md and expand into concrete query strings.
- title, authors (array), year, url, abstract
- max_results if specified.

- papers/papers_raw.jsonl exists.
- title, authors, year, url.
- papers/papers_raw.jsonl; append notes to STATUS.md.
- output/ before writing is approved.

python scripts/run.py --help
python scripts/run.py --workspace --query "" --max-results 200
python scripts/run.py --workspace --input

- --query: repeatable; multiple queries are unioned
- --exclude: repeatable; excludes applied after retrieval
- --max-results: cap total retrieved
- --input: offline mode (CSV/JSON/JSONL)
- --enrich-metadata: best-effort enrich via arXiv id_list (needs network)
- queries.md also supports: keywords, exclude, time window, max_results, enrich_metadata

python scripts/run.py --workspace --query "LLM agent" --query "tool use" --exclude "survey" --max-results 300

id_list fetch):
python scripts/run.py --workspace --query 2509.02547 --max-results 1

Place papers/import.csv (or .json/.jsonl) under the workspace, then run:
python scripts/run.py --workspace

queries.md):
- time window: { from: 2022, to: 2025 }
then run offline import normally

papers/papers_raw.jsonl is empty
Symptom:
Causes:
- queries.md is empty.
Solutions:
- Place papers/import.csv|json|jsonl in the workspace or pass --input.
- queries.md.
- Pass --query to sanity-check the parser.

Symptom:
- authors/year/abstract/url.
Causes:
Solutions:
- title, authors, year, url, abstract.
- Use --enrich-metadata to backfill missing fields (best effort).
- Check that queries.md has non-empty keywords (or pass --query).
- papers/import.* and rerun.
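
When records come back with missing authors, year, abstract, or url, a quick check along these lines can show which lines of papers/papers_raw.jsonl are affected. This is a minimal sketch and not part of scripts/run.py. The names find_incomplete_records and REQUIRED_FIELDS are hypothetical, and treating empty strings and empty lists as missing is an assumption.

```python
import json

# Core fields this document lists for each paper record.
REQUIRED_FIELDS = ("title", "authors", "year", "url", "abstract")


def find_incomplete_records(jsonl_text):
    """Return (line_number, missing_fields) for each record lacking a core field.

    A field counts as missing when it is absent or empty ("" / [] / None).
    """
    problems = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
        if missing:
            problems.append((line_no, missing))
    return problems
```

For example, the function can be called on the file contents read with pathlib.Path("papers/papers_raw.jsonl").read_text(encoding="utf-8"). An empty result means every record carries all five core fields.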