Use this skill when you want model-led query planning and model-led relevance filtering.
Scripts are tools. The model performs the reasoning and decisions:
python3 scripts/init_collection_run.py \
--output-root /path/to/data \
--topic "LLM applications in Lean 4 formalization" \
--keywords "Lean 4,LLM,formalization" \
--categories "cs.AI,cs.LO" \
--target-range 5-10 \
--lookback 30d \
--language English
This creates a run directory with task_meta.json, task_meta.md, query_results/, and query_selection/.
--language must be set manually for each collection run.
If --language is non-English (for example Chinese), generated markdown files are written in that language:
task_meta.md
query_results/
/metadata.md
papers_index.md
Follow these rules before running per-query fetch:
3 queries for small/medium targets (2-5, 5-10).
4 queries for larger targets (10-50 or above).
Let target_max be the upper bound of the target range.
target_per_query = ceil(target_max / query_count).
max_results = target_per_query * 2 (or * 3 when recall is more important).
Example: target range 5-10, query count 3 -> target_per_query = 4 -> each query fetches 8-12.
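The sizing rules above can be sketched in a few lines of Python. This is a minimal illustration and not code from the skill's scripts: the function name `plan_fetch_sizes` is hypothetical, and the threshold "upper bound of at most 10 means small/medium" is an assumption read off the example ranges.

```python
import math


def plan_fetch_sizes(target_range: str, oversample: int = 2) -> dict:
    """Derive query count and per-query fetch size from a range like "5-10"."""
    _, target_max = (int(part) for part in target_range.split("-"))
    # Assumption: "small/medium" means an upper bound of at most 10.
    query_count = 3 if target_max <= 10 else 4
    target_per_query = math.ceil(target_max / query_count)
    return {
        "query_count": query_count,
        "target_per_query": target_per_query,
        "max_results": target_per_query * oversample,
    }


print(plan_fetch_sizes("5-10"))                # oversample x2
print(plan_fetch_sizes("5-10", oversample=3))  # recall-oriented, x3
```

For the range 5-10 this gives 3 queries, a per-query target of 4, and a fetch size of 8 (or 12 with oversample 3), matching the worked example.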
Use OR inside the same semantic group (synonyms), and AND across groups.
Within a group, use OR to increase recall.
LLM OR "large language model" OR AI.
"Lean 4" OR Lean OR "formal language".
Across groups, use AND to keep relevance.
(LLM-group) AND (Lean-group).
() AND () [AND ]
Theme A: LLM applications in Lean 4 formalization
all:"LLM applications in Lean 4 formalization"
(all:"Lean 4" OR all:"Lean" OR all:"formal language") AND (all:"LLM" OR all:"large language model" OR all:"AI")
(all:"Lean" OR all:"formalization") AND (all:"LLM" OR all:"large language model") AND all:"theorem proving"
(all:"Lean" OR all:"proof assistant") AND (all:"AI" OR all:"LLM")
Theme B: agentic tool use for code generation
all:"agentic tool use code generation"
(all:"agentic" OR all:"autonomous agent") AND (all:"LLM" OR all:"large language model")
(all:"tool use" OR all:"function calling") AND (all:"coding assistant" OR all:"code generation")
Theme C: multimodal reasoning with retrieval
all:"multimodal reasoning retrieval"
(all:"multimodal" OR all:"vision language") AND (all:"retrieval" OR all:"RAG")
(all:"multimodal model" OR all:"vision language model") AND (all:"reasoning" OR all:"tool use")
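The OR-within-group, AND-across-groups pattern used in the themes above can be composed mechanically. The helper below is a hypothetical sketch (the name `build_query` is not part of the skill); it only assembles the query string and does not call the arXiv API.

```python
def build_query(groups: list[list[str]], field: str = "all") -> str:
    """Join synonyms with OR inside each group, then join groups with AND."""
    clauses = []
    for terms in groups:
        joined = " OR ".join(f'{field}:"{term}"' for term in terms)
        # Parenthesize only multi-term groups, as in the examples above.
        clauses.append(f"({joined})" if len(terms) > 1 else joined)
    return " AND ".join(clauses)


lean_group = ["Lean 4", "Lean", "formal language"]
llm_group = ["LLM", "large language model", "AI"]
print(build_query([lean_group, llm_group]))
```

Adding a third single-term group such as `["theorem proving"]` appends a further `AND all:"theorem proving"` constraint, which narrows the result set.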
The model defines queries manually, for example:
all:"Lean 4"
all:"LLM formalization"
all:"AI formal verification"
Recommended batch mode (safe defaults, serial execution):
python3 scripts/fetch_queries_batch.py \
--run-dir /path/to/run-dir \
--plan-json /path/to/query_plan.json
In batch mode, the script auto-applies:
--min-interval-sec 5
--retry-max 4
--retry-base-sec 5
--retry-max-sec 120
--retry-jitter-sec 1
/.runtime/arxiv_api_state.json ) for throttling
max_results from target_range and query count (default oversample x2, cap 60)
task_meta.json
Minimal query_plan.json only needs label and query.
See references/query-plan-format.md.
You normally do not need to set fetch-control args manually.
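As an illustration of the minimal plan described above, the sketch below builds entries that carry only label and query. The top-level shape (a JSON list of objects) is an assumption, as are the function name `make_plan` and the third label; confirm the real layout in references/query-plan-format.md before using it.

```python
import json


def make_plan(queries: dict[str, str]) -> list[dict]:
    """Turn {label: query} into minimal plan entries with label and query only."""
    # Assumption: the plan is a JSON list of objects; check the format reference.
    return [{"label": label, "query": query} for label, query in queries.items()]


plan = make_plan({
    "lean4": 'all:"Lean 4"',
    "llm-formalization": 'all:"LLM formalization"',
    "ai-formal-verification": 'all:"AI formal verification"',  # hypothetical label
})
print(json.dumps(plan, indent=2))
```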
If you need one-by-one manual fetch, run each query:
python3 scripts/fetch_query_metadata.py \
--run-dir /path/to/run-dir \
--label lean4 \
--query 'all:"Lean 4"' \
--max-results 30 \
--min-interval-sec 5 \
--retry-max 4 \
--language English
Output files:
query_results/ (indexed full metadata list)
query_results/ (human-readable preview)
Date range is applied directly in arXiv API search_query via submittedDate:[... TO ...].
No second local date-filter pass is performed.
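To make the date-range behavior concrete, here is a hedged sketch of how a lookback window could be folded into search_query. The arXiv API accepts submittedDate ranges with YYYYMMDDHHMM timestamps in GMT; the function name `with_date_range` and the way the window boundaries are computed are assumptions, not the script's actual code.

```python
from datetime import datetime, timedelta, timezone


def with_date_range(query: str, lookback_days: int, now: datetime) -> str:
    """Append a submittedDate range covering the last `lookback_days` days."""
    start = now - timedelta(days=lookback_days)
    fmt = "%Y%m%d%H%M"  # arXiv range timestamps: YYYYMMDDHHMM in GMT
    window = f"submittedDate:[{start.strftime(fmt)} TO {now.strftime(fmt)}]"
    return f"({query}) AND {window}"


now = datetime(2024, 3, 31, 12, 0, tzinfo=timezone.utc)
print(with_date_range('all:"Lean 4"', 30, now))
# (all:"Lean 4") AND submittedDate:[202403011200 TO 202403311200]
```

Because the window is part of the query itself, every returned entry already falls inside the lookback period.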
Rate-limit controls in fetch_query_metadata.py:
--min-interval-sec (default 5.0)
--retry-max (default 4)
--retry-base-sec (default 5.0)
--retry-max-sec (default 120.0)
--retry-jitter-sec (default 1.0)
--rate-state-path (optional override; default is /.runtime/arxiv_api_state.json )
--force to bypass cache and re-fetch
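The retry controls above suggest a capped backoff with jitter. The sketch below shows one common way such parameters interact; the doubling schedule and the function name `retry_delay` are assumptions, since the exact formula used by fetch_query_metadata.py is not stated here.

```python
import random


def retry_delay(attempt: int, base_sec: float = 5.0, max_sec: float = 120.0,
                jitter_sec: float = 1.0) -> float:
    """Delay before retry `attempt` (0-based): capped doubling plus random jitter."""
    # Assumption: the delay doubles per attempt; the script may differ.
    backoff = min(base_sec * (2 ** attempt), max_sec)
    return backoff + random.uniform(0.0, jitter_sec)


# With jitter disabled, the default four retries wait 5, 10, 20, and 40 seconds.
for attempt in range(4):
    print(attempt, retry_delay(attempt, jitter_sec=0.0))
```

Under these assumptions the cap of 120 seconds only matters if --retry-max is raised above its default.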
For each query list, the model reads indexed results and decides what to keep.
Use keep specs by index and/or arXiv ID when merging.
To explicitly drop one weak query in later iterations, set that label to an empty keep list in selection-json.
python3 scripts/merge_selected_papers.py \
--run-dir /path/to/run-dir \
--keep lean4:0,2,4 \
--keep llm-formalization:1,3 \
--language English
or with selection-json:
{
"lean4-round1": [0, 2, 4],
"lean4-round2": [],
"formalization-round2": [1, 3, 5]
}
An empty list means this query label is intentionally dropped (keep 0).
This writes final outputs:
/metadata.json
/metadata.md
papers_index.json
papers_index.md
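The --keep specs and the selection-json mapping shown earlier carry the same information, so one can be derived from the other. The converter below is a hypothetical sketch: the name `keep_specs_to_selection` is invented, and treating all-digit items as indices and anything else as an arXiv ID is an assumption about how mixed specs are interpreted.

```python
import json


def keep_specs_to_selection(specs: list[str]) -> dict[str, list]:
    """Convert specs like "lean4:0,2,4" into a {label: keep-list} mapping."""
    selection: dict[str, list] = {}
    for spec in specs:
        label, _, items = spec.partition(":")
        keep = []
        for item in (part.strip() for part in items.split(",")):
            if not item:
                continue  # an empty item list yields an empty keep list
            # Assumption: all-digit items are indices, anything else is an arXiv ID.
            keep.append(int(item) if item.isdigit() else item)
        selection[label] = keep
    return selection


selection = keep_specs_to_selection(["lean4:0,2,4", "llm-formalization:1,3"])
print(json.dumps(selection, indent=2))
```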
If relevance is weak or final count is insufficient after Step 4, iterate:
papers_index.md and per-paper metadata quality.
OR terms, keep cross-group AND constraints).
python3 scripts/merge_selected_papers.py \
--run-dir /path/to/run-dir \
--incremental \
--selection-json /path/to/updated_selection.json \
--language English
Incremental behavior:
query_selection/selected_by_query.json.
selection-json override previous selections for those labels.
[].
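The override semantics of incremental merging can be pictured as a dictionary update. This is a minimal sketch of the described behavior, not the code of merge_selected_papers.py; the function names are hypothetical.

```python
def apply_incremental(previous: dict[str, list], update: dict[str, list]) -> dict[str, list]:
    """Labels present in `update` replace earlier selections; others are kept."""
    merged = dict(previous)
    merged.update(update)
    return merged


def active_labels(selection: dict[str, list]) -> list[str]:
    """Labels that still contribute papers (an empty list means dropped)."""
    return [label for label, keep in selection.items() if keep]


previous = {"lean4-round1": [0, 2, 4], "lean4-round2": [1, 2]}
update = {"lean4-round2": [], "formalization-round2": [1, 3, 5]}
merged = apply_incremental(previous, update)
print(merged)
print(active_labels(merged))
```

In this example lean4-round1 is untouched, lean4-round2 is explicitly dropped, and formalization-round2 is added.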
Stop retrying when:
If relevant papers are genuinely scarce, it is valid to finish below the original minimum target range.
--max-results.
--force when necessary.
On 429 Too Many Requests, retry later and/or increase --min-interval-sec.
See references/io-contract.md for exact files and schema.
This skill is a sub-skill of arxiv-summarizer-orchestrator.
Pipeline position:
Stage A: arxiv-search-collector (this skill)
Stage B: arxiv-paper-processor
Stage C: arxiv-batch-reporter
This skill produces the initial paper-set structure and metadata that Stage B and Stage C depend on.