Overview

Paper Ingest Normalizer

Convert raw literature inputs into standardized records safe for project memory, paper databases, and downstream synthesis pipelines.

Input

One of the following is required:

pdf_path — local path to PDF file
url — link to paper/article
raw_text — extracted or pasted text
metadata_blob — existing metadata dict

Plus:

project_id — required for any writeback
source_type — one of: pdf, doi, url, text, metadata
tags — optional list of strings for categorization
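The input requirements above can be checked before any processing starts. The following is a minimal sketch, assuming the request arrives as a plain dict keyed by the field names listed above; the function name check_inputs is hypothetical, and treating source_type as mandatory is an assumption, since only tags is marked optional.

```python
from typing import Any

# Field names taken from the Input section above.
CONTENT_FIELDS = ("pdf_path", "url", "raw_text", "metadata_blob")
SOURCE_TYPES = {"pdf", "doi", "url", "text", "metadata"}


def check_inputs(request: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the request is usable."""
    problems = []
    # At least one content field must carry a non-empty value.
    if not any(request.get(name) for name in CONTENT_FIELDS):
        problems.append("one of pdf_path, url, raw_text, metadata_blob is required")
    # project_id is mandatory for any writeback.
    if not request.get("project_id"):
        problems.append("project_id is missing")
    # Assumption: source_type is required and must be one of the listed values.
    if request.get("source_type") not in SOURCE_TYPES:
        problems.append("source_type must be one of: " + ", ".join(sorted(SOURCE_TYPES)))
    # tags are optional, but when present they must be a list of strings.
    tags = request.get("tags")
    if tags is not None and not (
        isinstance(tags, list) and all(isinstance(t, str) for t in tags)
    ):
        problems.append("tags must be a list of strings")
    return problems
```

Returning a list of problems rather than raising on the first one lets the caller ask for everything that is missing in a single round trip.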

Output Schema

Return a structured object:

title: string
authors: string[] | null
year: number | null
source: string          # journal, conference, preprint, etc.
doi_or_url: string | null
project_id: string
paper_type: string      # experimental, theoretical, review, etc.
material_system: string | null   # e.g. "perovskite solar cell", "graphene FET"
device_type: string | null       # e.g. "FTO/glass", "flexible substrate"
key_variables: string[] | null   # independent variables studied
key_metrics: string[] | null     # measured outcomes (PCE, mobility, etc.)
core_findings: string            # 2-3 sentence neutral summary
claimed_mechanism: string | null
limitations: string | null
normalized_summary: string       # 1-2 paragraph structured summary
uncertain_fields: string[] | null  # fields that could not be verified
writeback_ready: boolean         # true only if key identity fields (title, authors, year) are present
writeback_payload: object        # the record to write into project memory
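For readers who want the schema in typed form, here is one possible rendering as a Python dataclass. It is a sketch, not the skill's actual implementation: the class name PaperRecord is hypothetical, number is mapped to int for year, and the fields are reordered so that required ones precede those with defaults, which the dataclass syntax requires.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class PaperRecord:
    # Fields declared as plain strings in the schema (no null allowed).
    title: str
    project_id: str
    source: str
    paper_type: str
    core_findings: str
    normalized_summary: str
    # Nullable fields default to None so that gaps are never filled by guessing.
    authors: Optional[list[str]] = None
    year: Optional[int] = None
    doi_or_url: Optional[str] = None
    material_system: Optional[str] = None
    device_type: Optional[str] = None
    key_variables: Optional[list[str]] = None
    key_metrics: Optional[list[str]] = None
    claimed_mechanism: Optional[str] = None
    limitations: Optional[str] = None
    uncertain_fields: Optional[list[str]] = None
    # Writeback stays off until completeness has been assessed.
    writeback_ready: bool = False
    writeback_payload: dict[str, Any] = field(default_factory=dict)


# Usage: build a partial record and serialize it to a plain dict.
record = PaperRecord(
    title="Example title",
    project_id="demo-project",
    source="preprint",
    paper_type="experimental",
    core_findings="Placeholder summary of what was measured.",
    normalized_summary="Placeholder structured summary.",
)
record_dict = asdict(record)
```

Defaulting writeback_ready to False means a record can only become writable through an explicit completeness check.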

Rules

Never write into project memory without project_id. Ask if not provided.
Separate direct observations from claimed interpretations. Mark inference vs. direct extraction.
Preserve uncertainty. Use null for missing fields; list in uncertain_fields.
Do not invent missing bibliographic fields such as authors or year.
Do not over-claim. Keep core_findings and normalized_summary grounded in what the text actually says.
Never conflate abstract with findings. The abstract states intentions; findings are what the data supports.
If writeback_ready = false, list explicitly which fields are missing and why.

PDF Extraction

For PDFs, use the summarize skill or pdfplumber/PyMuPDF to extract text before processing.
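A minimal sketch of the pdfplumber route, assuming the package is installed and the PDF has a text layer (scanned PDFs would need OCR, which is not covered here). The function names are hypothetical. Pages that yield no text are skipped rather than padded.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, skipping pages that yielded nothing."""
    return "\n\n".join(t.strip() for t in page_texts if t and t.strip())


def extract_pdf_text(pdf_path: str) -> str:
    """Extract plain text from a local PDF with pdfplumber."""
    import pdfplumber  # imported lazily so join_pages works without the package

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None or an empty string for pages
        # without a text layer; join_pages filters those out.
        return join_pages(page.extract_text() for page in pdf.pages)
```

The returned string can then be passed on as raw_text to the rest of the workflow.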

Workflow

Identify source type — determine which input field is populated
Extract raw content — PDF text, URL content, or use provided raw text
Parse bibliographic fields — title, authors, year, source, DOI
Identify research content — material system, device type, variables, metrics
Distill findings — separate what was measured from what was claimed
Assemble writeback_payload — structured record matching the schema above
Assess completeness — set writeback_ready based on presence of key identity fields
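As an illustration of the bibliographic parsing step, the sketch below pulls a DOI out of free text and returns None when nothing matches, in line with the rule against inventing missing fields. The pattern is an assumption that covers common modern DOIs; older or unusual DOIs may not match, and the function name extract_doi is hypothetical.

```python
import re
from typing import Optional

# Assumed pattern for common modern DOIs: "10.", a 4-9 digit registrant
# code, a slash, then a suffix made of the characters listed below.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_doi(text: str) -> Optional[str]:
    """Return the first DOI found in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop sentence punctuation that often trails a DOI in running text.
    # Closing parentheses are kept because a DOI suffix may contain them.
    return match.group(0).rstrip(".,;:")
```

A DOI recovered this way is a direct extraction; anything the pattern fails to find should stay null and be listed in uncertain_fields.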

Failure Handling

If parsing is incomplete:

Return partial structured output with all successfully extracted fields
Populate uncertain_fields with the list of fields that could not be determined
Set writeback_ready = false when title, authors, or year are missing
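The failure-handling steps above can be sketched as a single pass over a partially filled record. This assumes the record is a plain dict using the schema's field names; the function name assess_completeness is hypothetical, and requiring project_id here follows the rule that nothing is written without it.

```python
from typing import Any

# Identity fields named in the failure-handling rule above.
IDENTITY_FIELDS = ("title", "authors", "year")


def assess_completeness(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the record with uncertain_fields and writeback_ready set."""
    result = dict(record)
    # A field counts as missing when it is absent, None, or empty.
    missing = [name for name in IDENTITY_FIELDS if not result.get(name)]
    # Merge newly found gaps with anything already flagged, preserving order.
    flagged = list(result.get("uncertain_fields") or [])
    for name in missing:
        if name not in flagged:
            flagged.append(name)
    result["uncertain_fields"] = flagged or None
    # Writeback also needs a project_id, per the Rules section.
    result["writeback_ready"] = not missing and bool(result.get("project_id"))
    return result


# Usage: a record with no authors or year stays unwritable.
partial = assess_completeness(
    {"title": "A study", "authors": None, "year": None, "project_id": "p1"}
)
complete = assess_completeness(
    {"title": "A study", "authors": ["A. Author"], "year": 2020, "project_id": "p1"}
)
```

All successfully extracted fields pass through unchanged, so the caller still receives the partial output alongside the list of gaps.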

Cross-Reference

For synthesis after normalization, see the research skill for paper synthesis workflows.

Version History

1 version total

v1.0.0 (current)

2026-05-07 06:58


