Use this skill to turn a webpage article, URL, screenshot set, long image set, or local image collection into complete, readable, reusable text.
ffmpeg is available

- scripts/extract_visible_text.py
- scripts/postprocess_ocr_text.py: clean OCR output, merge broken spacing, remove obvious garbage, and regroup into readable sections
- scripts/extract_with_browser.js: browser-rendered fallback for JS-heavy pages
- scripts/extract_gif_frames.sh: GIF frame extraction via ffmpeg
- scripts/build_deliverable_docx.js: convert cleaned markdown into a Word document
- scripts/build_transcript_docx.js: convert transcript-style markdown into a Word document
- scripts/build_authorized_capture_docx.py: one-step pipeline that turns already-authorized browser pages, saved HTML, screenshots, and mixed inputs into a clean markdown + JSON + Word deliverable
- scripts/extract_visible_text_deliverable.py: one-step pipeline from source input to a clean markdown + JSON + Word deliverable
- scripts/extract_visible_text_transcript_deliverable.py: one-step pipeline for transcript-style full extraction output
- scripts/extract_visible_text_reading_order_deliverable.py: one-step pipeline for reading-order transcript output
- scripts/build_wechat_interleaved_docx.py: reconstruct WeChat article reading order by interleaving extracted body blocks and image OCR text in original flow order
- scripts/ocr_high_accuracy.py: higher-accuracy OCR with preprocessing variants and segmented long-image handling
- references/output-schema.md: target output structure and cleanup rules
- references/deliverable-workflow.md: one-step deliverable workflow guidance
- references/troubleshooting.md: failure patterns, environment limits, and how to respond cleanly
- references/product-positioning.md: what mature deliverable quality means for this skill
- references/generalization-plan.md: how to evolve the skill across travel deals, rule pages, event posters, and tutorial long images
- references/universal-article-extractor-spec.md: generalized capability contract for article, mixed-media, and screenshot-heavy extraction

When raw OCR is noisy, do not stop at extraction.
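The interleaving described for scripts/build_wechat_interleaved_docx.py can be pictured with a small sketch. It is not the script's actual code: the `(position, text)` pair format and the function name `interleave_blocks` are hypothetical, and it assumes each block already carries its position in the original article flow.

```python
def interleave_blocks(body_blocks, image_blocks):
    """Merge two lists of (position, text) pairs into one reading-order list.

    Hypothetical input format: position is the block's index in the original
    article flow, as recorded during extraction.
    """
    tagged = [(pos, 0, "body", text) for pos, text in body_blocks]
    tagged += [(pos, 1, "image_ocr", text) for pos, text in image_blocks]
    # Sort by position; on ties, body text is placed before image OCR text.
    tagged.sort(key=lambda item: (item[0], item[1]))
    return [{"kind": kind, "text": text} for _, _, kind, text in tagged]


if __name__ == "__main__":
    body = [(0, "Intro paragraph"), (2, "Closing paragraph")]
    images = [(1, "Text recognized inside the poster image")]
    for block in interleave_blocks(body, images):
        print(block["kind"], "|", block["text"])
```

Keeping the block kind in the output lets a later step style image OCR text differently from body text.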
For mp.weixin.qq.com URLs:
- Use scripts/build_wechat_interleaved_docx.py when the task is specifically "keep original article order" for WeChat posts.
- Report blocked: true clearly instead of pretending success.

Extract URL to markdown:
python3 {baseDir}/scripts/extract_visible_text.py \
--url 'https://example.com/post' \
--format markdown \
--output result.md
Extract URL to JSON:
python3 {baseDir}/scripts/extract_visible_text.py \
--url 'https://example.com/post' \
--format json \
--output result.json
Extract WeChat article with fallbacks:
python3 {baseDir}/scripts/extract_visible_text.py \
--url 'https://mp.weixin.qq.com/s/xxxx' \
--browser-fallback \
--page-screenshot-ocr \
--format markdown \
--output wechat.md
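The fallback order implied by these flags, together with the rule to report blocked: true rather than pretend success, might look like the sketch below. Everything in it is an assumption for illustration: the function name `extract_with_fallbacks`, the stage callables, the result keys other than `blocked`, and the `min_chars` threshold are hypothetical and do not come from the script itself.

```python
def extract_with_fallbacks(url, extractors, min_chars=200):
    """Try each (name, extractor) pair in order and report the outcome honestly.

    extractors: hypothetical list of (stage_name, callable) pairs, for example
    static fetch, browser-rendered fallback, then full-page screenshot OCR.
    min_chars: assumed threshold below which a result counts as incomplete.
    """
    attempts = []
    for name, extractor in extractors:
        try:
            text = extractor(url) or ""
        except Exception as exc:  # a failing stage should not abort the chain
            attempts.append({"stage": name, "error": str(exc)})
            continue
        attempts.append({"stage": name, "chars": len(text)})
        if len(text.strip()) >= min_chars:
            return {"blocked": False, "stage": name, "text": text, "attempts": attempts}
    # No stage produced enough text: say so instead of returning a thin result.
    return {"blocked": True, "stage": None, "text": "", "attempts": attempts}
```

Recording every attempt makes it easier to explain to the user which stage failed and why.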
Extract local screenshot or long image:
python3 {baseDir}/scripts/extract_visible_text.py \
--image ./screenshot.png \
--ocr-images \
--format markdown \
--output image-result.md
Run OCR post-processing:
python3 {baseDir}/scripts/postprocess_ocr_text.py \
--input-json ./ocr-result.json \
--title 'Clean Result' \
--body-text 'Optional summary or body text' \
--output-json ./clean.json \
--output-markdown ./clean.md
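As a rough picture of what this kind of post-processing involves, here is a minimal sketch of two of the cleanup steps: repairing broken spacing and dropping obvious garbage lines. It is not the logic of postprocess_ocr_text.py; the function name `clean_ocr_lines` and the 50% "useful character" threshold are assumptions chosen for illustration.

```python
import re


def clean_ocr_lines(lines):
    """Return OCR lines with spacing repaired and obvious garbage dropped."""
    cleaned = []
    for line in lines:
        # Collapse runs of whitespace and trim both ends.
        text = re.sub(r"\s+", " ", line).strip()
        # Remove single spaces that OCR inserts between adjacent CJK characters.
        text = re.sub(r"(?<=[\u4e00-\u9fff]) (?=[\u4e00-\u9fff])", "", text)
        if not text:
            continue
        # Assumed heuristic: a line is garbage when fewer than half of its
        # characters are letters, digits, or CJK characters.
        useful = sum(ch.isalnum() for ch in text)
        if useful / len(text) < 0.5:
            continue
        cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    raw = ["  Hello    world  ", "|| ~~ ..", "中 文 测 试", ""]
    print(clean_ocr_lines(raw))
```

Regrouping the cleaned lines into readable sections would be a separate step and is not shown here.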
Run the one-step deliverable pipeline:
python3 {baseDir}/scripts/extract_visible_text_deliverable.py \
--url 'https://mp.weixin.qq.com/s/xxxx' \
--browser-fallback \
--page-screenshot-ocr \
--ocr-images \
--dedupe \
--output-prefix ./deliverable/result
This should emit:
- result.raw.json
- result.clean.json
- result.clean.md
- result.docx

Run the already-authorized capture pipeline when the page can be opened in a browser or exported/saved first:
python3 {baseDir}/scripts/build_authorized_capture_docx.py \
--url 'https://example.com/page' \
--browser-capture \
--ocr-images \
--dedupe \
--output-prefix ./deliverable/captured
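After a pipeline run, it can help to confirm that the deliverable files actually exist before reporting success. The sketch below checks the four suffixes listed for the one-step deliverable pipeline; that the authorized capture pipeline writes the same suffixes is an assumption, and the helper name `missing_outputs` is hypothetical.

```python
from pathlib import Path

# Suffixes listed for the one-step deliverable pipeline. Whether the
# authorized capture pipeline uses the same set is an assumption.
EXPECTED_SUFFIXES = (".raw.json", ".clean.json", ".clean.md", ".docx")


def missing_outputs(output_prefix, suffixes=EXPECTED_SUFFIXES):
    """Return the expected output paths that are absent or empty."""
    missing = []
    for suffix in suffixes:
        path = Path(f"{output_prefix}{suffix}")
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(str(path))
    return missing


if __name__ == "__main__":
    problems = missing_outputs("./deliverable/result")
    if problems:
        print("Missing or empty outputs:", ", ".join(problems))
    else:
        print("All expected outputs are present.")
```

Treating an empty file as missing avoids handing over a deliverable that was created but never written.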
Useful cases:
Operational expectations for this pipeline:
Practical optimization rule:
- --url: webpage URL
- --text-file: local plain text / markdown input
- --html-file: local saved HTML page
- --image PATH: add one local image or GIF; repeat as needed
- --image-dir DIR: OCR all supported images / GIFs in a directory
- --format markdown|json: output format
- --output PATH: output file path
- --ocr-images: OCR discovered or provided images
- --dedupe: deduplicate repeated merged lines
- --browser-fallback: use browser-rendered fallback for incomplete pages
- --page-screenshot-ocr: OCR the browser full-page screenshot as a last resort
- --gif-mode none|placeholder: conservative GIF handling mode

Default target: produce something a human can read comfortably and share without cleanup.
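For readers who want to wrap or mirror the documented options, the flag list can be expressed as an argparse parser. This is a sketch of how such a parser might be declared, not the script's real argument handling; the defaults shown for --format and --gif-mode are assumptions, since the list does not state them.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Visible text extraction (sketch)")
    parser.add_argument("--url", help="webpage URL")
    parser.add_argument("--text-file", help="local plain text / markdown input")
    parser.add_argument("--html-file", help="local saved HTML page")
    # Repeatable flag: each occurrence appends one path to the list.
    parser.add_argument("--image", action="append", default=[], metavar="PATH",
                        help="add one local image or GIF; repeat as needed")
    parser.add_argument("--image-dir", metavar="DIR",
                        help="OCR all supported images / GIFs in a directory")
    # Default value is an assumption; the documentation does not state one.
    parser.add_argument("--format", choices=["markdown", "json"], default="markdown",
                        help="output format")
    parser.add_argument("--output", metavar="PATH", help="output file path")
    parser.add_argument("--ocr-images", action="store_true",
                        help="OCR discovered or provided images")
    parser.add_argument("--dedupe", action="store_true",
                        help="deduplicate repeated merged lines")
    parser.add_argument("--browser-fallback", action="store_true",
                        help="use browser-rendered fallback for incomplete pages")
    parser.add_argument("--page-screenshot-ocr", action="store_true",
                        help="OCR the browser full-page screenshot as a last resort")
    # Default value is an assumption; the documentation does not state one.
    parser.add_argument("--gif-mode", choices=["none", "placeholder"], default="placeholder",
                        help="conservative GIF handling mode")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--image", "a.png", "--image", "b.gif", "--format", "json", "--dedupe"]
    )
    print(args.image, args.format, args.dedupe)
```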
Release-quality target for article deliverables:
The skill should increasingly treat extraction as a full article understanding and recovery problem, not only a body scrape plus OCR problem:
When the user explicitly wants completeness, the skill must support a fuller extraction mode:
For clean article outputs, prefer a structure like:
For transcript outputs, prefer a structure like:
Mature-skill rule:
Read these references when needed:
- references/output-schema.md
- references/deliverable-workflow.md
- references/troubleshooting.md
- references/product-positioning.md
- references/generalization-plan.md
- references/universal-article-extractor-spec.md

ocr-local skill or compatible Tesseract.js setup.
playwright-core support.
ffmpeg.