概述

Visible Text Extractor

Use this skill to turn a webpage article, URL, screenshot set, long image set, or local image collection into complete, readable, reusable text.

Core workflow

Extract visible body text from the main source.
Discover ordered images and GIF-like assets.
OCR image content when needed.
Preserve a raw/audit layer.
Run a human-first cleanup pass.
Classify image-like content by likely information type.
Reconstruct image content into human-readable supplements instead of raw OCR dumps.
Output polished markdown first; keep raw OCR as JSON or appendix data.

What this skill is good at

General webpage article extraction
WeChat / 公众号 article extraction with special handling
News pages, blogs, tutorials, explainers, and image-heavy articles
Screenshots and long-image OCR
Image directory OCR in display order
GIF frame extraction plus OCR when ffmpeg is available
Rebuilding noisy OCR into a cleaner reading version
Producing either reader-friendly clean output or full transcript-style output

Main script

scripts/extract_visible_text.py

Supporting resources

scripts/postprocess_ocr_text.py — clean OCR output, merge broken spacing, remove obvious garbage, and regroup into readable sections
scripts/extract_with_browser.js — browser-rendered fallback for JS-heavy pages
scripts/extract_gif_frames.sh — GIF frame extraction via ffmpeg
scripts/build_deliverable_docx.js — convert cleaned markdown into a Word document
scripts/build_transcript_docx.js — convert transcript-style markdown into a Word document
scripts/build_authorized_capture_docx.py — one-step pipeline for already-authorized browser pages, saved HTML, screenshots, and mixed inputs into clean markdown + JSON + Word deliverable
scripts/extract_visible_text_deliverable.py — one-step pipeline from source input to clean markdown + JSON + Word deliverable
scripts/extract_visible_text_transcript_deliverable.py — one-step pipeline for transcript-style full extraction output
scripts/extract_visible_text_reading_order_deliverable.py — one-step pipeline for reading-order transcript output
scripts/build_wechat_interleaved_docx.py — reconstruct WeChat article reading order by interleaving extracted body blocks and image OCR text in original flow order
scripts/ocr_high_accuracy.py — higher-accuracy OCR with preprocessing variants and segmented long-image handling
references/output-schema.md — target output structure and cleanup rules
references/deliverable-workflow.md — one-step deliverable workflow guidance
references/troubleshooting.md — failure patterns, environment limits, and how to respond cleanly
references/product-positioning.md — what mature deliverable quality means for this skill
references/generalization-plan.md — how to evolve the skill across travel deals, rule pages, event posters, and tutorial long images
references/universal-article-extractor-spec.md — generalized capability contract for article, mixed-media, and screenshot-heavy extraction

Required behavior

When raw OCR is noisy, do not stop at extraction.

Keep the raw candidate layer for traceability.
Prefer readability over raw OCR score when two candidates are close.
Remove decorative fragments, isolated symbols, repeated garbage, and near-duplicate lines from the polished result.
Keep uncertainty visible instead of pretending confidence.
Never silently drop a major section when partial reconstruction is possible.
Never present raw OCR dump as the final answer if a cleaner reconstruction can be produced.
Preserve article structure when available: title, subtitle, author/source/time, heading levels, paragraphs, lists, captions, table-like rows, and appended notes.
Treat information-bearing images as first-class content rather than an appendix afterthought.
For image-heavy pages, support transcript-style and reading-order outputs in addition to clean article outputs.

WeChat / 公众号 handling

For mp.weixin.qq.com URLs:

Try dedicated article extraction first when available.
Fall back to static HTML parsing.
Fall back again to browser rendering if needed.
When the user cares about article readability, prefer reconstructing the final Word output in original reading order instead of appending all image OCR at the end.
Use scripts/build_wechat_interleaved_docx.py when the task is specifically “keep original article order” for WeChat posts.
If the page is blocked / validation-gated, report blocked: true clearly instead of pretending success.

Typical commands

Extract URL to markdown:

python3 {baseDir}/scripts/extract_visible_text.py \
  --url 'https://example.com/post' \
  --format markdown \
  --output result.md

Extract URL to JSON:

python3 {baseDir}/scripts/extract_visible_text.py \
  --url 'https://example.com/post' \
  --format json \
  --output result.json

Extract WeChat article with fallbacks:

python3 {baseDir}/scripts/extract_visible_text.py \
  --url 'https://mp.weixin.qq.com/s/xxxx' \
  --browser-fallback \
  --page-screenshot-ocr \
  --format markdown \
  --output wechat.md

Extract local screenshot or long image:

python3 {baseDir}/scripts/extract_visible_text.py \
  --image ./screenshot.png \
  --ocr-images \
  --format markdown \
  --output image-result.md

Run OCR post-processing:

python3 {baseDir}/scripts/postprocess_ocr_text.py \
  --input-json ./ocr-result.json \
  --title 'Clean Result' \
  --body-text 'Optional summary or body text' \
  --output-json ./clean.json \
  --output-markdown ./clean.md

Run the one-step deliverable pipeline:

python3 {baseDir}/scripts/extract_visible_text_deliverable.py \
  --url 'https://mp.weixin.qq.com/s/xxxx' \
  --browser-fallback \
  --page-screenshot-ocr \
  --ocr-images \
  --dedupe \
  --output-prefix ./deliverable/result

This should emit:

result.raw.json
result.clean.json
result.clean.md
result.docx

Run the already-authorized capture pipeline when the page can be opened in a browser or exported/saved first:

python3 {baseDir}/scripts/build_authorized_capture_docx.py \
  --url 'https://example.com/page' \
  --browser-capture \
  --ocr-images \
  --dedupe \
  --output-prefix ./deliverable/captured

Useful cases:

browser can open the page but direct fetch is incomplete
user provides a saved HTML page plus screenshots
user wants one command that turns visible page content into a Word document
user wants status visibility instead of silent long waits

Operational expectations for this pipeline:

print stage logs so long OCR jobs do not look stuck
fail loudly if expected outputs are not created
detect obvious WeChat validation/interstitial text early
optionally send the generated docx back to Feishu in one run
when a source is blocked, stop pretending and switch to authorized-input workflows: saved HTML, screenshots, long images, copied text

Practical optimization rule:

do not keep hammering a blocked source in the same mode
if browser/direct fetch returns validation text, pivot immediately to the best authorized artifact path
prioritize delivery quality: visible content captured by the user is better than repeated blocked fetch attempts

Key options

--url webpage URL
--text-file local plain text / markdown input
--html-file local saved HTML page
--image PATH add one local image or GIF; repeat as needed
--image-dir DIR OCR all supported images / GIFs in a directory
--format markdown|json output format
--output PATH output file path
--ocr-images OCR discovered or provided images
--dedupe deduplicate repeated merged lines
--browser-fallback use browser-rendered fallback for incomplete pages
--page-screenshot-ocr OCR the browser full-page screenshot as a last resort
--gif-mode none|placeholder conservative GIF handling mode

Quality standard

Default target: produce something a human can read comfortably and share without cleanup.

Release-quality target for article deliverables:

preserve the article's original reading order whenever the source structure allows it
avoid dumping all image OCR at the end when images belong in the middle of the article
prefer a comfortable reading experience over a mechanically grouped OCR appendix
keep English-heavy charts, dashboards, and mixed Chinese-English figures readable enough that key labels, axes, legends, and result summaries survive extraction

The skill should increasingly treat extraction as a full article understanding and recovery problem, not only a body scrape plus OCR problem:

recover visible article structure from normal webpages, WeChat posts, blogs, tutorials, and mixed-media articles
infer whether an image is mainly a price/product page, rules page, poster/event page, course outline, scenery/introduction card, or table-like detail page
pull out high-value facts first when the user wants a clean readable result
preserve near-complete text when the user wants transcript completeness
avoid raw OCR dumps as the main deliverable unless the user explicitly wants audit output

When the user explicitly wants completeness, the skill must support a fuller extraction mode:

treat each discovered image as a first-class source
prefer segmented OCR for tall or dense images
preserve near-complete per-image text blocks before compressing into summaries
keep summary and full-text layers separate instead of replacing one with the other
support reading-order transcript output so text and image-derived content can be followed from start to finish

For clean article outputs, prefer a structure like:

Title
Metadata (author/source/time) when meaningful
Main sections in order
Integrated image-derived supplements where needed
Uncertainty notes only when necessary

For transcript outputs, prefer a structure like:

Title
Intro/body chunks in order
Image text blocks in order or reading order
Tail matter / credits / appended notes

Mature-skill rule:

default users toward the clean markdown / docx outputs unless they ask for transcript completeness
keep raw JSON for audit, not as the main deliverable
degrade honestly when the source is blocked or image quality is poor
do not optimize only for one article family; keep checking travel-deal posts, rule/scoring posts, event posters, news/blog/tutorial pages, and course-outline long images

Read these references when needed:

references/output-schema.md
references/deliverable-workflow.md
references/troubleshooting.md
references/product-positioning.md
references/generalization-plan.md
references/universal-article-extractor-spec.md

Environment notes

OCR depends on the local ocr-local skill or compatible Tesseract.js setup.
Browser fallback depends on real browser availability plus playwright-core support.
GIF frame extraction depends on ffmpeg.
Some pages remain partially inaccessible due to login, anti-bot, or validation flows; mark those limits explicitly.

版本历史

共 2 个版本

v1.2.0 当前

2026-05-03 08:56 安全安全
v1.1.1

2026-03-31 05:46 安全安全

安全检测

腾讯云安全 (Keen)

安全，无风险

查看报告

腾讯云安全 (Sanbu)

安全，无风险

查看报告

Visible Text Extractor

概述

Visible Text Extractor

Core workflow

What this skill is good at

Main script

Supporting resources

Required behavior

WeChat / 公众号 handling

Typical commands

Key options

Quality standard

Environment notes

版本历史

安全检测

腾讯云安全 (Keen)

腾讯云安全 (Sanbu)

🔗 相关推荐

Tavily 搜索

Data Analysis

Stock Analysis