web-reader-pro

Advanced web content extraction skill for OpenClaw using a multi-tier fallback strategy (Jina → Scrapling → WebFetch) with intelligent routing, caching, and quality assurance.
Author: 0xcjl · Source: clawhub · Version: v1.0.0 (latest) · API key: required
Stars: 2 · Downloads: 319 · Installs: 0

Web Reader Pro - OpenClaw Skill

Overview

Web Reader Pro is an advanced web content extraction skill for OpenClaw that uses a multi-tier fallback strategy with intelligent routing, caching, and quality assessment.

Features

1. Three-Tier Fallback Strategy

  • Tier 1: Jina Reader API - Fast, reliable, best for most websites
  • Tier 2: Scrapling + Playwright - Dynamic content rendering for JS-heavy sites
  • Tier 3: WebFetch Fallback - Basic extraction for simple pages
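
The tier order above can be pictured as a simple fallback chain. The following is a minimal sketch, not the skill's actual implementation: `fetch_with_fallback`, the `Extractor` type, and the three `fake_*` extractors are hypothetical names introduced here so the example runs without network access.

```python
from typing import Callable, Optional

# Hypothetical extractor type: takes a URL, returns text or None on failure.
Extractor = Callable[[str], Optional[str]]

def fetch_with_fallback(url: str, tiers: list[tuple[str, Extractor]]) -> dict:
    """Try each tier in order and return the first non-empty result."""
    errors = {}
    for name, extract in tiers:
        try:
            content = extract(url)
        except Exception as exc:  # a failing tier must not abort the chain
            errors[name] = str(exc)
            continue
        if content:
            return {"url": url, "content": content,
                    "tier_used": name, "errors": errors}
        errors[name] = "empty result"
    raise RuntimeError(f"all tiers failed for {url}: {errors}")

# Stand-in extractors (assumptions) simulating the three tiers.
def fake_jina(url: str) -> Optional[str]:
    raise TimeoutError("simulated Jina timeout")

def fake_scrapling(url: str) -> Optional[str]:
    return "rendered page text"

def fake_webfetch(url: str) -> Optional[str]:
    return "raw page text"

result = fetch_with_fallback(
    "https://example.com",
    [("jina", fake_jina), ("scrapling", fake_scrapling), ("webfetch", fake_webfetch)],
)
print(result["tier_used"])  # scrapling
```

Because the simulated Jina tier raises, the chain moves on and the second tier supplies the result; the third tier is never called.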

2. Jina Quota Monitoring

  • Tracks API call count with persistent counter
  • Warning alerts when approaching quota limits
  • Automatic fallback to lower-tier methods when quota exhausted
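
A persistent counter of this kind could be as small as a JSON file that is rewritten on every call. The sketch below is an assumption about how such tracking might look: the `QuotaCounter` class, the file name, and the 80% warning ratio are hypothetical, and only the `count`, `limit`, `percentage`, and `warnings` keys mirror what `get_jina_status()` is documented to return.

```python
import json
import tempfile
from pathlib import Path

class QuotaCounter:
    """Persist an API call counter to a JSON file (hypothetical layout)."""

    def __init__(self, path: Path, limit: int = 100000, warn_ratio: float = 0.8):
        self.path = path
        self.limit = limit
        self.warn_ratio = warn_ratio  # assumed threshold, not from the README
        self.count = 0
        if path.exists():
            self.count = json.loads(path.read_text()).get("count", 0)

    def record_call(self) -> None:
        """Increment the counter and write it back to disk."""
        self.count += 1
        self.path.write_text(json.dumps({"count": self.count}))

    def exhausted(self) -> bool:
        """True once the limit is reached; callers would then skip Tier 1."""
        return self.count >= self.limit

    def status(self) -> dict:
        warnings = []
        if self.exhausted():
            warnings.append("quota exhausted")
        elif self.count >= self.warn_ratio * self.limit:
            warnings.append("approaching quota limit")
        return {"count": self.count, "limit": self.limit,
                "percentage": 100.0 * self.count / self.limit,
                "warnings": warnings}

with tempfile.TemporaryDirectory() as tmp:
    quota_file = Path(tmp) / "jina_quota.json"
    counter = QuotaCounter(quota_file, limit=10)
    for _ in range(9):
        counter.record_call()
    status = counter.status()
    # A fresh instance reads the persisted count back from disk.
    reloaded_count = QuotaCounter(quota_file, limit=10).count
    counter.record_call()
    is_exhausted = counter.exhausted()
```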

3. Smart Cache Layer

  • Short-term caching (configurable TTL, default 1 hour)
  • Cache key based on URL hash
  • Reduces redundant API calls
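
To illustrate URL-hash keys with a TTL, here is a small file-based sketch. It assumes SHA-256 for the hash and one JSON file per URL; the README specifies neither, and `cache_key`, `cache_put`, and `cache_get` are hypothetical helper names.

```python
import hashlib
import json
import tempfile
import time
from pathlib import Path
from typing import Optional

def cache_key(url: str) -> str:
    """Derive a filesystem-safe key from the URL (SHA-256 is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def cache_put(cache_dir: Path, url: str, result: dict) -> None:
    """Store the result together with the time it was written."""
    entry = {"stored_at": time.time(), "result": result}
    (cache_dir / f"{cache_key(url)}.json").write_text(json.dumps(entry))

def cache_get(cache_dir: Path, url: str, ttl: float = 3600) -> Optional[dict]:
    """Return the cached result, or None if missing or older than ttl seconds."""
    path = cache_dir / f"{cache_key(url)}.json"
    if not path.exists():
        return None
    entry = json.loads(path.read_text())
    if time.time() - entry["stored_at"] > ttl:
        path.unlink()  # expired: drop the stale entry
        return None
    return entry["result"]

with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    cache_put(cache_dir, "https://example.com", {"title": "Example"})
    hit = cache_get(cache_dir, "https://example.com")
    miss = cache_get(cache_dir, "https://example.org")
    # A negative TTL forces the entry to be treated as expired.
    expired = cache_get(cache_dir, "https://example.com", ttl=-1)
```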

4. Extraction Quality Scoring

  • Scores based on: word count, title detection, content density
  • Minimum quality threshold (default: 200 words + valid title)
  • Auto-escalation to next tier if quality below threshold
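
The README does not give the scoring formula, so the following is only an illustrative sketch of how the three signals might be combined. The 50/30/20 weighting, the `raw_length` parameter, and both function names are assumptions made for this example.

```python
def quality_score(title: str, content: str, raw_length: int) -> int:
    """Return a 0-100 score; the weights are illustrative assumptions."""
    words = len(content.split())
    word_part = min(words / 200, 1.0) * 50       # saturates at 200 words
    title_part = 30 if title.strip() else 0      # any non-empty title counts
    density = len(content) / raw_length if raw_length else 0.0
    density_part = min(density, 1.0) * 20        # share of raw page kept as text
    return round(word_part + title_part + density_part)

def passes_threshold(title: str, content: str, min_words: int = 200) -> bool:
    """Default rule from the README: at least 200 words and a valid title."""
    return bool(title.strip()) and len(content.split()) >= min_words

good = "word " * 250
score = quality_score("Page Title", good, raw_length=len(good) * 2)
print(score)  # 90
print(passes_threshold("Page Title", good))  # True
print(passes_threshold("", good))            # False
```

A result that fails `passes_threshold` would be the trigger for escalating to the next tier.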

5. Domain-Level Routing Learning

  • Learns optimal extraction tier per domain
  • Persists learned routes in local JSON database
  • Adapts based on historical success rates
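
One way to picture per-domain learning is a JSON file of success counts keyed by domain and tier. This is a hedged sketch under that assumption: the `RouteLearner` class, its JSON layout, and the tie-breaking rule are hypothetical and may differ from what the skill stores in `routes.json`.

```python
import json
import tempfile
from pathlib import Path
from urllib.parse import urlparse

TIERS = ["jina", "scrapling", "webfetch"]

class RouteLearner:
    """Track per-domain success counts for each tier (hypothetical schema)."""

    def __init__(self, db_path: Path):
        self.db_path = db_path
        self.stats = json.loads(db_path.read_text()) if db_path.exists() else {}

    def record(self, url: str, tier: str, success: bool) -> None:
        domain = urlparse(url).netloc
        entry = self.stats.setdefault(domain, {}).setdefault(
            tier, {"ok": 0, "total": 0})
        entry["total"] += 1
        entry["ok"] += int(success)
        self.db_path.write_text(json.dumps(self.stats, indent=2))

    def preferred_tier(self, url: str, default: str = "jina") -> str:
        tiers = self.stats.get(urlparse(url).netloc)
        if not tiers:
            return default  # nothing learned yet for this domain

        def rate(tier: str) -> float:
            s = tiers.get(tier)
            return s["ok"] / s["total"] if s else -1.0

        # Highest success rate wins; ties go to the earlier tier in TIERS.
        return max(TIERS, key=rate)

with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "routes.json"
    learner = RouteLearner(db)
    learner.record("https://spa.example/app", "jina", success=False)
    learner.record("https://spa.example/app", "jina", success=False)
    learner.record("https://spa.example/app", "scrapling", success=True)
    learned = learner.preferred_tier("https://spa.example/other")
    unknown = learner.preferred_tier("https://new.example/")
    # A new instance picks up the persisted routes.
    reloaded = RouteLearner(db).preferred_tier("https://spa.example/")
```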

6. Retry with Exponential Backoff

  • Configurable max retries per tier (default: 3)
  • Exponential backoff between retries: 1s, 2s, 4s, 8s, ...
  • Handles rate limits and transient failures by waiting before retrying
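
The retry schedule can be sketched as a small wrapper that doubles the delay after each failed attempt. `retry_with_backoff` and its parameters are hypothetical names; the `sleep` argument is injectable only so the example can record delays instead of actually waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(func: Callable[[], T], max_retries: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func; after each failure wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)

# Simulated flaky call: fails twice, then succeeds.
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"

delays: list[float] = []
value = retry_with_backoff(flaky, sleep=delays.append)
print(value, delays)  # ok [1.0, 2.0]
```

With the default of three retries, a call that never succeeds would wait 1s, 2s, and 4s before the final error is raised.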

Installation

# Install dependencies
pip install -r requirements.txt

# Install Scrapling (requires Node.js)
./scripts/install_scrapling.sh

# Or install Scrapling manually
npm install -g @scrapinghub/scrapling

Usage

Basic Usage

from scripts.web_reader_pro import WebReaderPro

reader = WebReaderPro()
result = reader.fetch("https://example.com")
print(result['title'])
print(result['content'])

Advanced Configuration

reader = WebReaderPro(
    jina_api_key="your-jina-key",      # Optional: set via env JINA_API_KEY
    cache_ttl=3600,                      # Cache TTL in seconds (default: 3600)
    quality_threshold=200,               # Min word count for quality (default: 200)
    max_retries=3,                       # Max retries per tier (default: 3)
    enable_learning=True,                # Enable domain learning (default: True)
    scrapling_path="/usr/local/bin/scrapling"  # Path to scrapling binary
)

Result Format

{
    "title": "Page Title",
    "content": "Extracted content in markdown...",
    "url": "https://example.com",
    "tier_used": "jina|scrapling|webfetch",
    "quality_score": 85,
    "cached": False,
    "domain_learned_tier": "jina",
    "extracted_at": "2024-01-01T00:00:00Z"
}
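
For callers that want type hints, the result dictionary could be described with a `TypedDict`. This is a sketch based only on the fields shown above; the `FetchResult` and `summarize` names are hypothetical, and treating `domain_learned_tier` as optional is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional, TypedDict

class FetchResult(TypedDict):
    title: str
    content: str
    url: str
    tier_used: str                       # "jina", "scrapling", or "webfetch"
    quality_score: int
    cached: bool
    domain_learned_tier: Optional[str]   # assumed to be absent for new domains
    extracted_at: str                    # UTC timestamp, e.g. 2024-01-01T00:00:00Z

def summarize(result: FetchResult) -> str:
    """Build a one-line summary from a result dictionary."""
    when = datetime.strptime(
        result["extracted_at"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    source = "cache" if result["cached"] else result["tier_used"]
    return f"{result['title']} ({source}, score {result['quality_score']}, {when:%Y-%m-%d})"

example: FetchResult = {
    "title": "Page Title",
    "content": "Extracted content in markdown...",
    "url": "https://example.com",
    "tier_used": "jina",
    "quality_score": 85,
    "cached": False,
    "domain_learned_tier": "jina",
    "extracted_at": "2024-01-01T00:00:00Z",
}
summary = summarize(example)
print(summary)  # Page Title (jina, score 85, 2024-01-01)
```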

Environment Variables

Variable | Description | Default
---------------------------------------------------------------
JINA_API_KEY | Jina Reader API key | Required for Tier 1
WEB_READER_CACHE_DIR | Cache directory path | ~/.openclaw/cache/web-reader-pro/
WEB_READER_LEARNING_DB | Learning database path | ~/.openclaw/data/web-reader-pro/routes.json
WEB_READER_JINA_QUOTA | Jina quota limit | 100000
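
As a sketch of how these variables and their documented defaults might be resolved, the helper below reads them from a mapping. `load_settings` and the returned key names are hypothetical; only the variable names and default values come from the list above.

```python
import os
from pathlib import Path
from typing import Mapping

def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve settings from environment variables, applying documented defaults."""
    cache_dir = env.get("WEB_READER_CACHE_DIR", "~/.openclaw/cache/web-reader-pro/")
    learning_db = env.get("WEB_READER_LEARNING_DB",
                          "~/.openclaw/data/web-reader-pro/routes.json")
    return {
        "jina_api_key": env.get("JINA_API_KEY"),  # None means Tier 1 has no key
        "cache_dir": Path(os.path.expanduser(cache_dir)),
        "learning_db": Path(os.path.expanduser(learning_db)),
        "jina_quota": int(env.get("WEB_READER_JINA_QUOTA", "100000")),
    }

# Passing a plain dict instead of os.environ keeps the example deterministic.
settings = load_settings({"WEB_READER_JINA_QUOTA": "500"})
print(settings["jina_quota"])        # 500
print(settings["learning_db"].name)  # routes.json
```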

API Reference

WebReaderPro.fetch(url, force_refresh=False)

Fetch and extract content from a URL.

Parameters:

  • url (str): Target URL
  • force_refresh (bool): Bypass cache if True

Returns: Dict with title, content, and metadata (see Result Format)

WebReaderPro.fetch_with_tier(url, preferred_tier)

Fetch using a specific tier (bypassing automatic selection).

Parameters:

  • url (str): Target URL
  • preferred_tier (str): "jina", "scrapling", or "webfetch"

WebReaderPro.get_jina_status()

Get current Jina API quota usage.

Returns: Dict with count, limit, percentage, warnings

WebReaderPro.clear_cache(url=None)

Clear the cache for a specific URL, or for all URLs.

Parameters:

  • url (str, optional): Specific URL to clear, or None for all

WebReaderPro.get_domain_routes()

Get learned domain-to-tier mappings.

Returns: Dict of domain -> preferred tier

Tier Comparison

Tier | Speed | JS Rendering | Best For | Cost
---------------------------------------------------------------
Jina | Fast | No | Static pages, articles | API calls
Scrapling | Medium | Yes | SPAs, dynamic content | CPU
WebFetch | Fastest | No | Simple pages, fallbacks | Free

License

MIT

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-07 07:03, security status: safe

Security Checks

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk