Phoenix Scraper

Resilient multi-layer web scraper with automatic failover. Use when scraping web content that may be JavaScript-rendered, behind bot protection, or blocked by anti-scraping measures.
Author: stevojarvisai-star
Uncategorized · clawhub · v1.0.0 · 1 version · 100000 · Key: required
Stars: 0 · Downloads: 368 · Installs: 0
Tags: #brightdata #failover #latest #playwright #scraping

Overview

Resilient three-tier failover scraper. If one method fails, the next activates automatically; an empty result is returned only when all three tiers fail.

Failover Chain

Tier 1: Brave Search API (fast, free tier, 2k req/month)
    ↓ (on block/empty/timeout)
Tier 2: Bright Data Web Unlocker (residential proxy, JS-render optional)
    ↓ (on block/429/timeout)
Tier 3: Playwright headless browser (full JS execution)
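
The chain above can be sketched as a loop over tiers that escalates on an exception or an empty response. This is a minimal illustration, not the code in phoenix_scraper.py: the function name scrape_with_failover and the three stand-in fetchers are hypothetical, and joining the collected errors into the "error" field is an assumption.

```python
from typing import Callable, Optional

# A fetcher takes a URL and returns HTML, or None/"" when blocked or empty.
Fetcher = Callable[[str], Optional[str]]


def scrape_with_failover(url: str, tiers: list[tuple[str, Fetcher]]) -> dict:
    """Try each tier in order and return the first non-empty result."""
    errors = []
    for name, fetch in tiers:
        try:
            html = fetch(url)
        except Exception as exc:  # timeouts, HTTP errors, blocks
            errors.append(f"{name}: {exc}")
            continue
        if html:
            return {"success": True, "html": html, "method": name, "error": ""}
        errors.append(f"{name}: empty response")  # empty also escalates
    return {"success": False, "html": "", "method": "all_failed",
            "error": "; ".join(errors)}


# Stand-in fetchers for illustration only (hypothetical, no network calls).
def tier1_blocked(url: str) -> str:
    raise TimeoutError("timed out")


def tier2_empty(url: str) -> str:
    return ""


def tier3_ok(url: str) -> str:
    return "<html>ok</html>"


result = scrape_with_failover(
    "https://example.com/page",
    [("brave", tier1_blocked), ("brightdata", tier2_empty), ("playwright", tier3_ok)],
)
print(result["method"])
```

Here the first tier times out and the second returns nothing, so the result comes from the third tier.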

Quick Start

from scripts.phoenix_scraper import scrape

# Basic fetch
result = scrape("https://example.com/page")

# With JS rendering (for SPA/dynamic sites)
result = scrape("https://example.com/page", render_js=True)

# With specific Bright Data zone
result = scrape("https://linkedin.com/jobs/...", zone="job_search_scraper")

Zone Routing

Use Case                                       | Zone
-----------------------------------------------|--------------------
Job boards (LinkedIn, Glassdoor, Reed, Indeed) | job_search_scraper
Social media, news, general web                | web_unlocker
X.com / Twitter                                | Use X API v2 (see references/x-api.md)
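
The routing table could be applied automatically by matching the URL's hostname. The sketch below is one possible approach: pick_zone is a hypothetical helper, and the domain lists are assumptions based on the site names in the table, not values taken from the scraper itself.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumed domains for the sites named in the routing table.
JOB_BOARD_DOMAINS = ("linkedin.com", "glassdoor.com", "glassdoor.co.uk",
                     "reed.co.uk", "indeed.com")
X_DOMAINS = ("x.com", "twitter.com")


def _matches(host: str, domains: tuple) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def pick_zone(url: str) -> Optional[str]:
    """Return the Bright Data zone for a URL, or None for X.com (use X API v2)."""
    host = (urlparse(url).hostname or "").lower()
    if _matches(host, X_DOMAINS):
        return None
    if _matches(host, JOB_BOARD_DOMAINS):
        return "job_search_scraper"
    return "web_unlocker"
```

A None return signals that the URL should go to the X API path rather than any scraping tier.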

Bright Data render_js

Set render_js=True for JS-heavy sites (CWJobs, TotalJobs, ContractorUK). This adds "render": True (a boolean) to the request payload and raises the timeout to 60s.

Critical: Use boolean True, not string "html" — Bright Data validation rejects strings.
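
To make the boolean requirement concrete, here is a sketch of how the payload might be assembled. The helper build_unlocker_payload is hypothetical, the "zone", "url", and "format" fields are assumed from typical Web Unlocker requests, and the 30s default timeout is an assumption (the text above only specifies 60s for rendering).

```python
def build_unlocker_payload(url: str, zone: str, render_js: bool = False):
    """Return (payload, timeout_seconds) for a Bright Data request."""
    if not isinstance(render_js, bool):
        # Catch values like "html" locally, before the API rejects them.
        raise TypeError("render_js must be a bool, not " + type(render_js).__name__)

    payload = {"zone": zone, "url": url, "format": "raw"}  # assumed base fields
    timeout = 30  # assumed default
    if render_js:
        payload["render"] = True  # boolean True, never a string
        timeout = 60  # rendering is slower, so allow more time
    return payload, timeout


payload, timeout = build_unlocker_payload(
    "https://example.com/page", "web_unlocker", render_js=True
)
```

Validating the type locally surfaces the mistake as a clear error rather than a rejected request.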

Bright Data Premium Domains (Cost Note)

LinkedIn, Glassdoor, and other heavily-protected job boards may be classified as Premium Domains in your Bright Data zone (updated quarterly). API call syntax is identical — but cost per request is higher. Check your zone's Premium Domains list if costs spike unexpectedly.

Playwright Stealth (2026 Enhancement)

For Tier 3, consider installing playwright-stealth to patch headless browser fingerprints — reduces detection on Cloudflare/advanced bot-protected sites:

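# Install from a shell first: pip install playwright-stealth
# Optional enhancement in phoenix_scraper.py Tier 3:
from playwright_stealth import stealth_sync  # playwright-stealth 1.x API; check your installed version
stealth_sync(page)  # patch the Playwright page before navigating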

The base Playwright tier works without this, but as of 2026 stealth patching can improve success rates on heavily protected sites (Coupang, Naver, etc.).

URL Formatting

  • CWJobs/TotalJobs: use hyphen-slugs — finance-systems-consultant NOT finance+systems+consultant
  • Glassdoor: https://www.glassdoor.co.uk/Job/united-kingdom-{slug}-jobs-SRCH_IL.0,14_IN2_KO15,{end}.htm
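
A small helper can normalise a search phrase into the hyphen-slug form that CWJobs and TotalJobs expect. The name to_hyphen_slug is hypothetical, and dropping all punctuation is an assumption about what those sites accept.

```python
import re


def to_hyphen_slug(query: str) -> str:
    """Lowercase the query and join its alphanumeric words with hyphens."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return "-".join(words)


print(to_hyphen_slug("finance+systems+consultant"))
```

The same function handles space-separated input such as "Finance Systems Consultant".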

Environment Variables

BRIGHT_DATA_API_KEY=<key>          # Bright Data API key
BRIGHT_DATA_ZONE=job_search_scraper # Default zone (override per-call)
BRAVE_API_KEY=<key>                # Brave Search API key
X_BEARER_TOKEN=<token>             # X API v2 bearer token (for X.com)
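
One way to read these variables at startup is sketched below. The helper load_config is hypothetical; treating the Bright Data and Brave keys as required, the X token as optional, and job_search_scraper as the fallback zone are all assumptions.

```python
import os
from typing import Mapping, Optional

REQUIRED = ("BRIGHT_DATA_API_KEY", "BRAVE_API_KEY")  # assumed to be mandatory


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read scraper settings from the environment (or a supplied mapping)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "bright_data_api_key": env["BRIGHT_DATA_API_KEY"],
        "bright_data_zone": env.get("BRIGHT_DATA_ZONE", "job_search_scraper"),
        "brave_api_key": env["BRAVE_API_KEY"],
        "x_bearer_token": env.get("X_BEARER_TOKEN"),  # only needed for X.com
    }
```

Passing a plain dict instead of os.environ keeps the function easy to test.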

X.com Monitoring

For X/Twitter, use X API v2 (not scraping). See references/x-api.md for endpoint details and rate limits.

Error Handling

All tiers log failures before escalating. On total failure, returns {"success": False, "html": "", "method": "all_failed", "error": ""}.

Never raises exceptions — always returns a result dict.
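
Because failures come back as a dict rather than an exception, callers need to check "success" explicitly. A minimal sketch of that check, using only the keys shown above (describe_result is a hypothetical helper):

```python
def describe_result(result: dict) -> str:
    """Summarise a scrape result dict without assuming it succeeded."""
    if result.get("success"):
        return f"fetched {len(result['html'])} chars via {result['method']}"
    detail = result.get("error") or "no error detail"
    return f"scrape failed ({result.get('method')}): {detail}"


failure = {"success": False, "html": "", "method": "all_failed", "error": ""}
print(describe_result(failure))
```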

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-07 11:38 · Safe · Safe

Security Checks

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
