A universal web crawler optimized for BBC News that can also crawl other sites. It integrates Crawl4AI and Playwright to handle dynamic content and anti-bot protections.
crawl4ai: Primary method using AsyncWebCrawler for high performance and accuracy.
playwright: Full browser rendering fallback for complex dynamic pages.
requests: Fast fallback for static content.
auto: Automatically detects the best method (prioritizes Crawl4AI).
YYYY-MM-DD/Category/Title.md.
requirements.txt for Python packages.
# 1. Install dependencies
# Note: install.py supports passing arguments to pip, e.g., --break-system-packages
python install.py
# Example for environments requiring --break-system-packages:
python install.py --break-system-packages
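The note about passing arguments through to pip can be illustrated with a small sketch. This is not the actual contents of install.py; the function name `build_pip_command` is hypothetical, and the assumption is that extra command-line flags are appended unchanged to a `pip install -r requirements.txt` call.

```python
import sys
from typing import List, Optional


def build_pip_command(
    extra_args: Optional[List[str]] = None,
    requirements: str = "requirements.txt",
) -> List[str]:
    """Build a pip invocation, appending any extra flags unchanged.

    Hypothetical helper: the real install.py may be structured differently.
    """
    # Use the current interpreter so packages land in the same environment.
    command = [sys.executable, "-m", "pip", "install", "-r", requirements]
    command.extend(extra_args or [])
    return command


# Show the command that would run for the --break-system-packages example.
print(" ".join(build_pip_command(["--break-system-packages"])))
```

Running the resulting list with `subprocess.check_call` would perform the installation; the sketch only builds and prints the command so it has no side effects.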
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --max-pages 50
# Force Crawl4AI
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --method crawl4ai
# Force Playwright
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --method playwright
# Control depth and delay
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --depth 3 --delay 2.5
# Specify output directory
python universal_crawler_v2.py --url https://www.bbc.co.uk/news --output ./my_data
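The relationship between `--method auto` and the forced methods can be sketched as a fallback chain. This is a minimal illustration, not the code in universal_crawler_v2.py: the names `resolve_methods` and `fetch_with_fallback` are hypothetical, and the order after Crawl4AI (Playwright, then requests) is an assumption, since the description only states that auto prioritizes Crawl4AI. The fetchers are stand-in callables so the example runs without a network or any third-party package.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumed priority order for "auto"; only "Crawl4AI first" comes from the README.
METHOD_ORDER = ["crawl4ai", "playwright", "requests"]

Fetcher = Callable[[str], Optional[str]]


def resolve_methods(method: str = "auto") -> List[str]:
    """Return the methods to try, in order, for a given --method value."""
    if method == "auto":
        return list(METHOD_ORDER)
    if method not in METHOD_ORDER:
        raise ValueError(f"unknown method: {method}")
    return [method]


def fetch_with_fallback(
    url: str, fetchers: Dict[str, Fetcher], method: str = "auto"
) -> Tuple[str, str]:
    """Try each candidate method until one returns non-empty content."""
    errors = []
    for name in resolve_methods(method):
        fetcher = fetchers.get(name)
        if fetcher is None:
            continue  # this backend is not installed or not configured
        try:
            html = fetcher(url)
        except Exception as exc:  # any backend failure triggers the next method
            errors.append(f"{name}: {exc}")
            continue
        if html:
            return name, html
        errors.append(f"{name}: empty response")
    raise RuntimeError("all methods failed: " + "; ".join(errors))


def failing_fetcher(url: str) -> Optional[str]:
    """Stand-in for a backend that is blocked or crashes."""
    raise TimeoutError("blocked")


# Stand-in fetchers: the first one fails, so "auto" falls through to the next.
demo_fetchers: Dict[str, Fetcher] = {
    "crawl4ai": failing_fetcher,
    "playwright": lambda url: "<html>rendered</html>",
    "requests": lambda url: "<html>static</html>",
}
used, content = fetch_with_fallback("https://www.bbc.co.uk/news", demo_fetchers)
print(used, len(content))
```

Forcing a method, as in the `--method crawl4ai` and `--method playwright` commands, corresponds here to a single-element candidate list, so a failure is reported instead of silently falling back.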
python install.py again.
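The YYYY-MM-DD/Category/Title.md layout mentioned earlier can be sketched as a path-building helper. The function name `article_path` and the filename sanitizing rules are assumptions made for illustration; the crawler's real naming logic is not shown in this document.

```python
import re
from datetime import date
from pathlib import Path


def article_path(output_dir: str, published: date, category: str, title: str) -> Path:
    """Build <output_dir>/YYYY-MM-DD/Category/Title.md (hypothetical helper)."""

    def clean(part: str) -> str:
        # Replace characters that are unsafe in file names, then collapse spaces.
        part = re.sub(r'[\\/:*?"<>|]+', " ", part)
        part = re.sub(r"\s+", " ", part).strip()
        return part or "untitled"

    return Path(output_dir) / published.isoformat() / clean(category) / f"{clean(title)}.md"


# Example with the ./my_data directory used in the --output command.
print(article_path("./my_data", date(2024, 1, 5), "World", "A/B: test?"))
```

Calling `path.parent.mkdir(parents=True, exist_ok=True)` before writing would create the date and category folders on demand.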