← 返回
内容创作 中文

Crawl4ai

AI-powered web scraping framework for extracting structured data from websites. Use when Codex needs to crawl, scrape, or extract data from web pages using AI-powered parsing, handle dynamic content, or work with complex HTML structures.
AI驱动的网页抓取框架,用于从网站提取结构化数据。使用场景:使用AI解析从网页提取数据、处理动态内容及复杂HTML结构。
codylrn804
内容创作 clawhub v1.0.0 1 版本 99466.9 Key: 无需
★ 2
Stars
📥 3,132
下载
💾 202
安装
1
版本
#latest

概述

Crawl4ai

Overview

Crawl4ai is an AI-powered web scraping framework designed to extract structured data from websites efficiently. It combines traditional HTML parsing with AI to handle dynamic content, extract text intelligently, and clean and structure data from complex web pages.

When to Use This Skill

Use when Codex needs to:

  • Extract structured data from web pages (products, articles, forms, tables, etc.)
  • Scrape websites with dynamic content or complex JavaScript
  • Clean and normalize extracted data from various HTML structures
  • Work with APIs or web services that return HTML
  • Handle CORS limitations by scraping directly
  • Process web content at scale with reliability

Trigger phrases:

  • "Extract data from this website"
  • "Scrape this page for [specific data]"
  • "Parse this HTML"
  • "Get data from [URL]"
  • "Extract structured information from [website]"
  • "Scrape [website] for [data type]"
  • "Web scrape [URL]"

Quick Start

Basic Usage

from crawl4ai import AsyncWebCrawler, BrowserMode

async def scrape_page(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            browser_mode=BrowserMode.LATEST,
            headless=True
        )
        return result.markdown, result.clean_html

Extracting Structured Data

from crawl4ai import AsyncWebCrawler, JsonModeScreener
import json

async def extract_products(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            screenshot=True,
            javascript=True,
            bypass_cache=True
        )
        # Extract product data
        products = []
        for item in result.extracted_content:
            if item['type'] == 'product':
                products.append({
                    'name': item['name'],
                    'price': item['price'],
                    'url': item['url']
                })
        return products

Common Tasks

Web Scraping Basics

Scenario: User wants to scrape a website for all article titles.

from crawl4ai import AsyncWebCrawler

async def scrape_articles(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            javascript=True,
            verbose=True
        )
        # Extract article titles from HTML
        articles = result.extracted_content if result.extracted_content else []
        titles = [item.get('name', item.get('text', '')) for item in articles]
        return titles

Trigger: "Scrape this site for article titles" or "Get all titles from [URL]"

Dynamic Content Handling

Scenario: Website loads data via JavaScript.

from crawl4ai import AsyncWebCrawler

async def scrape_dynamic_site(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            javascript=True,  # Wait for JS execution
            wait_for="body",   # Wait for specific element
            delay=1.5,         # Wait time after load
            headless=True
        )
        return result.markdown

Trigger: "Scrape this dynamic website" or "This page needs JavaScript to load data"

Structured Data Extraction

Scenario: Extract specific fields like prices, descriptions, etc.

from crawl4ai import AsyncWebCrawler

async def extract_product_details(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            screenshot=True,
            js_code="""
                const products = document.querySelectorAll('.product');
                return Array.from(products).map(p => ({
                    name: p.querySelector('.name')?.textContent,
                    price: p.querySelector('.price')?.textContent,
                    url: p.querySelector('a')?.href
                }));
            """
        )
        return result.extracted_content

Trigger: "Extract product details from this page" or "Get price and name from [URL]"

HTML Cleaning and Parsing

Scenario: Clean messy HTML and extract clean text.

from crawl4ai import AsyncWebCrawler

async def clean_and_parse(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            remove_tags=['script', 'style', 'nav', 'footer', 'header'],
            only_main_content=True
        )
        # Clean and return markdown
        clean_text = result.clean_html
        return clean_text

Trigger: "Clean this HTML" or "Extract main content from this page"

Advanced Features

Custom JavaScript Injection

async def custom_scrape(url, custom_js):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            js_code=custom_js,
            js_only=True  # Only execute JS, don't download resources
        )
        return result.extracted_content

Session Management

from crawl4ai import AsyncWebCrawler

async def multi_page_scrape(base_url, urls):
    async with AsyncWebCrawler() as crawler:
        results = []
        for url in urls:
            result = await crawler.arun(
                url=url,
                session_id=f"session_{url}",
                bypass_cache=True
            )
            results.append({
                'url': url,
                'content': result.markdown,
                'status': result.success
            })
        return results

Best Practices

  1. Always check if the site allows scraping - Respect robots.txt and terms of service
  2. Use appropriate delays - Add delays between requests to avoid overwhelming servers
  3. Handle errors gracefully - Implement retry logic and error handling
  4. Be selective with data - Extract only what you need, don't dump entire pages
  5. Store data reliably - Save extracted data in structured formats (JSON, CSV)
  6. Clean URLs - Handle redirects and malformed URLs

Error Handling

async def robust_scrape(url):
    try:
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(
                url=url,
                timeout=30000  # 30 seconds timeout
            )
            if result.success:
                return result.markdown, result.extracted_content
            else:
                print(f"Scraping failed: {result.error_message}")
                return None, None
    except Exception as e:
        print(f"Scraping error: {str(e)}")
        return None, None

Output Formats

Crawl4ai supports multiple output formats:

  • Markdown: Clean, readable text (result.markdown)
  • Clean HTML: Structured, cleaned HTML (result.clean_html)
  • Extracted Content: Structured JSON data (result.extracted_content)
  • Screenshot: Visual representation (result.screenshot)
  • Links: All links found on page (result.links)

Resources

scripts/

Python scripts for common crawling operations:

  • scrape_single_page.py - Basic scraping utility
  • scrape_multiple_pages.py - Batch scraping with pagination
  • extract_from_html.py - HTML parsing helper
  • clean_html.py - HTML cleaning utility

references/

Documentation and examples:

  • api_reference.md - Complete API documentation
  • examples.md - Common use cases and patterns
  • error_handling.md - Troubleshooting guide

版本历史

共 1 个版本

  • v1.0.0 当前
    2026-03-28 17:49 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

content-creation

AdMapix

fly0pants
广告情报与应用数据分析助手,支持搜索广告素材、分析应用排名、下载量、收入及市场洞察,用于广告素材和竞品分析。
★ 294 📥 136,401
content-creation

Humanizer

biostartechnology
消除AI写作痕迹,使文本更自然真实。基于维基百科"AI写作特征"指南,识别并修正夸张象征、宣传用语、肤浅-ing分析、模糊归因、破折号滥用、三项排比、AI词汇、负面平行结构及冗长连接词等模式。
★ 857 📥 199,258
content-creation

Baidu Wenku AIPPT

ide-rea
使用百度文库 AI 智能生成 PPT,自动根据内容选择模板。
★ 66 📥 46,131