Browser Automation

Web scraping and browser automation using Puppeteer. Use when the user wants to extract data from websites, crawl pages, or scrape dynamic content rendered by JavaScript.
Author: fasjdas. Source: clawhub. Category: uncategorized. Version: v1.0.0. Key: not required.

Overview

Web scraping and browser automation powered by Puppeteer.

When to Use

USE this skill when:

  • "Scrape data from [URL]"
  • "Extract all [products/listings/items] from [website]"
  • "Take a screenshot of [page]"
  • "Crawl [website] and collect [info]"
  • "Fill and submit [form]"
  • Any JavaScript-rendered content that won't load without a browser

DON'T use this skill when:

  • Simple static pages → use web_fetch instead
  • APIs available → fetch API directly
  • Rate-limited sites or sites that disallow crawling → respect robots.txt and the site's limits

Quick Start

# Install Puppeteer
npm install puppeteer

# Basic scraping
node scripts/scrape.js https://example.com

Core Patterns

Launch Browser

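const puppeteer = require('puppeteer');

async function scrape(url) {
  const browser = await puppeteer.launch({
    // 'new' selects the new headless mode; in Puppeteer v22 and later, `headless: true` selects it
    headless: 'new',
    // Disabling the sandbox is usually only needed in containers or when running as root
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });

    // ... extract data ...
  } finally {
    // Close the browser even if navigation or extraction throws
    await browser.close();
  }
}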

Extract Text Content

// Get the text of every element matching a selector (empty array if none match)
const titles = await page.$$eval('h2', els => els.map(el => el.textContent.trim()));

// Get the text of a single element ($eval throws if nothing matches)
const price = await page.$eval('.price', el => el.textContent.trim());

Extract HTML

const html = await page.$eval('.product-list', el => el.innerHTML);

Extract Attributes

const links = await page.$$eval('a', els => els.map(el => ({
  text: el.textContent.trim(),
  href: el.getAttribute('href')
})));
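
Because `getAttribute('href')` returns the attribute exactly as written in the markup, relative links stay relative. The following is a minimal sketch of resolving them against the page URL with the standard `URL` class. The helper name `toAbsoluteLinks` and the sample data are hypothetical and not part of Puppeteer.

```javascript
// Hypothetical helper: resolve scraped hrefs against the URL of the page
// they were collected from. Entries without a usable href are dropped.
function toAbsoluteLinks(links, baseUrl) {
  const resolved = [];
  for (const { text, href } of links) {
    if (!href) continue; // <a> elements without an href attribute
    try {
      resolved.push({ text, href: new URL(href, baseUrl).href });
    } catch {
      // Skip values that cannot be parsed as a URL
    }
  }
  return resolved;
}

// Sample input in the same shape the $$eval call above returns
const sampleLinks = [
  { text: 'Docs', href: '/docs/intro' },
  { text: 'Home', href: 'https://example.com/' },
  { text: 'No target', href: null },
];

console.log(toAbsoluteLinks(sampleLinks, 'https://example.com/products/list'));
```

Inside a scrape, `page.url()` is a natural choice for `baseUrl`.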

Wait for Content

// Wait for selector
await page.waitForSelector('.results', { timeout: 10000 });

// Wait for network idle
await page.goto(url, { waitUntil: 'networkidle2' });

// Wait for function
await page.waitForFunction(() => document.querySelectorAll('.item').length > 10);
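
Each of these waits rejects when its timeout expires, so optional content needs a guard. The following is a minimal sketch that turns a timeout into a boolean. The helper name `appeared` is hypothetical, and the sketch assumes Puppeteer's timeout errors carry the name `TimeoutError`.

```javascript
// Hypothetical helper: wait for a selector and report whether it showed up.
// `page` is expected to be a Puppeteer Page.
async function appeared(page, selector, timeout = 10000) {
  try {
    await page.waitForSelector(selector, { timeout });
    return true;
  } catch (err) {
    // Assumption: Puppeteer timeouts are reported with err.name === 'TimeoutError'
    if (err.name === 'TimeoutError') return false;
    throw err; // anything else (for example a closed page) is a real failure
  }
}

// Usage inside an async function:
//   if (!(await appeared(page, '.results'))) {
//     console.error('No results rendered');
//   }
```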

Pagination

async function scrapeWithPagination(baseUrl, maxPages = 5) {
  const browser = await puppeteer.launch({ headless: 'new' });
  const results = [];

  try {
    const page = await browser.newPage();

    for (let i = 1; i <= maxPages; i++) {
      // Assumes baseUrl has no query string of its own
      const url = `${baseUrl}?page=${i}`;
      await page.goto(url, { waitUntil: 'networkidle2' });

      const items = await page.$$eval('.item', els =>
        els.map(el => el.textContent.trim())
      );

      // Stop at the first page that yields no items
      if (items.length === 0) break;
      results.push(...items);
    }
  } finally {
    // Close the browser even if a page fails to load
    await browser.close();
  }

  return results;
}
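
The loop above appends `?page=N` directly, which produces a malformed URL when `baseUrl` already carries a query string. The following is a minimal sketch of building the page URL with the standard `URL` class instead. The helper name `buildPageUrl` and the `param` argument are hypothetical.

```javascript
// Hypothetical helper: set the page parameter without clobbering
// any query parameters already present on the base URL.
function buildPageUrl(baseUrl, pageNumber, param = 'page') {
  const url = new URL(baseUrl);
  url.searchParams.set(param, String(pageNumber)); // replaces an existing value
  return url.href;
}

console.log(buildPageUrl('https://example.com/list', 3));
// → https://example.com/list?page=3
console.log(buildPageUrl('https://example.com/search?q=shoes', 2));
// → https://example.com/search?q=shoes&page=2
```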

Screenshots

// Full page screenshot
await page.screenshot({ path: 'screenshot.png', fullPage: true });

// Element screenshot (page.$ resolves to null when nothing matches)
const element = await page.$('.chart');
if (element) {
  await element.screenshot({ path: 'chart.png' });
}

Block Resources (Speed Up)

await page.setRequestInterception(true);
page.on('request', req => {
  if (['image', 'stylesheet', 'font'].includes(req.resourceType())) {
    req.abort();
  } else {
    req.continue();
  }
});
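
When different pages need different block lists, the handler can be built by a small factory. The following is a sketch under that assumption. The name `makeRequestFilter` is hypothetical, and the default list simply repeats the three resource types used above.

```javascript
// Hypothetical factory: returns a 'request' handler that aborts the given
// resource types and lets everything else through.
function makeRequestFilter(blockedTypes = ['image', 'stylesheet', 'font']) {
  const blocked = new Set(blockedTypes);
  return (req) => {
    // Skip requests that another handler has already resolved
    if (req.isInterceptResolutionHandled()) return;
    if (blocked.has(req.resourceType())) {
      return req.abort();
    }
    return req.continue();
  };
}

// Usage inside an async function, with `page` being a Puppeteer Page:
//   await page.setRequestInterception(true);
//   page.on('request', makeRequestFilter(['image', 'font']));
```

Blocking stylesheets changes page layout, so leave them out of the list when the goal is a screenshot.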

Scripts

scrape.js — Basic Scraping

// Usage: node scripts/scrape.js <url> [selector]
const puppeteer = require('puppeteer');

const url = process.argv[2];
const selector = process.argv[3] || 'body';

if (!url) {
  console.error('Usage: node scripts/scrape.js <url> [selector]');
  process.exit(1);
}

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });

    // Collect the trimmed text of every element matching the selector
    const content = await page.$$eval(selector, els =>
      els.map(el => el.textContent.trim())
    );

    console.log(JSON.stringify(content, null, 2));
  } finally {
    // Close the browser even if navigation or extraction throws
    await browser.close();
  }
})().catch(err => {
  console.error(err);
  process.exit(1);
});

screenshot.js — Page Screenshots

// Usage: node scripts/screenshot.js <url> [output.png]
const puppeteer = require('puppeteer');

const url = process.argv[2];
const output = process.argv[3] || 'screenshot.png';

if (!url) {
  console.error('Usage: node scripts/screenshot.js <url> [output.png]');
  process.exit(1);
}

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });

    // Capture the whole scrollable page, not just the viewport
    await page.screenshot({ path: output, fullPage: true });
    console.log(`Screenshot saved to ${output}`);
  } finally {
    // Close the browser even if navigation or capture throws
    await browser.close();
  }
})().catch(err => {
  console.error(err);
  process.exit(1);
});

crawl.js — Multi-Page Crawler

// Usage: node scripts/crawl.js <url> <selector> [maxPages]
const puppeteer = require('puppeteer');

const url = process.argv[2];
const selector = process.argv[3];
const maxPages = parseInt(process.argv[4], 10) || 10;

if (!url || !selector) {
  console.error('Usage: node scripts/crawl.js <url> <selector> [maxPages]');
  process.exit(1);
}

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const allData = [];

  try {
    const page = await browser.newPage();

    for (let i = 1; i <= maxPages; i++) {
      // Append the page parameter with & if the URL already has a query string
      const pageUrl = url.includes('?') ? `${url}&page=${i}` : `${url}?page=${i}`;
      // Progress goes to stderr so stdout contains only the JSON result
      console.error(`Crawling: ${pageUrl}`);

      await page.goto(pageUrl, { waitUntil: 'networkidle2' });

      const data = await page.$$eval(selector, els =>
        els.map(el => el.textContent.trim())
      );

      // Stop at the first page that yields no matches
      if (data.length === 0) break;
      allData.push(...data);
    }

    console.log(JSON.stringify(allData, null, 2));
  } finally {
    // Close the browser even if a page fails to load
    await browser.close();
  }
})().catch(err => {
  console.error(err);
  process.exit(1);
});
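
The crawler requests pages back to back. The following is a minimal sketch of the same loop with a pause between requests, written against an already opened page so it can be reused. The names `sleep` and `collectFromPages` are hypothetical, and the 2000 ms default is an assumption taken from the delay suggested under Tips.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: visit each URL in order, pausing between requests.
// `page` is expected to be a Puppeteer Page.
async function collectFromPages(page, urls, selector, delayMs = 2000) {
  const all = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no pause before the first request
    await page.goto(urls[i], { waitUntil: 'networkidle2' });
    const items = await page.$$eval(selector, (els) =>
      els.map((el) => el.textContent.trim())
    );
    if (items.length === 0) break; // same stop condition as crawl.js
    all.push(...items);
  }
  return all;
}
```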

Common Selectors

| Target | Selector |
|--------|----------|
| All links | a |
| All images | img |
| Headings | h1, h2, h3 |
| Lists | ul li, ol li |
| Tables | table tr |
| Cards/Items | .item, .card, .product |
| Prices | .price, [class*="price"] |
| Descriptions | .description, .summary |
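
As an illustration of the `table tr` selector, the following sketch turns table rows into arrays of cell text. The names `rowsToCells` and `extractTable` are hypothetical. The mapping function runs inside the page, so it must not reference variables from the Node side.

```javascript
// Runs in the browser via $$eval: one array of trimmed cell texts per row.
function rowsToCells(rows) {
  return rows.map((row) =>
    Array.from(row.querySelectorAll('th, td'), (cell) => cell.textContent.trim())
  );
}

// Hypothetical helper: `page` is expected to be a Puppeteer Page.
async function extractTable(page, selector = 'table tr') {
  return page.$$eval(selector, rowsToCells);
}

// Usage inside an async function:
//   const rows = await extractTable(page);
//   const [header, ...body] = rows;
```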

Tips

  • Check robots.txt before scraping: curl example.com/robots.txt
  • Add delays between requests to avoid bans: await new Promise(r => setTimeout(r, 2000))
  • Use networkidle2 for SPAs (single-page apps); it waits until there have been no more than 2 network connections for at least 500 ms
  • Debug with screenshots when selectors fail
  • Set user agent for sites that block bots:

```javascript
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36');
```
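
The screenshot tip can be automated. The following is a minimal sketch that saves a screenshot whenever a single-element extraction fails. The helper name `evalWithDebugShot` and the default file name `debug.png` are hypothetical. It uses `page.$eval` because that call throws when nothing matches the selector.

```javascript
// Hypothetical helper: run an extraction and capture the page if it throws.
// `page` is expected to be a Puppeteer Page.
async function evalWithDebugShot(page, selector, fn, shotPath = 'debug.png') {
  try {
    return await page.$eval(selector, fn);
  } catch (err) {
    // Best effort: a failed screenshot must not hide the original error
    await page.screenshot({ path: shotPath, fullPage: true }).catch(() => {});
    throw err;
  }
}

// Usage inside an async function:
//   const price = await evalWithDebugShot(page, '.price', el => el.textContent.trim());
```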

Reference

For the detailed Puppeteer API, see puppeteer/docs/api.md.

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-20 05:27

