This skill helps Openclaw scrape and extract data from websites using two APIs: Firecrawl and Apify. The table below maps common use cases to the recommended tool.

| Use Case | Recommended Tool |
|---|---|
| Scrape a single page into markdown/JSON | Firecrawl /scrape |
| Crawl an entire website (follow links) | Firecrawl /crawl |
| Map all URLs on a site | Firecrawl /map |
| Search web + scrape results | Firecrawl /search |
| Scrape Instagram / TikTok / Twitter | Apify (social actors) |
| Scrape Google Maps / reviews | Apify (compass/crawler-google-places) |
| Scrape Amazon products | Apify (apify/amazon-scraper) |
| Scrape Google Search results | Apify (apify/google-search-scraper) |
| Custom actor / any Apify Store actor | Apify |
Both APIs require an API key, normally passed in the Authorization header. If the user has not provided a key, always ask for it.
Firecrawl: Authorization: Bearer fc-YOUR_API_KEY
Apify: Authorization: Bearer YOUR_APIFY_TOKEN (or ?token=YOUR_TOKEN in URL)
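
As a minimal sketch of the two header schemes above, the helpers below build the request headers in Python. The function names are hypothetical, and the check for a missing key simply mirrors the rule to ask the user rather than guess.

```python
def firecrawl_headers(api_key: str) -> dict[str, str]:
    """Build headers for Firecrawl using the Bearer scheme shown above."""
    if not api_key:
        # No key supplied: the caller should ask the user for one.
        raise ValueError("Firecrawl API key missing; ask the user for it")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


def apify_headers(token: str) -> dict[str, str]:
    """Build headers for Apify using the Bearer scheme shown above."""
    if not token:
        raise ValueError("Apify token missing; ask the user for it")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```

Passing the Apify token in the header rather than as ?token= keeps it out of URLs that may end up in logs.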
Base URL: https://api.firecrawl.dev/v2
POST /v2/scrape
Authorization: Bearer fc-YOUR_API_KEY
Content-Type: application/json
{
"url": "https://example.com",
"formats": ["markdown"], // Options: markdown, html, rawHtml, links, screenshot, json
"onlyMainContent": true, // Strips nav/footer/ads
"waitFor": 0, // ms to wait before scraping (for JS-heavy pages)
"timeout": 30000, // ms
"blockAds": true,
"proxy": "auto" // "auto", "basic", or "stealth"
}
Response: { "success": true, "data": { "markdown": "...", "metadata": {...} } }
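
The scrape call above could be issued from Python roughly as follows. This is a sketch using only the standard library; the helper names and the 60-second client timeout are assumptions, and only a few of the request fields listed above are sent.

```python
import json
import urllib.request

FIRECRAWL_BASE = "https://api.firecrawl.dev/v2"


def build_scrape_request(api_key, url, formats=("markdown",), only_main_content=True):
    """Build (but do not send) a POST /v2/scrape request."""
    payload = {
        "url": url,
        "formats": list(formats),
        "onlyMainContent": only_main_content,
    }
    return urllib.request.Request(
        f"{FIRECRAWL_BASE}/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def scrape(api_key, url, timeout_s=60):
    """Send the request and return the "data" object from the response."""
    request = build_scrape_request(api_key, url)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    if not body.get("success"):
        # Surface API-level failures instead of returning partial data.
        raise RuntimeError(f"Firecrawl scrape failed: {body}")
    return body["data"]
```

Calling scrape("fc-YOUR_API_KEY", "https://example.com")["markdown"] would then give the page as markdown, assuming the response shape shown above.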
Crawling is asynchronous: the request starts a job, and you then poll for results.
POST /v2/crawl
{
"url": "https://docs.example.com",
"limit": 50, // Max pages
"maxDepth": 3,
"allowExternalLinks": false,
"scrapeOptions": {
"formats": ["markdown"],
"onlyMainContent": true
}
}
Response: { "success": true, "id": "crawl-job-id" }
Poll status:
GET /v2/crawl/{crawl-job-id}
Response: { "status": "completed", "total": 50, "data": [...] }
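
The start-then-poll pattern above can be sketched as a small loop. Here fetch_status is a hypothetical callable that performs GET /v2/crawl/{crawl-job-id} and returns the parsed JSON; the "failed" status value, the 2-second interval and the attempt cap are assumptions rather than documented values.

```python
import time
from typing import Callable


def poll_crawl(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    max_attempts: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the crawl reports "completed"."""
    for _ in range(max_attempts):
        status = fetch_status()
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":  # assumed failure value
            raise RuntimeError(f"Crawl failed: {status}")
        sleep(interval_s)
    raise TimeoutError("Crawl did not complete within the polling budget")
```

Injecting fetch_status and sleep keeps the loop independent of any HTTP client and makes it easy to exercise without network access.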
POST /v2/map
{ "url": "https://example.com" }
Response: { "success": true, "links": [{ "url": "...", "title": "..." }] }
POST /v2/search
{
"query": "best web scraping tools 2025",
"limit": 5,
"scrapeOptions": { "formats": ["markdown"] }
}
Response: { "data": [{ "url": "...", "title": "...", "markdown": "..." }] }
POST /v2/batch/scrape
{
"urls": ["https://a.com", "https://b.com"],
"formats": ["markdown"]
}
Returns a job ID; poll with GET /v2/batch/scrape/{id}
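
A small sketch of preparing a batch scrape: it builds the request body shown above and the URL to poll afterwards. The helper names are hypothetical, and removing duplicate URLs is a convenience added here, not something the API is known to require.

```python
FIRECRAWL_BASE = "https://api.firecrawl.dev/v2"


def build_batch_payload(urls, formats=("markdown",)):
    """Body for POST /v2/batch/scrape, with duplicate URLs removed in order."""
    unique_urls = list(dict.fromkeys(urls))
    return {"urls": unique_urls, "formats": list(formats)}


def batch_status_url(job_id: str) -> str:
    """URL to poll for a batch job: GET /v2/batch/scrape/{id}."""
    return f"{FIRECRAWL_BASE}/batch/scrape/{job_id}"
```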
Base URL: https://api.apify.com/v2
Auth: Pass token as query param ?token=YOUR_TOKEN or in Authorization header.
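Apify runs "Actors" (pre-built scrapers). The flow is:
1. Start a run of the Actor; the response includes the runId and defaultDatasetId.
2. Poll the run until its status is SUCCEEDED.
3. Fetch the results from the run's dataset.

POST /v2/acts/{actorId}/runs?token=YOUR_TOKEN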
Content-Type: application/json
{ ...actor-specific input... }
Response:
{
"data": {
"id": "RUN_ID",
"status": "RUNNING",
"defaultDatasetId": "DATASET_ID"
}
}
Common Actor IDs:
- apify/web-scraper: generic JS scraper
- apify/google-search-scraper: Google SERPs
- compass/crawler-google-places: Google Maps
- apify/instagram-scraper: Instagram
- clockworks/free-tiktok-scraper: TikTok
- apify/amazon-scraper: Amazon products

In URL paths, replace the slash in an Actor ID with a tilde (for example, apify~web-scraper).

GET /v2/acts/{actorId}/runs/{runId}?token=YOUR_TOKEN
Poll until status is SUCCEEDED or FAILED. Recommended interval: 5 seconds.
GET /v2/datasets/{datasetId}/items?token=YOUR_TOKEN&format=json
Optional params: format (json/csv/xlsx/xml), limit, offset
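
The poll and fetch steps can be sketched as follows. The helpers only build URLs and run the polling loop; get_run is a hypothetical callable that performs the GET request and returns parsed JSON, assumed to be wrapped in "data" like the start-run response. Treating ABORTED and TIMED-OUT as failures goes beyond the two statuses named above and is an assumption.

```python
import time
from typing import Callable, Optional
from urllib.parse import urlencode

APIFY_BASE = "https://api.apify.com/v2"
# FAILED is named in the text; the other two are assumed terminal states.
FAILURE_STATES = {"FAILED", "ABORTED", "TIMED-OUT"}


def run_status_url(actor_id: str, run_id: str, token: str) -> str:
    """URL for GET /v2/acts/{actorId}/runs/{runId}."""
    actor = actor_id.replace("/", "~")  # slash is written as a tilde in paths
    query = urlencode({"token": token})
    return f"{APIFY_BASE}/acts/{actor}/runs/{run_id}?{query}"


def dataset_items_url(
    dataset_id: str,
    token: str,
    fmt: str = "json",
    limit: Optional[int] = None,
    offset: Optional[int] = None,
) -> str:
    """URL for GET /v2/datasets/{datasetId}/items with optional paging."""
    params: dict[str, object] = {"token": token, "format": fmt}
    if limit is not None:
        params["limit"] = limit
    if offset is not None:
        params["offset"] = offset
    return f"{APIFY_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


def wait_for_run(
    get_run: Callable[[], dict],
    interval_s: float = 5.0,
    max_attempts: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the run succeeds; return its "data" object."""
    for _ in range(max_attempts):
        run = get_run()["data"]
        if run["status"] == "SUCCEEDED":
            return run
        if run["status"] in FAILURE_STATES:
            raise RuntimeError(f"Actor run ended with status {run['status']}")
        sleep(interval_s)
    raise TimeoutError("Actor run did not finish within the polling budget")
```

The default interval follows the 5-second recommendation above; the returned object's defaultDatasetId can be passed straight to dataset_items_url.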
For short runs, use the sync endpoint — it waits and returns dataset items directly:
POST /v2/acts/{actorId}/run-sync-get-dataset-items?token=YOUR_TOKEN
Content-Type: application/json
{ ...actor input... }
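
A sketch of calling the sync endpoint from Python with the standard library. The helper names and the 300-second client timeout are assumptions; the token is sent in the Authorization header instead of the query string, which the auth notes above also allow.

```python
import json
import urllib.request

APIFY_BASE = "https://api.apify.com/v2"


def build_sync_request(actor_id: str, token: str, actor_input: dict):
    """Build (but do not send) a run-sync-get-dataset-items request."""
    actor = actor_id.replace("/", "~")  # slash is written as a tilde in paths
    return urllib.request.Request(
        f"{APIFY_BASE}/acts/{actor}/run-sync-get-dataset-items",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_sync(actor_id: str, token: str, actor_input: dict, timeout_s: float = 300):
    """Send the request and return the dataset items from the response body."""
    request = build_sync_request(actor_id, token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

Because the call blocks until the Actor finishes, it suits short runs only; longer jobs are better served by the start, poll and fetch flow.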
Google Search Scraper:
{ "queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10 }
Google Maps Scraper:
{ "searchStringsArray": ["restaurants in Mumbai"], "maxCrawledPlaces": 20 }
Web Scraper (generic):
{
"startUrls": [{ "url": "https://example.com" }],
"pageFunction": "async function pageFunction(context) { const $ = context.jQuery; return { title: $('title').text() }; }",
"maxPagesPerCrawl": 10
}
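
Writing pageFunction inline as a one-line JSON string is error-prone. One option, sketched here with a hypothetical helper name, is to keep the JavaScript in a multi-line Python string and let json.dumps handle the escaping when the input is sent.

```python
import json

# The same page function as above, kept readable as a multi-line string.
PAGE_FUNCTION = """
async function pageFunction(context) {
    const $ = context.jQuery;
    return { title: $('title').text() };
}
""".strip()


def web_scraper_input(start_urls, max_pages=10):
    """Build the input object for the generic Web Scraper actor."""
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "pageFunction": PAGE_FUNCTION,
        "maxPagesPerCrawl": max_pages,
    }


# Serialized form, ready to use as the request body.
body = json.dumps(web_scraper_input(["https://example.com"]))
```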
Fetch the results of a run with GET /v2/datasets/{id}/items.

See references/code-templates.md for ready-to-run Python and JavaScript code for both APIs.

- If an Apify run fails, check its log: GET /v2/acts/{id}/runs/{runId}/log
- Check for success: false in Firecrawl responses
- Respect robots.txt by default
- For JS-heavy pages, increase waitFor (Firecrawl) or use Playwright/Puppeteer actors (Apify)
- Test with a small limit before scaling
- Use onlyMainContent: true in Firecrawl to remove nav/footer noise