
InfoSeek

Deep web information search and archival skill for comprehensive research on persons, organizations, or products. Uses multiple search engines (Baidu, Tavily...
expeditionhub
Uncategorized clawhub v2.0.0 100000 Key: not required


InfoSeek - Deep Web Search & Archival

Overview

InfoSeek performs comprehensive web research on any subject (person, organization, product) across multiple search engines, deduplicates results, extracts clean content, and archives everything with full metadata in organized folders.

Prerequisites

Before executing a search task, verify these skills are installed:

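import os
from pathlib import Path

# OPENCLAW_WORKSPACE must point at the workspace root; fail early if it is unset.
workspace = os.environ.get('OPENCLAW_WORKSPACE')
if not workspace:
    raise SystemExit('OPENCLAW_WORKSPACE is not set')
skills_dir = Path(workspace) / 'skills'

# Each required skill is expected as a folder under <workspace>/skills.
required = ['baidu-search', 'tavily', 'Multi-Search-Engine', 'agent-browser-clawdbot-0.1.0']
missing = [s for s in required if not (skills_dir / s).exists()]
if missing:
    print('Missing skills:', ', '.join(missing))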

If any are missing, instruct the user to install them:

openclaw skills install baidu-search
openclaw skills install tavily-search
openclaw skills install multi-search-engine

Workflow

Phase 0: Task Setup

  1. Confirm the search subject — name, organization, or product
  2. Collect optional context — background info, time range, output format (default: .md), special requirements
  3. Check dependencies — run the prerequisite check above
  4. Create archive folder — run:

```bash
python scripts/infoseek_helper.py create-folder "<subject>"
```

Phase 1: Multi-Engine Deep Search

Execute searches across all available engines. Each engine runs independently.

1.1 Baidu Search (100+ pages)

Use the baidu-search skill:

  • Query: "<subject> <background_context>"
  • Depth: 100+ pages
  • Record: URL, title, website name, publish date for each result
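The per-result fields listed above could be held in a small record type and written out one JSON object per line, so that results from every engine end up in one temporary file. This is only a sketch: the `SearchResult` class, the `engine` field, and the JSON Lines layout are assumptions for illustration, not the format `infoseek_helper.py` actually expects.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SearchResult:
    """One search hit, with the fields Phase 1 asks to record (hypothetical type)."""
    url: str
    title: str
    website: str
    publish_date: str  # as reported by the engine; may be empty
    engine: str        # assumed extra field: which engine returned the hit


def to_jsonl(results):
    """Serialize results as JSON Lines, keeping non-ASCII titles readable."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in results)
```

Keeping the engine name on each record would make it straightforward to fill in the "Engines used" line of the final task report.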

1.2 Tavily Search

Use the tavily_search tool:

query: "<subject> <background_context>"
search_depth: advanced
max_results: 50

1.3 Multi-Search-Engine

Use the multi-search-engine skill across multiple engines simultaneously.

1.4 Browser Deep-Crawl

For discovered URLs, use the browser tool to:

  1. Open each page
  2. Extract body content (filter ads, sidebars, comments)
  3. Extract metadata: title, author, editor, date, website name
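When the browser tool returns raw HTML, the metadata step above might look roughly like the following. This sketch swaps in Python's standard `html.parser` rather than the browser tool, and the choice of meta tags (`og:title`, `author`, `article:published_time`, `og:site_name`) is an assumption about what pages commonly expose; there is no widely used tag for the editor, so that field is left out here.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects <title> text and <meta name/property=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content"):
                self.meta[key.lower()] = attrs["content"].strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> dict:
    """Return title, author, date, and website name, empty when not found."""
    parser = MetaExtractor()
    parser.feed(html)
    m = parser.meta
    return {
        "title": m.get("og:title") or parser.title.strip(),
        "author": m.get("author", ""),
        "date": m.get("article:published_time", ""),
        "website": m.get("og:site_name", ""),
    }
```

Pages that carry none of these tags return empty strings, which leaves room for the fallback described under Troubleshooting for unknown dates.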

Phase 2: Deduplication

Run URL deduplication on all collected results:

python scripts/infoseek_helper.py deduplicate "<temp_results_file>"

The script normalizes each URL (removes www, strips tracking parameters, unifies http/https, and drops trailing slashes), then checks it against the SQLite database and skips any URL that is already recorded.
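As an illustration of the normalization and lookup described above, here is a minimal sketch using only the Python standard library. The tracking-parameter list, the `urls` table name and schema, and both function names are assumptions; the actual logic in `infoseek_helper.py` may differ.

```python
import sqlite3
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed tracking parameters; the real script's list is not documented here.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Drop www, tracking params, fragments, and trailing slashes; force https."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"  # keep non-default ports
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in TRACKING_KEYS
        and not k.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(("https", host, parts.path.rstrip("/"), urlencode(query), ""))


def deduplicate(conn: sqlite3.Connection, urls):
    """Return normalized URLs that are neither in the database nor repeated in this batch."""
    conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")
    seen_in_batch = set()
    unique = []
    for raw in urls:
        norm = normalize_url(raw)
        in_db = conn.execute("SELECT 1 FROM urls WHERE url = ?", (norm,)).fetchone()
        if in_db or norm in seen_in_batch:
            continue  # skip only, nothing is deleted
        seen_in_batch.add(norm)
        unique.append(norm)
    return unique
```

The sketch only reads from the database during deduplication; writing new URLs is left to the later "record in database" step, mirroring the order of the phases.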

Phase 3: Content Extraction & Storage

For each unique URL:

  1. Extract content using the browser tool — get title, body, metadata
  2. Filter content — remove ads, sidebars, navigation, comments, related articles, footers
  3. Generate filename:

```bash
python scripts/infoseek_helper.py generate-filename \
  --date "<date>" --title "<title>" --website "<site>" --format "<ext>"
```

     Format: YYYYMMDD-title-website.ext

  4. Save the file:

```bash
python scripts/infoseek_helper.py save-content \
  --folder "<archive_path>" --filename "<name>" --url "<url>" \
  --website "<site>" --source "<source>" --date "<date>" \
  --title "<title>" --author "<author>" --editor "<editor>" \
  --content "<body>" --task "<subject>"
```

  5. Record in database:

```bash
python scripts/infoseek_helper.py add-url \
  --url "<normalized_url>" --task "<subject>" --filename "<name>"
```

Phase 4: Task Report

Output a summary when complete:

```
InfoSeek Task Report
====================
Subject: {query}
Engines used: {engines}
Total found: {total} | Duplicates skipped: {dupes} | New archived: {new}
Files saved: {count}
Location: {path}
Database records: {db_total}
```

File Naming

Format: YYYYMMDD-title-website.ext

  • Date: 8 digits (YYYYMMDD) from page metadata
  • Title: page title (strip special chars <>:"/\|?*)
  • Website: domain or media name
  • Extension: md (default), json, txt, csv, xlsx, html, docx

If the filename already exists, append an 8-character hash to prevent overwrites.

Output Formats

All formats include full metadata (URL, website, source, date, title, author, editor) plus body content.

  • .md: Markdown with metadata table
  • .json: Structured JSON with metadata object and content field
  • .txt: Plain text with header metadata
  • .csv: One row per article, all metadata as columns
  • .xlsx: Excel spreadsheet with metadata columns
  • .html: Styled HTML page with metadata table
  • .docx: Word document with metadata paragraph

Storage Structure

```
{workspace}/
├── infoseek-archives/
│   ├── <subject_1>/
│   │   ├── 20260404-title-website.md
│   │   └── ...
│   └── <subject_2>/
└── infoseek/
    ├── infoseek.db      # SQLite dedup database
    ├── infoseek.log     # Operation log
    └── backups/
```

Deletion Policy

Strict data retention: no permanent deletes without confirmation.

| Operation | Confirmation | Method |
|-----------|--------------|--------|
| Bulk folder delete | Required | Move to recycle bin |
| Single file delete | Required | Move to recycle bin |
| Dedup skip | Automatic | Skip only, no delete |
| Database cleanup | Required | Mark as deleted |

Process:

  1. List files to delete (name, URL, date)
  2. Ask user: "Confirm deletion? Files go to recycle bin and can be recovered."
  3. On confirmation, move to recycle bin (Windows: PowerShell, Mac/Linux: system trash)
  4. Update database, log the deletion, confirm to user

Never:

  • Delete without user consent
  • Permanently delete (bypass recycle bin)
  • Delete without logging
  • Delete without updating database

Configuration

Override defaults in task instructions:

  • Search depth: default 100 pages, specify e.g. "150 pages"
  • Time range: default unlimited, specify e.g. "2020-01-01 to 2026-04-07"
  • Output format: default md, specify e.g. "xlsx"
  • Storage path: default {workspace}/infoseek-archives/, specify custom path

Troubleshooting

| Problem | Solution |
|---------|----------|
| Missing search skill | openclaw skills install <name> |
| Date extraction fails | Check page metadata; use 00000000 for unknown |
| Encoding errors | Ensure UTF-8; on Windows enable Unicode UTF-8 in region settings |
| Database corruption | python scripts/infoseek_helper.py restore-backup |

Security & Privacy

  • All searches use public channels only
  • No personal data stored, only search results
  • SQLite database is local, never uploaded
  • Deletions use system recycle bin (recoverable)
  • All operations logged and auditable
  • No telemetry, no external data transmission

Version History

| Version | Date | Notes |
|---------|------|-------|
| 2.0.0 | 2026-04-07 | Full rewrite: SQLite dedup, URL normalization, HTML parsing, multi-engine integration |
| 1.0.0 | 2026-04-06 | Initial version (deprecated) |

Listed versions (1 total)

  • v2.0.0 (current), 2026-05-07 18:39, marked safe