← 返回
未分类 Key 中文

Gmail Link Archiver

Connects to Gmail via IMAP, filters emails by subject prefix keyword in a specified mailbox, crawls links found in filtered emails using Playwright (to bypas...
通过IMAP连接Gmail,按主题前缀关键词过滤指定邮箱的邮件,使用Playwright抓取过滤后邮件中的链接(以绕过…)
jinwangmok jinwangmok 来源
未分类 clawhub v1.1.0 1 版本 100000 Key: 需要
★ 0
Stars
📥 379
下载
💾 1
安装
1
版本
#latest

概述

Gmail Link Archiver

Archive web content from your email links. This skill connects to Gmail via IMAP, filters emails by a subject prefix keyword, crawls every link using Playwright (headless Chromium), converts pages to Markdown, and saves them to your OpenClaw workspace.

Quick Start

1. Install dependencies (one-time)

bash references/setup.sh

This automatically installs:

  • playwright (Python) + Chromium browser binary
  • html2text for HTML→Markdown conversion

2. First run — interactive setup

python3 references/gmail_link_archiver.py

The first run will prompt you for:

SettingDescriptionDefault
-------------------------------
IMAP serverGmail IMAP hostimap.gmail.com
IMAP portSSL port993
Gmail addressYour full email address
App passwordGmail App Password (NOT your regular password)
Default mailboxIMAP folder to searchINBOX
Subject prefixFilter emails whose subject starts with this
Workspace pathWhere to save Markdown files~/openclaw-workspace/mail-archive

Credentials are saved locally to ~/.config/gmail-link-archiver/config.json with 0600 permissions. They are never transmitted or logged.

> Gmail App Password: You need to generate an App Password at

> https://myaccount.google.com/apppasswords (requires 2FA enabled).

3. Subsequent runs

After the first setup, subsequent runs will read credentials from the saved config:

# Use saved config defaults
python3 references/gmail_link_archiver.py

# Override mailbox and prefix on the fly
python3 references/gmail_link_archiver.py --mailbox "INBOX" --subject-prefix "[Newsletter]"

# Save to a different workspace
python3 references/gmail_link_archiver.py --workspace ~/my-archive

# Limit number of links to crawl
python3 references/gmail_link_archiver.py --max-links 10

# Re-run the setup interview
python3 references/gmail_link_archiver.py --reconfigure

How It Works

  1. Connect — Authenticates to Gmail via IMAP SSL
  2. Filter — Searches the specified mailbox for emails matching the subject prefix
  3. Extract — Parses email bodies (HTML + plain text) to find HTTP/HTTPS links
  4. Crawl — Opens each link in headless Chromium via Playwright (bypasses bot detection, renders JavaScript)
  5. Convert — Transforms the crawled HTML into clean Markdown with metadata headers
  6. Save — Writes each Markdown file to the workspace directory

Pipeline Diagram

Gmail IMAP ──► Filter by Subject ──► Extract Links
                                          │
                                          ▼
                         Playwright + Chromium (headless)
                                          │
                                          ▼
                              HTML → Markdown (html2text)
                                          │
                                          ▼
                           Save to OpenClaw Workspace

CLI Reference

usage: gmail_link_archiver.py [-h] [--mailbox MAILBOX]
                               [--subject-prefix PREFIX]
                               [--workspace PATH]
                               [--max-links N]
                               [--reconfigure]

Options:
  --mailbox, -m        IMAP mailbox to search (default: from config)
  --subject-prefix, -s Subject prefix to filter emails
  --workspace, -w      Directory to save Markdown files
  --max-links          Max number of links to crawl (default: 50)
  --reconfigure        Re-run the setup interview

Output Format

Each crawled page is saved as a Markdown file with YAML frontmatter:

---
source: https://example.com/article
crawled_at: 2026-03-27T12:00:00Z
---

# Article Title

Article content converted to clean Markdown...

Files are named using a sanitized version of the URL plus a short hash for uniqueness.

Example Usage with Claude

Ask Claude to run the archiver:

> "Run the Gmail Link Archiver to crawl links from my emails with subject starting with '[ReadLater]'"

Claude will execute:

python3 references/gmail_link_archiver.py --subject-prefix "[ReadLater]"

Or to set up fresh:

> "Set up the Gmail Link Archiver with my credentials"

python3 references/gmail_link_archiver.py --reconfigure

Troubleshooting

"App password" rejected?

  • Ensure 2-Step Verification is enabled on your Google account
  • Generate a new App Password at https://myaccount.google.com/apppasswords
  • Use the 16-character password without spaces

Playwright/Chromium issues?

# Reinstall Chromium
python3 -m playwright install chromium
# Install system dependencies (Linux)
sudo python3 -m playwright install-deps chromium

No emails found?

  • Check the mailbox name (use INBOX, [Gmail]/All Mail, etc.)
  • Verify the subject prefix matches exactly (case-sensitive)
  • Try a broader prefix

Permission denied on config file?

chmod 600 ~/.config/gmail-link-archiver/config.json

Security

  • Credentials are stored locally at ~/.config/gmail-link-archiver/config.json
  • File permissions are set to 0600 (owner read/write only)
  • Credentials are never transmitted anywhere except to the IMAP server
  • Credentials are never logged or printed to stdout
  • Use Gmail App Passwords (not your main Google password)
  • The config directory has 0700 permissions

Requirements

  • Python 3.8+
  • Linux (Ubuntu/Debian) for MVP
  • Gmail account with IMAP enabled and App Password
  • Internet connection for IMAP and web crawling

版本历史

共 1 个版本

  • v1.1.0 当前
    2026-05-03 10:59 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

dev-programming

Simple File Backup

jinwangmok
在相同目录中创建带时间戳的备份副本。适用于用户想在修改前保存文件副本的情况。
★ 0 📥 442

Word Count

jinwangmok
统计文件中的字数、行数和字符数,输出一行摘要。用于需要了解文件长度或文档字数时。
★ 0 📥 406
dev-programming

Simple File Tree

jinwangmok
显示文件夹的目录树结构,以可视化的树形格式展示文件和子目录。在用户想查看某文件夹中包含哪些文件时使用。
★ 1 📥 463