概述

Newspaper Ad Recognition

Overview

This skill provides a complete workflow for recognizing and classifying advertisements in online newspaper archive pages (e.g., NewspaperSG, Chronicling America). It handles the entire pipeline from page access to result archival.

⚠️ Critical: Network Access (SSRF Policy)

The browser tool (browser) has a strict SSRF policy and cannot navigate to internal/private IP addresses. For sites like eresources.nlb.gov.sg:

> ✅ 正确方式: Use curl -x http://127.0.0.1:7897 (Clash proxy) via exec tool

>

> ❌ 错误方式: Browser tool navigation → blocked by SSRF policy

See Step 2 for the complete working command pattern.

Workflow

Step 1: Environment Check

Check Tesseract OCR availability:

```powershell

# Tesseract common install paths (check in order):

"C:\Program Files\Tesseract-OCR\tesseract.exe" # Windows standard

"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"

# Run: & "C:\Program Files\Tesseract-OCR\tesseract.exe" --version

```

Verify proxy (Clash: 127.0.0.1:7897) is running

Step 2: Page Access — NewspaperSG (eresources.nlb.gov.sg)

Do NOT use browser tool — blocked by SSRF. Use exec with curl + proxy:

# 1. GET the page (also saves session cookies to cookies.txt)
curl.exe -x http://127.0.0.1:7897 -s -c cookies.txt `
  "https://eresources.nlb.gov.sg/newspapers/digitised/article/ARTICLE_ID" `
  --max-time 30

# 2. Download newspaper images (MUST have both cookies AND Referer header)
#    Without both → returns ~4KB thumbnail instead of actual image
curl.exe -x http://127.0.0.1:7897 -s -o "area_1.webp" `
  "https://eservice.nlb.gov.sg/newspapercontent/digitised/article/ARTICLE_ID.webp?area=1&width=660&ct=ARTICLE+ILLUSTRATION" `
  --max-time 20 `
  -b "cookies.txt" `
  -H "Referer: https://eresources.nlb.gov.sg/newspapers/digitised/article/ARTICLE_ID" `
  -H "Accept: image/webp,image/*,*/*;q=0.8"

> ⚠️ Image download checklist — both are REQUIRED:

> - -b cookies.txt (session cookies from step 1)

> - -H "Referer: https://eresources.nlb.gov.sg/..." (exact article URL)

> - Without these → ~4KB thumbnail images, OCR will return nothing useful

Step 3: Determine Page Type

After fetching the article HTML, look for ct= parameters in image URLs:

ct value	Meaning
----------	---------
`ARTICLE+ILLUSTRATION`	Article content / illustrations — NOT ads
`ADVERTISEMENT`	Display ad
`CLASSIFIED`	Classifieds section
Other	Investigate further

# Quick check: count ad-related content types in HTML
Select-String -Path "article.html" -Pattern "ct=ADVERTISEMENT|ct=CLASSIFIED"

Step 4: OCR Analysis (if ct values are all ARTICLE or mixed)

Download key areas (top, middle, bottom — e.g. area=1,10,20,26) and run Tesseract:

$tesseract = "C:\Program Files\Tesseract-OCR\tesseract.exe"
& $tesseract "area_5.webp" "area_5" -l eng --psm 6

Then grep the OCR output for ad keywords:

$adPatterns = @("ADVERTISEMENT","Advertisement","FOR SALE","FOR HIRE","VACANCY","Tel:","Phone:","LIMITED OFFER","SPECIAL","DISCOUNT","BUY ONE","FREE","Pte Ltd","Co.","Fax:")
foreach ($txt in Get-ChildItem "ocr_output\*.txt") {
    $content = Get-Content $txt.FullName -Raw
    foreach ($p in $adPatterns) {
        if ($content -match $p) { Write-Host "MATCH: $p in $($txt.Name)" }
    }
}

Step 5: Ad Classification

Classify identified ads into categories:

Commercial Ads (product/service promotion)
Public Service Ads (government/non-profit)
Classified Ads (recruitment/rent/second-hand)

Step 5.5: Detailed Ad Location Reporting ⭐ NEW

After identifying ads, provide a detailed location report with the following format:

Required information per ad:

Ad type: Image-based (有图广告) or Text-only (无图广告)
Exact location: Use relative position descriptions (e.g., "左下区域", "右下角", "左侧边栏")
Size estimate: Approximate percentage of page area
Content summary: Key products/services advertised

Output template:

## 📍 广告位置分析报告 - [Newspaper Name] [Date] 第X页

### ✅ 广告1：[Company/Product Name]（**有图广告/无图广告**）
- **位置**：页面**[位置描述]**（约占总面积X%）
- **类型**：[Commercial/Public Service/Classified] - [简短描述]
- **内容**：
  - [关键点1]
  - [关键点2]
  - [联系方式/地址]

### ⚠️ 广告2：[...]

---

## 📊 统计总结
| 广告位置 | 是否有图 | 广告类型 | 占据面积 |
|---------|---------|---------|----------|
| [位置] | ✅ 有图/❌ 无图 | [类型] | ~X% |

**结论**：该页面包含**X个有图广告** + **Y个无图文字广告**，广告总面积约占页面Z%。

How to determine if ad is image-based:

✅ Image-based: Contains product photos, promotional graphics, logos, decorative elements
❌ Text-only: Pure text layout, no visual elements, resembles classified ads

Location description examples:

左上/右上/左下/右下区域 (upper-left/upper-right/lower-left/lower-right area)
左侧边栏/右侧边栏 (left/right sidebar)
顶部横幅/底部横幅 (top/bottom banner)
页面中央 (center of page)

Step 6: Result Archival

Save results in structured format:

[
  {
    "ad_id": "ST19950715_P33_AD001",
    "page": "33",
    "type": "Commercial",
    "ocr_text": "...",
    "slice_path": "area_5.webp"
  }
]

Archive path: {workspace}/newspaper_ads/{date}_{newspaper_name}/

Dependencies

Dependency	Version Required	Notes
---------------------	-----------------	-------
Tesseract OCR	≥5.4.0	Windows: `C:\Program Files\Tesseract-OCR\tesseract.exe` (may not be in PATH)
curl.exe	Any	Use `curl.exe` NOT PowerShell alias `curl`
Clash proxy	Running on 127.0.0.1:7897	For SSRF-blocked domains

> ⚠️ PowerShell tips:

> - Always use curl.exe (not curl alias) to avoid Invoke-WebRequest conflicts

> - Write scripts to .ps1 files and run with powershell -ExecutionPolicy Bypass -File script.ps1

> - The -- flag in Tesseract commands causes parsing errors when inlined — use script files

> - URL parameters with & cause PowerShell parsing errors — use script files or string concatenation

⚠️ Common Mistakes to Avoid

Using browser tool for SSRF-blocked domains → always use curl + proxy
Downloading images without cookies OR Referer → 4KB thumbnails instead of actual images
Slicing with querySelectorAll('img') → captures website UI icons, not newspaper images

✅ Correct: target img.image-content (the actual newspaper image elements)

Inline PowerShell commands with &, --, ||, && → use script files instead

Troubleshooting

Issue	Solution
------------------------	--------------------------------------------------------------------------
Browser SSRF blocked	Use curl + proxy: `curl.exe -x http://127.0.0.1:7897`
Images are 4KB	Missing cookies or Referer header — see Step 2 command pattern
Tesseract not found	Check `C:\Program Files\Tesseract-OCR\tesseract.exe`
PowerShell parsing error	Write command to `.ps1` file, avoid `--`, `&`, `	` inline
Incomplete OCR text	Adjust `--psm 6` mode, or crop ad regions separately
Missed ads	Download all areas and OCR systematically, don't rely on ct= alone

Resources

scripts/

scripts/extract_ocr.py: Extracts OCR text from ad slices using Tesseract
scripts/archive_results.py: Saves results to structured JSON/CSV files

references/

references/classification_rules.json: Customizable rules for ad classification

assets/

assets/ad_keywords.txt: List of keywords to identify ads (can be extended)

版本历史

共 1 个版本

v1.0.0 初始版当前

2026-05-11 01:39 安全安全

安全检测

腾讯云安全 (Keen)

安全，无风险

查看报告

腾讯云安全 (Sanbu)