
Chaoxing Download

Download PDF documents from Chaoxing (超星) contest/platform viewer URLs and convert to TXT. Use when user wants to download files from contestyd.chaoxing.com,...
Downloads PDFs from Chaoxing (超星) platform URLs and converts them to TXT; intended for files hosted on contestyd.chaoxing.com and similar hosts.
artminding
Uncategorized · clawhub · v1.3.0 · 1 version · 100000 · Key: not required

Overview

Chaoxing Document Downloader (超星文档下载)

Download PDFs from Chaoxing WPS viewer URLs using the getYunFiles API.

Core Principle

Every Chaoxing viewer URL contains an objectid (32-char hex). Call the getYunFiles API to get the direct PDF link — no cookies or auth tokens needed.

Arguments

$ARGUMENTS contains the user's download request — typically one or more entries with page count, name, and viewer URL. Parse them to extract the data.

Download Method

Step 1: Extract objectid from each URL

Find the objectid=([a-f0-9]{32}) parameter in each viewer URL.
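
As a minimal sketch of this step, the pattern above can be applied with Python's re module. The function name extract_objectid and the example URL are hypothetical and only illustrate the matching; real viewer URLs may carry other parameters.

```python
import re
from typing import Optional

# Pattern taken from the step above: objectid=<32 lowercase hex characters>.
OBJECTID_RE = re.compile(r"objectid=([a-f0-9]{32})")


def extract_objectid(viewer_url: str) -> Optional[str]:
    """Return the 32-char objectid found in a viewer URL, or None if absent."""
    match = OBJECTID_RE.search(viewer_url)
    return match.group(1) if match else None


# Hypothetical viewer URL, made up for illustration only.
example = (
    "https://contestyd.chaoxing.com/viewer"
    "?objectid=0123456789abcdef0123456789abcdef&resid=1"
)
print(extract_objectid(example))
```

Returning None instead of raising lets a caller report which entries in a request had no usable objectid.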

Step 2: Call getYunFiles API

For each objectid, call:

https://contestyd.chaoxing.com/app/files/{objectid}/getYunFiles?key=allData

Response JSON contains:

  • data.pdf — direct PDF URL on s3.cldisk.com or s3.ananas.chaoxing.com (preferred)
  • data.download — alternative download URL with auth tokens (fallback)
  • data.filename — original filename
  • data.pagenum — page count
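
The request URL and the preferred/fallback choice can be sketched as follows. This assumes the dotted names above (data.pdf, data.download) mean keys nested under a top-level "data" object; the helper names api_url and pick_pdf_url and the sample response are hypothetical, not taken from the API.

```python
import json
from typing import Optional

# Endpoint template from Step 2.
API_TEMPLATE = (
    "https://contestyd.chaoxing.com/app/files/{objectid}/getYunFiles?key=allData"
)


def api_url(objectid: str) -> str:
    """Build the getYunFiles URL for one objectid."""
    return API_TEMPLATE.format(objectid=objectid)


def pick_pdf_url(response: dict) -> Optional[str]:
    """Prefer data.pdf; fall back to data.download; None if both are missing or empty."""
    data = response.get("data") or {}
    return data.get("pdf") or data.get("download") or None


# Made-up response used only to exercise the fallback path (data.pdf is empty).
sample = json.loads(
    '{"data": {"pdf": "", "download": "https://example.com/fallback.pdf",'
    ' "filename": "sample.pdf", "pagenum": 22}}'
)
print(api_url("0123456789abcdef0123456789abcdef"))
print(pick_pdf_url(sample))
```

Treating an empty string the same as a missing key means a blank data.pdf also triggers the fallback.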

Step 3: Download the PDF

Use the data.pdf URL to download directly. No authentication headers needed.

Save to: ~/Downloads/chaoxing_pdfs/{user-supplied name}.pdf
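
One possible way to perform the download with only the standard library is sketched below. The function download_pdf and its overwrite parameter are hypothetical; the skip-if-exists behaviour is borrowed from the Key Notes section, and "~" is expanded explicitly because most file APIs do not expand it.

```python
import os
import shutil
import urllib.request
from pathlib import Path

# Default location from the step above, with "~" expanded explicitly.
DEFAULT_DIR = Path(os.path.expanduser("~/Downloads/chaoxing_pdfs"))


def download_pdf(pdf_url: str, name: str, out_dir: Path = DEFAULT_DIR,
                 overwrite: bool = False) -> Path:
    """Stream pdf_url to <out_dir>/<name>.pdf; keep an existing file unless overwrite is set."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{name}.pdf"
    if target.exists() and not overwrite:
        return target
    # No authentication headers are sent, matching the step above.
    with urllib.request.urlopen(pdf_url) as resp, open(target, "wb") as fh:
        shutil.copyfileobj(resp, fh)
    return target
```

A call such as download_pdf(url, "report") would write report.pdf into the default folder and return its path.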

Step 4: Validate page count

Compare data.pagenum with the user's expected page count. Report any mismatch.
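
The comparison itself is small; a sketch is given here mainly to show one way of reporting the mismatch. The function name page_count_mismatch and the message wording are hypothetical, and the code assumes data.pagenum may arrive either as a number or as a numeric string.

```python
from typing import Optional, Union


def page_count_mismatch(name: str, expected: int,
                        reported: Union[int, str, None]) -> Optional[str]:
    """Return a human-readable mismatch report, or None when the counts agree."""
    if reported is None:
        return f"{name}: API did not report a page count (expected {expected})"
    actual = int(reported)  # tolerate "22" as well as 22
    if actual != expected:
        return f"{name}: expected {expected} pages, API reports {actual}"
    return None


print(page_count_mismatch("sample", 22, 20))
```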

Step 5: Convert PDF to TXT (with OCR fallback)

After downloading each PDF, automatically extract text to a plain text file. Use a two-stage approach: native text extraction first, then OCR fallback for image-based pages.

Prerequisites:

pip install pymupdf rapidocr-onnxruntime

Conversion method (Python):

import os
import sys

import fitz  # PyMuPDF
from rapidocr_onnxruntime import RapidOCR

if sys.platform == "win32":
    sys.stdout.reconfigure(encoding="utf-8")

ocr = RapidOCR()
# fitz.open() does not expand "~", so expand it explicitly
pdf_path = os.path.expanduser("~/Downloads/chaoxing_pdfs/{name}.pdf")
doc = fitz.open(pdf_path)
all_text = []
native_pages = ocr_pages = empty_pages = 0

for i, page in enumerate(doc):
    # Stage 1: Try native text extraction
    native = page.get_text().strip()
    if len(native) > 50:
        # Page header "第N页" means "page N"
        all_text.append(f"--- 第{i+1}页 ---\n{native}")
        native_pages += 1
        continue
    # Stage 2: OCR fallback for image-based pages
    pix = page.get_pixmap(dpi=200)
    img_bytes = pix.tobytes("png")
    result, _ = ocr(img_bytes)
    # Each OCR result item is [box, text, score]; keep only the text
    ocr_text = "\n".join(item[1] for item in result) if result else ""
    if ocr_text:
        label = "OCR"
        ocr_pages += 1
    else:
        label = "empty"
        empty_pages += 1
    all_text.append(f"--- 第{i+1}页 [{label}] ---\n{ocr_text}")

doc.close()
full_text = "\n".join(all_text)

# Write the .txt next to the PDF
txt_path = os.path.splitext(pdf_path)[0] + ".txt"
with open(txt_path, "w", encoding="utf-8") as f:
    f.write(full_text)

# Summary: count pages per stage instead of searching page text for labels
print(f"Native: {native_pages}p, OCR: {ocr_pages}p, Empty: {empty_pages}p, Total: {len(full_text)} chars")

Output files per download:

  • {name}.pdf — original PDF
  • {name}.txt — plain text extraction (native + OCR pages marked with [OCR])

How it works:

  1. Each page is first checked for native text (text layer PDF)
  2. If native text < 50 chars, the page is rendered to image at 200 DPI and processed by RapidOCR
  3. OCR pages are labeled [OCR] in the output for easy identification
  4. Empty pages (no text and OCR fails) are labeled [empty]

CLI Tool (Alternative)

A CLI tool is available at C:/Users/Cameron/Downloads/chaoxing_dl.py:

# Single download
python ~/Downloads/chaoxing_dl.py "VIEWER_URL" -n "filename"

# Batch from JSON file
python ~/Downloads/chaoxing_dl.py --batch tasks.json

# With page validation
python ~/Downloads/chaoxing_dl.py "URL" -n "name" --json

# Force overwrite
python ~/Downloads/chaoxing_dl.py "URL" -n "name" -f

Batch JSON format:

[
  {"name": "filename", "url": "viewer_url_or_objectid", "pages": 22},
  ...
]
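
To illustrate how a file in this format could be read, the sketch below parses the entries and resolves each "url" field to an objectid, accepting either a viewer URL or a bare 32-char id. It is not the CLI tool's actual implementation; load_tasks, resolve_objectid, and the sample data are hypothetical. The "..." in the format above is an elision and is not valid JSON, so the sample lists complete entries only.

```python
import json
import re
from typing import Optional

HEX32_RE = re.compile(r"[a-f0-9]{32}")
OBJECTID_PARAM_RE = re.compile(r"objectid=([a-f0-9]{32})")


def resolve_objectid(url_or_id: str) -> Optional[str]:
    """Accept a bare 32-char objectid or a viewer URL containing objectid=..."""
    if HEX32_RE.fullmatch(url_or_id):
        return url_or_id
    match = OBJECTID_PARAM_RE.search(url_or_id)
    return match.group(1) if match else None


def load_tasks(text: str) -> list:
    """Parse the batch JSON text into dicts with name, objectid, and pages."""
    tasks = []
    for entry in json.loads(text):
        tasks.append({
            "name": entry["name"],
            "objectid": resolve_objectid(entry["url"]),
            "pages": entry.get("pages"),  # optional expected page count
        })
    return tasks


# Made-up batch content for illustration.
sample_batch = json.dumps([
    {"name": "doc-a", "url": "https://example.com/view?objectid=" + "a" * 32, "pages": 3},
    {"name": "doc-b", "url": "b" * 32},
])
for task in load_tasks(sample_batch):
    print(task)
```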

Batch Processing (Without CLI Tool)

For multiple downloads without the CLI, use a bash loop:

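mkdir -p "$HOME/Downloads/chaoxing_pdfs"
for oid_name in "OBJECTID1:name1" "OBJECTID2:name2"; do
  # Split "objectid:name" at the first colon
  oid="${oid_name%%:*}"; name="${oid_name#*:}"
  out="$HOME/Downloads/chaoxing_pdfs/${name}.pdf"
  # Skip files that already exist
  [ -e "$out" ] && { echo "$name: exists, skipping"; continue; }
  info=$(curl -s -L "https://contestyd.chaoxing.com/app/files/$oid/getYunFiles?key=allData")
  pagenum=$(printf '%s' "$info" | grep -o '"pagenum":[0-9]*' | head -1 | cut -d: -f2)
  # Strip the key and quotes; unescape "\/" in case the JSON escapes slashes
  pdf_url=$(printf '%s' "$info" | grep -o '"pdf":"[^"]*"' | head -1 | tr -d '"' | sed -e 's/^pdf://' -e 's#\\/#/#g')
  echo "$name: ${pagenum}p"
  # Quote the output path so names containing spaces work
  curl -s -L -o "$out" "$pdf_url"
done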

Key Notes

  • Only objectid is needed — no resid, tk, addPointInfo, or cookies
  • Always validate page count against user expectation
  • The PDF URLs on s3.cldisk.com are direct links, publicly accessible
  • If data.pdf is empty, fall back to data.download
  • Skip files that already exist unless user specifies overwrite

Version History

1 version in total

  • v1.3.0 (current)
    2026-05-07 07:30 · Safe · Safe

Security Check

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
