
tomark-skill

>Batch-converts .doc/.docx/.wps documents to Markdown. Pipeline: .docx → mammoth; .doc/.wps → WPS COM → .docx → mammoth. Highlights: fixes garbled Chinese text and supports WPS proprietary formats.
user_f474679a

tomark-skill: Batch Document-to-Markdown Conversion

Overview

This skill converts Word/WPS documents (.doc, .docx, .wps) to Markdown (.md).

Supported formats:

  • .docx — converted via mammoth (preserves headings, bold, lists)
  • .doc / .wps — converted via WPS Office COM (KWps.Application) to .docx first, then mammoth
  • Batch or single file, with sub-directory structure preserved
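The routing described in the list above can be pictured as a lookup on the file extension. The following is a minimal sketch only: `pick_route`, `ROUTES`, and the route labels are hypothetical names introduced for illustration, and the actual script may organise this logic differently.

```python
from pathlib import Path
from typing import Optional

# Hypothetical route labels; not taken from the actual script.
DIRECT = "mammoth"                      # .docx goes straight to mammoth
VIA_WPS = "wps_com -> docx -> mammoth"  # .doc/.wps are first re-saved as .docx

ROUTES = {".docx": DIRECT, ".doc": VIA_WPS, ".wps": VIA_WPS}


def pick_route(path: Path) -> Optional[str]:
    """Return the conversion route for a file, or None if it is unsupported."""
    # Lower-case the suffix so "REPORT.DOCX" is treated like "report.docx".
    return ROUTES.get(path.suffix.lower())


if __name__ == "__main__":
    for name in ("report.DOCX", "legacy.doc", "notes.wps", "scan.pdf"):
        print(name, "=>", pick_route(Path(name)))
```

Matching on the lower-cased suffix is a design choice of this sketch; it keeps files with upper-case extensions from being skipped silently.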

Prerequisites (Windows):

  • Python 3.x
  • mammoth library: pip install mammoth
  • pywin32 library: pip install pywin32
  • WPS Office installed (for .doc/.wps files)

Workflow Decision Tree

User request
    │
    ├─ Single file?
    │       └─ Call convert_file() directly
    │
    └─ Folder / batch?
            └─ Run scripts/convert_to_markdown.py with SRC_DIR set

Step 1 — Check Prerequisites

To verify the environment before running:

import glob

# Check mammoth (needed for every conversion)
try:
    import mammoth
    print("mammoth: ok")
except ImportError:
    print("mammoth: MISSING (run: pip install mammoth)")

# Check pywin32 (needed only for .doc/.wps)
try:
    import win32com.client
    print("pywin32: ok")
except ImportError:
    print("pywin32: MISSING (run: pip install pywin32)")

# Check WPS Office. This pattern only covers an install under
# "Program Files (x86)"; WPS installed elsewhere is reported as NOT FOUND.
wps = glob.glob("C:/Program Files (x86)/Kingsoft/WPS Office/*/office6/wps.exe")
print("WPS Office:", wps[0] if wps else "NOT FOUND")

If WPS Office is not installed, .doc/.wps files cannot be converted; only .docx files can be processed, because they need mammoth alone.
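If importing the libraries just to probe for them is undesirable, the same check can be done with `importlib.util.find_spec`, which looks a module up without executing it. This is a sketch under assumptions: `missing_packages` and `REQUIRED` are hypothetical names, and it covers only the two Python packages, not the WPS Office install.

```python
import importlib.util

# Maps the importable top-level module name to its pip package name.
REQUIRED = {"mammoth": "mammoth", "win32com": "pywin32"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be located."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing, run: pip install " + " ".join(missing))
    else:
        print("All Python dependencies found")
```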


Step 2 — Configure and Run

To run the batch conversion script:

  1. Open scripts/convert_to_markdown.py
  2. Set SRC_DIR to the source folder path
  3. Optionally set OUT_DIR (default: /markdown_output/)
  4. Run: python -X utf8 scripts/convert_to_markdown.py
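To illustrate how a batch run could map each source file to an output path while keeping the sub-directory layout, here is a minimal sketch. `plan_outputs` and `SUPPORTED` are hypothetical names, and skipping `~$` lock files is an assumption of this sketch rather than documented behaviour of the script.

```python
from pathlib import Path

SUPPORTED = {".doc", ".docx", ".wps"}


def plan_outputs(src_dir: Path, out_dir: Path):
    """Yield (source, target) pairs, mirroring sub-directories under out_dir."""
    for src in sorted(src_dir.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in SUPPORTED:
            continue
        # Assumption: skip temporary lock files such as "~$report.docx".
        if src.name.startswith("~$"):
            continue
        rel = src.relative_to(src_dir)
        yield src, (out_dir / rel).with_suffix(".md")
```

One caveat of this mapping: two sources that differ only in extension, such as a.doc and a.docx, would map to the same a.md, so a real script may need a tie-breaking rule.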

To convert a single file inline:

from pathlib import Path
# import the helper functions from the script
from scripts.convert_to_markdown import convert_file

src = Path(r"D:\documents\example.doc")
out = Path(r"D:\documents\example.md")
ok, msg = convert_file(src, out)
print(ok, msg)

Step 3 — Interpret the Output

After conversion completes:

  • All .md files are in OUT_DIR, mirroring the original sub-directory structure
  • A report file, 转换报告.md ("conversion report"), is generated listing successes and failures
  • Typical failure causes:
      • WPS Office not installed (for .doc/.wps)
      • Password-protected documents
      • Severely corrupted files
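The layout of the generated report is not documented here, so the following is only a sketch of how per-file results could be summarised into Markdown. It assumes each result carries the same (ok, msg) pair that convert_file() returns in Step 2, plus a path; `build_report` and the heading text are hypothetical.

```python
from typing import Iterable, List, Tuple

Result = Tuple[str, bool, str]  # (relative path, ok flag, message)


def build_report(results: Iterable[Result]) -> str:
    """Render a Markdown summary of successes and failures."""
    results = list(results)
    failed = [r for r in results if not r[1]]
    lines: List[str] = [
        "# Conversion Report",
        "",
        f"- Total: {len(results)}",
        f"- Succeeded: {len(results) - len(failed)}",
        f"- Failed: {len(failed)}",
    ]
    # List failures with their messages so the cause can be diagnosed.
    if failed:
        lines += ["", "## Failures", ""]
        lines += [f"- {path}: {msg}" for path, _, msg in failed]
    return "\n".join(lines) + "\n"
```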

Encoding Notes

  • Always run with python -X utf8 on Windows to avoid GBK encoding issues
  • The script forces sys.stdout to UTF-8 internally
  • Output .md files are always written as UTF-8
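The two script-side points above can be sketched as follows. The helper names `force_utf8_stdout` and `write_markdown` are hypothetical; `sys.stdout.reconfigure` requires Python 3.7 or later, and the script itself may achieve the same effect differently.

```python
import sys
from pathlib import Path


def force_utf8_stdout() -> None:
    """Switch stdout to UTF-8 when the stream supports it (Python 3.7+)."""
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if reconfigure is not None:
        reconfigure(encoding="utf-8", errors="replace")


def write_markdown(path: Path, text: str) -> None:
    """Write text as UTF-8 regardless of the system code page."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Passing encoding explicitly avoids the locale default (e.g. GBK).
    path.write_text(text, encoding="utf-8")
```

Passing `encoding="utf-8"` explicitly matters because, without `-X utf8`, file writes on Windows default to the locale code page.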

Resources

scripts/

  • convert_to_markdown.py — main batch conversion script (configure SRC_DIR at the top)

references/

  • format_guide.md — notes on mammoth output format and post-processing tips

Version History

1 version in total

  • v1.0.0 Initial release (current)
    2026-04-10 20:39