Convert Word/WPS documents (.doc, .docx, .wps) to Markdown (.md) format.
Supported formats:
.docx: converted directly with mammoth (preserves headings, bold, lists)
.doc / .wps: converted to .docx first through WPS Office COM (KWps.Application), then to Markdown with mammoth
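The routing by extension can be sketched as follows. This is a minimal illustration rather than the script's actual code: `conversion_route` and `docx_to_markdown` are hypothetical names, and the second function assumes mammoth's `convert_to_markdown` call, which returns a result object whose `value` holds the Markdown text.

```python
from pathlib import Path

# Extensions mammoth reads directly vs. those that need a WPS Office pass first.
DIRECT_SUFFIXES = {".docx"}
VIA_WPS_SUFFIXES = {".doc", ".wps"}


def conversion_route(path: Path) -> str:
    """Return which pipeline a file would take, based on its extension."""
    suffix = path.suffix.lower()
    if suffix in DIRECT_SUFFIXES:
        return "mammoth"
    if suffix in VIA_WPS_SUFFIXES:
        return "wps-then-mammoth"
    return "unsupported"


def docx_to_markdown(docx_path: Path) -> str:
    """Convert one .docx file to a Markdown string with mammoth."""
    import mammoth  # imported lazily so routing works without it installed

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_markdown(docx_file)
    return result.value
```

The extension check is case-insensitive, so `REPORT.DOCX` and `report.docx` take the same route.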
Prerequisites (Windows):
mammoth library: pip install mammoth
pywin32 library: pip install pywin32
User request
│
├─ Single file?
│   └─ Call convert_file() directly
│
└─ Folder / batch?
    └─ Run scripts/convert_to_markdown.py with SRC_DIR set
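The decision above comes down to whether the request points at a file or a folder. A minimal sketch, assuming a hypothetical helper `collect_sources` that is not part of the script: a single supported file yields a one-element list (the `convert_file()` case), while a folder yields every supported file beneath it (the batch case).

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".doc", ".docx", ".wps"}


def collect_sources(target: Path) -> list[Path]:
    """Return the files to convert for a single-file or folder request."""
    if target.is_file():
        # Single file: keep it only if the extension is supported.
        return [target] if target.suffix.lower() in SUPPORTED_SUFFIXES else []
    if target.is_dir():
        # Folder: include sub-directories, in a stable sorted order.
        return sorted(
            p for p in target.rglob("*")
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
        )
    # Path does not exist: nothing to convert.
    return []
```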
To verify the environment before running:
    import glob

    # Check mammoth (needed for every conversion)
    try:
        import mammoth
        print("mammoth: ok")
    except ImportError:
        print("mammoth: MISSING (run: pip install mammoth)")

    # Check pywin32 (needed for .doc/.wps)
    try:
        import win32com.client
        print("pywin32: ok")
    except ImportError:
        print("pywin32: MISSING (run: pip install pywin32)")

    # Check WPS Office in its default install location
    wps = glob.glob("C:/Program Files (x86)/Kingsoft/WPS Office/*/office6/wps.exe")
    print("WPS Office:", wps[0] if wps else "NOT FOUND")
If WPS Office is not installed, .doc/.wps files cannot be converted.
Only .docx files can be processed with mammoth alone.
To run the batch conversion script:
Open scripts/convert_to_markdown.py
Set SRC_DIR to the source folder path
Optionally set OUT_DIR (default: /markdown_output/ )
Run: python -X utf8 scripts/convert_to_markdown.py
To convert a single file inline:
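    from pathlib import Path

    # Import the helper from the script; run this from the project root
    # so that the scripts package is importable.
    from scripts.convert_to_markdown import convert_file

    src = Path(r"D:\documents\example.doc")
    out = Path(r"D:\documents\example.md")

    # convert_file returns a success flag and a status message.
    ok, msg = convert_file(src, out)
    print(ok, msg)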
After conversion completes:
The .md files are in OUT_DIR, mirroring the original sub-directory structure
A report file, 转换报告.md ("conversion report"), is generated, listing successes and failures
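How an output path can mirror the source tree, and how a success/failure report might be assembled, is sketched below. Both helpers are hypothetical: `mirrored_output_path` and `render_report` are illustrative names, and the report layout is an assumption, since the real format of 转换报告.md is not shown here.

```python
from pathlib import Path


def mirrored_output_path(src_file: Path, src_dir: Path, out_dir: Path) -> Path:
    """Map SRC_DIR/a/b.doc to OUT_DIR/a/b.md, keeping sub-directories."""
    relative = src_file.relative_to(src_dir)
    return (out_dir / relative).with_suffix(".md")


def render_report(results: list[tuple[Path, bool, str]]) -> str:
    """Build a Markdown report from (source, ok, message) tuples."""
    succeeded = [r for r in results if r[1]]
    failed = [r for r in results if not r[1]]
    lines = [
        "# Conversion report",
        "",
        f"Succeeded: {len(succeeded)}, failed: {len(failed)}",
        "",
        "## Failures",
    ]
    # List each failure with its message, or a placeholder when there are none.
    lines += [f"- {src}: {msg}" for src, _, msg in failed] or ["- none"]
    return "\n".join(lines) + "\n"
```

`Path.relative_to` raises `ValueError` if the file is not under `src_dir`, which is a useful guard against writing output outside the mirrored tree.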
Encoding notes:
Use python -X utf8 on Windows to avoid GBK encoding issues
The script sets sys.stdout to UTF-8 internally
.md files are always written as UTF-8
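The encoding handling can be approximated as below. This is a sketch under assumptions, not the script's code: `force_utf8_stdout` and `write_markdown` are hypothetical names, and `sys.stdout.reconfigure` requires Python 3.7 or later.

```python
import sys
from pathlib import Path


def force_utf8_stdout() -> None:
    """Switch stdout to UTF-8 when the stream supports reconfiguration."""
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if callable(reconfigure):
        # errors="replace" avoids a crash on characters the console cannot show.
        reconfigure(encoding="utf-8", errors="replace")


def write_markdown(path: Path, text: str) -> None:
    """Write Markdown as UTF-8 regardless of the system code page."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
```

Passing `encoding="utf-8"` explicitly matters on Chinese-locale Windows, where the default for text files would otherwise be GBK.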
Bundled files:
convert_to_markdown.py: main batch conversion script (configure SRC_DIR at the top)
format_guide.md: notes on mammoth output format and post-processing tips