Overview

EPUB Processing Guide

Core Insight: EPUB is a ZIP Archive

An .epub file is simply a ZIP archive with a specific internal structure. The most reliable way to process any epub is:

Copy the file to the working directory
Rename it from .epub → .zip
Unzip it into a folder
Find and read the navigation/TOC file first (e.g. nav.xhtml, nav.html, toc.ncx)
Then read content files as needed

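This approach works for any well-formed EPUB that is not DRM-encrypted, and it requires no special EPUB libraries. The rename is a convenience only: unzip reads the archive regardless of its file extension.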
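Before extracting, it can help to confirm that the file really is a ZIP container. The sketch below is one way to do that with Python's standard zipfile module. The function name inspect_epub is illustrative, and treating META-INF/encryption.xml as a warning sign is an assumption: that file can indicate DRM, but it can also indicate mere font obfuscation.

```python
import zipfile


def inspect_epub(epub_path):
    """Return basic facts about an EPUB container without extracting it."""
    if not zipfile.is_zipfile(epub_path):
        return {"is_zip": False, "mimetype": None, "has_encryption_xml": False}

    with zipfile.ZipFile(epub_path) as zf:
        names = set(zf.namelist())
        mimetype = None
        if "mimetype" in names:
            # The mimetype entry is a short ASCII string
            mimetype = zf.read("mimetype").decode("ascii", errors="replace").strip()
        return {
            "is_zip": True,
            "mimetype": mimetype,
            # Assumption: this file hints at encrypted or obfuscated resources
            "has_encryption_xml": "META-INF/encryption.xml" in names,
        }
```

Calling inspect_epub("/home/claude/book.epub") returns a small dict. If is_zip is False, or has_encryption_xml is True, the content files may not be readable after extraction.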

Step-by-Step Workflow

Step 1: Extract the EPUB

# Copy uploaded file to working directory
cp /mnt/user-data/uploads/book.epub /home/claude/book.epub

# Rename to .zip and extract
cp /home/claude/book.epub /home/claude/book.zip
unzip -o /home/claude/book.zip -d /home/claude/book_extracted/

# List the extracted contents
find /home/claude/book_extracted/ -type f | sort
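The same extraction can be done from Python when shelling out is inconvenient. This is a minimal sketch using only the standard library. The helper name extract_epub is hypothetical, and no rename to .zip is needed because zipfile does not look at the extension.

```python
import zipfile
from pathlib import Path


def extract_epub(epub_path, dest_dir):
    """Extract an EPUB into dest_dir and return the sorted relative file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(dest)

    # Equivalent of `find ... -type f | sort`, but relative to dest
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )
```

For example, extract_epub("/home/claude/book.epub", "/home/claude/book_extracted") would return a listing comparable to the find command above.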

Step 2: Find the Navigation File (Highest Priority)

The navigation file is the table of contents — it tells you the book's structure, chapter order, and file layout. Always find and read this first.

# Look for nav files (in priority order)
find /home/claude/book_extracted/ -type f \( \
  -name "nav.xhtml" -o \
  -name "nav.html" -o \
  -name "toc.ncx" -o \
  -name "*nav*" -o \
  -name "*toc*" \
\) | sort

Nav file priority order:

nav.xhtml or nav.html — EPUB3 navigation document (preferred)
toc.ncx — EPUB2 navigation control file (older format)
Any file with "nav" or "toc" in its name

# Read the nav file to understand structure
cat /home/claude/book_extracted/OEBPS/nav.xhtml
# or
cat /home/claude/book_extracted/EPUB/nav.html
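Because the nav file can live in OEBPS/, EPUB/, OPS/, or elsewhere, hard-coding the path is fragile. The sketch below applies the priority order listed above to whatever directory layout the book uses. The name find_nav_file is illustrative, and matching names case-insensitively is an assumption that goes beyond the find command shown earlier.

```python
from pathlib import Path


def find_nav_file(extracted_dir):
    """Return the best navigation file candidate, or None if nothing matches."""

    def rank(path):
        name = path.name.lower()
        if name in ("nav.xhtml", "nav.html"):
            return 0  # EPUB3 navigation document
        if name == "toc.ncx":
            return 1  # EPUB2 navigation control file
        if "nav" in name or "toc" in name:
            return 2  # loose filename match
        return None

    candidates = []
    for path in Path(extracted_dir).rglob("*"):
        if path.is_file():
            score = rank(path)
            if score is not None:
                candidates.append((score, str(path)))

    # Lowest rank wins; ties are broken alphabetically by path
    return min(candidates)[1] if candidates else None
```

A loose match (rank 2) is only a guess and is worth confirming by opening the file.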

Step 3: Find the OPF Package File

The .opf file (Open Packaging Format) contains the book's metadata, the manifest of files in the publication, and the spine that sets the reading order.

# Find the OPF file
find /home/claude/book_extracted/ -name "*.opf" | head -5

# Read it for metadata and spine (reading order)
cat /home/claude/book_extracted/OEBPS/content.opf

The <spine> element in the OPF file defines chapter reading order. The <metadata> block has title, author, language, etc.
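Reading order is recovered by joining two parts of the OPF: each spine itemref carries an idref, and the manifest maps that id to an href. The following sketch shows the lookup with xml.etree. The function name spine_hrefs is hypothetical, and the returned hrefs are relative to the directory that holds the OPF file.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_hrefs(opf_path):
    """Return content hrefs in reading order, as listed in the OPF spine."""
    root = ET.parse(opf_path).getroot()

    # Manifest: id -> href for every file declared in the package
    manifest = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }

    hrefs = []
    for itemref in root.findall("opf:spine/opf:itemref", OPF_NS):
        href = manifest.get(itemref.get("idref"))
        if href:  # skip dangling references
            hrefs.append(href)
    return hrefs
```

This order can differ from an alphabetical listing of the files, for instance when a book's files are named chapter2.xhtml and chapter10.xhtml.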

Step 4: Read Content Files

# Find all HTML/XHTML content files
find /home/claude/book_extracted/ -type f \( -name "*.html" -o -name "*.xhtml" \) | sort

# Read a specific chapter
cat /home/claude/book_extracted/OEBPS/chapter01.xhtml

To extract clean text from HTML content:

from bs4 import BeautifulSoup

with open("/home/claude/book_extracted/OEBPS/chapter01.xhtml", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")
    
# Remove script/style tags
for tag in soup(["script", "style"]):
    tag.decompose()

text = soup.get_text(separator="\n", strip=True)
print(text)

Typical EPUB Directory Structure

book_extracted/
├── mimetype                    ← Must contain "application/epub+zip"
├── META-INF/
│   └── container.xml           ← Points to the OPF file
└── OEBPS/   (or EPUB/, or OPS/)
    ├── content.opf             ← Package manifest + metadata + spine
    ├── nav.xhtml               ← ★ TABLE OF CONTENTS (read this first!)
    ├── toc.ncx                 ← Older TOC format (EPUB2)
    ├── chapter01.xhtml
    ├── chapter02.xhtml
    ├── ...
    ├── images/
    │   └── cover.jpg
    ├── css/
    │   └── styles.css
    └── fonts/

Reading container.xml to find the OPF path

cat /home/claude/book_extracted/META-INF/container.xml

This file points to the root OPF file via the full-path attribute of its <rootfile> element.
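Reading that path programmatically avoids guessing between OEBPS/, EPUB/, and OPS/. Here is a small sketch with xml.etree. The helper name find_opf_path is illustrative, and it returns only the first rootfile entry, which is an assumption for the rare book that declares several.

```python
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(container_xml_path):
    """Return the OPF path (relative to the EPUB root) named in container.xml."""
    root = ET.parse(container_xml_path).getroot()
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        return None
    return rootfile.get("full-path")
```

The result, for example OEBPS/content.opf, is relative to the extracted folder, not to META-INF/.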

Common Tasks

Extract All Text (Full Book)

import os
from bs4 import BeautifulSoup

extracted_dir = "/home/claude/book_extracted/OEBPS"
output_text = []

# Alphabetical sort only approximates reading order; the OPF spine is authoritative
html_files = sorted([
    f for f in os.listdir(extracted_dir)
    if f.endswith((".html", ".xhtml")) and "nav" not in f.lower()
])

for filename in html_files:
    filepath = os.path.join(extracted_dir, filename)
    with open(filepath, "r", encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for tag in soup(["script", "style", "head"]):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)
    output_text.append(f"\n\n--- {filename} ---\n\n{text}")

full_text = "\n".join(output_text)
with open("/mnt/user-data/outputs/book_full_text.txt", "w", encoding="utf-8") as f:
    f.write(full_text)

Extract Metadata

import xml.etree.ElementTree as ET

tree = ET.parse("/home/claude/book_extracted/OEBPS/content.opf")
root = tree.getroot()

# Namespace handling
ns = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc":  "http://purl.org/dc/elements/1.1/"
}

metadata = root.find("opf:metadata", ns)
if metadata is not None:
    title   = metadata.findtext("dc:title",    namespaces=ns)
    author  = metadata.findtext("dc:creator",  namespaces=ns)
    lang    = metadata.findtext("dc:language", namespaces=ns)
    pub     = metadata.findtext("dc:publisher",namespaces=ns)
    date    = metadata.findtext("dc:date",     namespaces=ns)
    print(f"Title:     {title}")
    print(f"Author:    {author}")
    print(f"Language:  {lang}")
    print(f"Publisher: {pub}")
    print(f"Date:      {date}")

Parse Table of Contents from nav.xhtml

from bs4 import BeautifulSoup

with open("/home/claude/book_extracted/OEBPS/nav.xhtml", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Find the nav element with epub:type="toc"
nav = soup.find("nav", attrs={"epub:type": "toc"}) or soup.find("nav")

if nav:
    print("=== Table of Contents ===")
    for a in nav.find_all("a"):
        print(f"  {a.get_text(strip=True)}  →  {a.get('href', '')}")

Parse TOC from toc.ncx (EPUB2)

import xml.etree.ElementTree as ET

tree = ET.parse("/home/claude/book_extracted/OEBPS/toc.ncx")
root = tree.getroot()
ns = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

print("=== Table of Contents (NCX) ===")
for navpoint in root.findall(".//ncx:navPoint", ns):
    label = navpoint.findtext("ncx:navLabel/ncx:text", namespaces=ns)
    src   = navpoint.find("ncx:content", ns)
    href  = src.get("src") if src is not None else ""
    print(f"  {label}  →  {href}")
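The loop above lists every navPoint in document order but flattens nested chapters and sections into one level. If the hierarchy matters, a recursive walk can keep the depth. This is a sketch only, and the names walk_navpoints and ncx_outline are hypothetical.

```python
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def walk_navpoints(parent, depth=0):
    """Yield (depth, label, src) for each navPoint under parent, depth-first."""
    for navpoint in parent.findall("ncx:navPoint", NCX_NS):
        label = navpoint.findtext(
            "ncx:navLabel/ncx:text", default="", namespaces=NCX_NS
        )
        content = navpoint.find("ncx:content", NCX_NS)
        src = content.get("src", "") if content is not None else ""
        yield depth, label.strip(), src
        # Child navPoints are nested directly inside their parent navPoint
        yield from walk_navpoints(navpoint, depth + 1)


def ncx_outline(ncx_path):
    """Return the whole NCX table of contents as a list of (depth, label, src)."""
    root = ET.parse(ncx_path).getroot()
    nav_map = root.find("ncx:navMap", NCX_NS)
    return list(walk_navpoints(nav_map)) if nav_map is not None else []
```

Each entry can then be printed with indentation proportional to its depth.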

Extract Cover Image

# Find the cover image
find /home/claude/book_extracted/ -type f \( \
  -name "cover*" -o -name "*cover*" \
\) | grep -iE "\.(jpg|jpeg|png|gif|webp)$"

import shutil

# Copy cover to output
shutil.copy(
    "/home/claude/book_extracted/OEBPS/images/cover.jpg",
    "/mnt/user-data/outputs/cover.jpg"
)
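Filename matching misses covers that are not named "cover". As an alternative, the OPF usually identifies the cover explicitly. The sketch below assumes one of two common conventions: an EPUB3 manifest item with properties="cover-image", or an EPUB2 meta element with name="cover" whose content is a manifest id. The function name find_cover_href is illustrative.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def find_cover_href(opf_path):
    """Return the cover image href (relative to the OPF), or None if not declared."""
    root = ET.parse(opf_path).getroot()
    items = root.findall("opf:manifest/opf:item", OPF_NS)

    # EPUB3 convention: the manifest item is flagged directly
    for item in items:
        if "cover-image" in (item.get("properties") or "").split():
            return item.get("href")

    # EPUB2 convention: <meta name="cover" content="ID"/> points at a manifest id
    for meta in root.findall("opf:metadata/opf:meta", OPF_NS):
        if meta.get("name") == "cover":
            cover_id = meta.get("content")
            for item in items:
                if item.get("id") == cover_id:
                    return item.get("href")
    return None
```

When this returns None, falling back to the filename search above is still reasonable.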

Repack a Modified EPUB

If you've edited files inside the extracted folder and want to repack:

cd /home/claude/book_extracted/

# mimetype MUST be first and uncompressed
zip -0 -X /home/claude/modified_book.epub mimetype

# Add everything else
zip -r /home/claude/modified_book.epub . --exclude mimetype

# Copy to output
cp /home/claude/modified_book.epub /mnt/user-data/outputs/modified_book.epub
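The same repacking rule (mimetype first and stored without compression) can be followed from Python. This is a sketch with the standard zipfile module, not a validated EPUB writer. The name repack_epub is hypothetical, and output_path is assumed to be outside extracted_dir so the archive does not try to include itself.

```python
import os
import zipfile


def repack_epub(extracted_dir, output_path):
    """Zip extracted_dir into output_path with mimetype first and uncompressed."""
    with zipfile.ZipFile(output_path, "w") as zf:
        # First entry: mimetype, stored (no compression)
        zf.write(
            os.path.join(extracted_dir, "mimetype"),
            "mimetype",
            compress_type=zipfile.ZIP_STORED,
        )
        # Everything else, compressed
        for folder, _dirs, files in os.walk(extracted_dir):
            for name in sorted(files):
                full = os.path.join(folder, name)
                arcname = os.path.relpath(full, extracted_dir).replace(os.sep, "/")
                if arcname == "mimetype":
                    continue
                zf.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

Opening output_path in write mode replaces any existing file, so a stale archive from an earlier run is not merged into the new one.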

Quick Reference

| Goal | File to Read | Tool |
|------|--------------|------|
| Understand structure | `META-INF/container.xml` → OPF path | `cat` / xml.etree |
| Table of contents | `nav.xhtml` or `nav.html` (EPUB3) | BeautifulSoup |
| Table of contents (old) | `toc.ncx` (EPUB2) | xml.etree |
| Book metadata | `*.opf` `<metadata>` block | xml.etree |
| Reading order | `*.opf` `<spine>` block | xml.etree |
| Chapter text | `.xhtml` / `.html` in OEBPS/ | BeautifulSoup |
| Cover image | `images/cover.*` or OPF | shutil.copy |

Required Python Packages

pip install beautifulsoup4 lxml --break-system-packages

unzip is available by default on the system. No special EPUB library is needed, and lxml is optional: the examples in this guide use Python's built-in html.parser.

Troubleshooting

"No nav file found": Try find . -type f \( -name "*.xhtml" -o -name "*.html" \) | xargs grep -l "epub:type" 2>/dev/null to locate the navigation doc.

Encoding errors: Open HTML/XML files from EPUBs with encoding="utf-8", errors="ignore" so that a stray invalid byte does not abort the read. Note that errors="ignore" silently drops undecodable bytes; errors="replace" keeps a visible marker instead.

Namespace issues in XML: EPUB uses multiple XML namespaces. When using xml.etree, always pass the ns dict to find, findall, and findtext, or use {namespace_uri}tagname syntax directly.

Unusual directory layout: Check META-INF/container.xml first; it always provides the canonical path to the root OPF file, regardless of directory naming conventions.
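To illustrate the namespace point above, the short example below runs the same lookup three ways on an inline sample document. The sample XML and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

OPF = "http://www.idpf.org/2007/opf"
DC = "http://purl.org/dc/elements/1.1/"

sample = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Sample Book</dc:title>
  </metadata>
</package>"""

root = ET.fromstring(sample)

# 1. Bare tag name: finds nothing, because the element lives in the OPF namespace
without_ns = root.find("metadata")

# 2. Prefix plus namespace dict (the prefixes are local to this dict)
with_prefix = root.findtext(
    "opf:metadata/dc:title", namespaces={"opf": OPF, "dc": DC}
)

# 3. Fully qualified {namespace_uri}tagname syntax, no dict needed
with_braces = root.findtext(f"{{{OPF}}}metadata/{{{DC}}}title")

print(without_ns)   # None
print(with_prefix)  # Sample Book
print(with_braces)  # Sample Book
```

A lookup that unexpectedly returns None is very often the first case: a missing namespace rather than a missing element.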

Version History

1 version in total

v1.0.0 (current)

2026-03-30 10:12
