An .epub file is simply a ZIP archive with a specific internal structure. The most reliable way to process any epub is:
1. Rename .epub → .zip
2. Extract the archive
3. Find and read the navigation file (nav.xhtml, nav.html, or toc.ncx)

This approach works for any well-formed EPUB and requires no special epub libraries.
# Copy uploaded file to working directory
cp /mnt/user-data/uploads/book.epub /home/claude/book.epub
# Copy to a .zip name and extract (unzip also reads the .epub directly)
cp /home/claude/book.epub /home/claude/book.zip
unzip -o /home/claude/book.zip -d /home/claude/book_extracted/
# List the extracted contents
find /home/claude/book_extracted/ -type f | sort
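The same extraction can be done from Python with the standard-library zipfile module, which reads the .epub directly, so the .zip copy is optional there. This is a minimal sketch: the function name extract_epub is an illustrative assumption, and the paths in the usage comment simply mirror the shell commands above.

```python
import zipfile
from pathlib import Path


def extract_epub(epub_path, dest_dir):
    """Extract an EPUB (a ZIP archive) into dest_dir and return the sorted member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())


# Example call (placeholder paths taken from the shell commands above):
# for name in extract_epub("/home/claude/book.epub", "/home/claude/book_extracted"):
#     print(name)
```

Returning the member list gives the same overview as the find command without a second pass over the directory.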
The navigation file is the table of contents — it tells you the book's structure, chapter order, and file layout. Always find and read this first.
# Look for nav files (in priority order)
find /home/claude/book_extracted/ -type f \( \
-name "nav.xhtml" -o \
-name "nav.html" -o \
-name "toc.ncx" -o \
-name "*nav*" -o \
-name "*toc*" \
\) | sort
Nav file priority order:
1. nav.xhtml or nav.html: EPUB3 navigation document (preferred)
2. toc.ncx: EPUB2 navigation control file (older format)

# Read the nav file to understand structure
cat /home/claude/book_extracted/OEBPS/nav.xhtml
# or
cat /home/claude/book_extracted/EPUB/nav.html
The .opf file (Open Packaging Format) contains metadata and the full reading order manifest.
# Find the OPF file
find /home/claude/book_extracted/ -name "*.opf" | head -5
# Read it for metadata and spine (reading order)
cat /home/claude/book_extracted/OEBPS/content.opf
The `<spine>` element in the OPF file defines the chapter reading order. The `<metadata>` block holds the title, author, language, and similar fields.
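To make the spine concrete, the sketch below resolves each spine itemref against the manifest and returns the content files in reading order. The function name spine_hrefs and its opf_dir parameter are illustrative assumptions. Percent-encoded hrefs are not decoded here.

```python
import posixpath
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_hrefs(opf_xml, opf_dir=""):
    """Return content-file paths in spine (reading) order.

    opf_xml is the OPF document as text or bytes; opf_dir is the folder that
    holds the OPF, because manifest hrefs are relative to it.
    """
    root = ET.fromstring(opf_xml)
    # Manifest: map each item id to its href
    id_to_href = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    # Spine: ordered itemrefs that point at manifest ids
    ordered = []
    for itemref in root.findall("opf:spine/opf:itemref", OPF_NS):
        href = id_to_href.get(itemref.get("idref"))
        if href:  # skip itemrefs that name a missing manifest item
            ordered.append(posixpath.join(opf_dir, href))
    return ordered


# Example call (placeholder path):
# with open("/home/claude/book_extracted/OEBPS/content.opf", "rb") as f:
#     print(spine_hrefs(f.read(), "OEBPS"))
```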
# Find all HTML/XHTML content files
find /home/claude/book_extracted/ -type f \( -name "*.html" -o -name "*.xhtml" \) | sort
# Read a specific chapter
cat /home/claude/book_extracted/OEBPS/chapter01.xhtml
To extract clean text from HTML content:
from bs4 import BeautifulSoup
with open("/home/claude/book_extracted/OEBPS/chapter01.xhtml", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Remove script/style tags
for tag in soup(["script", "style"]):
    tag.decompose()
text = soup.get_text(separator="\n", strip=True)
print(text)
book_extracted/
├── mimetype                 ← Must contain "application/epub+zip"
├── META-INF/
│   └── container.xml        ← Points to the OPF file
└── OEBPS/  (or EPUB/, or OPS/)
    ├── content.opf          ← Package manifest + metadata + spine
    ├── nav.xhtml            ← ★ TABLE OF CONTENTS (read this first!)
    ├── toc.ncx              ← Older TOC format (EPUB2)
    ├── chapter01.xhtml
    ├── chapter02.xhtml
    ├── ...
    ├── images/
    │   └── cover.jpg
    ├── css/
    │   └── styles.css
    └── fonts/
cat /home/claude/book_extracted/META-INF/container.xml
This file always points to the root OPF file via the `full-path` attribute of its `<rootfile>` element.
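Rather than guessing the directory name, the OPF path can be read from container.xml programmatically. A minimal sketch, assuming the standard OCF container namespace; the function name opf_path_from_container is hypothetical.

```python
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def opf_path_from_container(container_xml):
    """Return the OPF path (relative to the EPUB root) named in container.xml, or None."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    return rootfile.get("full-path") if rootfile is not None else None


# Example call (placeholder path):
# with open("/home/claude/book_extracted/META-INF/container.xml", "rb") as f:
#     print(opf_path_from_container(f.read()))
```

The returned path is relative to the extraction root, so it should be joined with the extracted directory before opening the OPF.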
import os
from bs4 import BeautifulSoup

extracted_dir = "/home/claude/book_extracted/OEBPS"
output_text = []

# Collect content files in filename order (an approximation of the OPF spine order)
html_files = sorted([
    f for f in os.listdir(extracted_dir)
    if f.endswith((".html", ".xhtml")) and "nav" not in f.lower()
])

for filename in html_files:
    filepath = os.path.join(extracted_dir, filename)
    with open(filepath, "r", encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # Drop non-content elements before extracting text
    for tag in soup(["script", "style", "head"]):
        tag.decompose()
    text = soup.get_text(separator="\n", strip=True)
    output_text.append(f"\n\n--- {filename} ---\n\n{text}")

full_text = "\n".join(output_text)
with open("/mnt/user-data/outputs/book_full_text.txt", "w", encoding="utf-8") as f:
    f.write(full_text)
import xml.etree.ElementTree as ET
tree = ET.parse("/home/claude/book_extracted/OEBPS/content.opf")
root = tree.getroot()
# Namespace handling
ns = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

metadata = root.find("opf:metadata", ns)
if metadata is not None:
    # findtext returns None for any field the OPF does not declare
    title = metadata.findtext("dc:title", namespaces=ns)
    author = metadata.findtext("dc:creator", namespaces=ns)
    lang = metadata.findtext("dc:language", namespaces=ns)
    pub = metadata.findtext("dc:publisher", namespaces=ns)
    date = metadata.findtext("dc:date", namespaces=ns)
    print(f"Title: {title}")
    print(f"Author: {author}")
    print(f"Language: {lang}")
    print(f"Publisher: {pub}")
    print(f"Date: {date}")
from bs4 import BeautifulSoup
with open("/home/claude/book_extracted/OEBPS/nav.xhtml", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# Find the nav element with epub:type="toc"
nav = soup.find("nav", attrs={"epub:type": "toc"}) or soup.find("nav")
if nav:
    print("=== Table of Contents ===")
    for a in nav.find_all("a"):
        print(f" {a.get_text(strip=True)} → {a.get('href', '')}")
import xml.etree.ElementTree as ET
tree = ET.parse("/home/claude/book_extracted/OEBPS/toc.ncx")
root = tree.getroot()
ns = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}
print("=== Table of Contents (NCX) ===")
for navpoint in root.findall(".//ncx:navPoint", ns):
    label = navpoint.findtext("ncx:navLabel/ncx:text", namespaces=ns)
    src = navpoint.find("ncx:content", ns)
    href = src.get("src") if src is not None else ""
    print(f" {label} → {href}")
# Find the cover image
find /home/claude/book_extracted/ -type f \( \
-name "cover*" -o -name "*cover*" \
\) | grep -iE "\.(jpg|jpeg|png|gif|webp)$"
import shutil
# Copy cover to output
shutil.copy(
    "/home/claude/book_extracted/OEBPS/images/cover.jpg",
    "/mnt/user-data/outputs/cover.jpg",
)
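When no file is named cover.*, the OPF usually identifies the cover instead. The sketch below checks the EPUB3 cover-image manifest property first, then the common EPUB2 convention of a meta element named cover. The function name cover_href is an illustrative assumption, and the returned href is relative to the OPF file's folder.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def cover_href(opf_xml):
    """Return the cover image href declared in the OPF, or None if none is declared."""
    root = ET.fromstring(opf_xml)
    items = root.findall("opf:manifest/opf:item", OPF_NS)
    # EPUB3: a manifest item whose properties include "cover-image"
    for item in items:
        if "cover-image" in (item.get("properties") or "").split():
            return item.get("href")
    # EPUB2 convention: <meta name="cover" content="ID"/> names a manifest id
    for meta in root.findall("opf:metadata/opf:meta", OPF_NS):
        if meta.get("name") == "cover":
            for item in items:
                if item.get("id") == meta.get("content"):
                    return item.get("href")
    return None
```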
If you've edited files inside the extracted folder and want to repack:
cd /home/claude/book_extracted/
# mimetype MUST be first and uncompressed
zip -0 -X /home/claude/modified_book.epub mimetype
# Add everything else
zip -r /home/claude/modified_book.epub . --exclude mimetype
# Copy to output
cp /home/claude/modified_book.epub /mnt/user-data/outputs/modified_book.epub
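The same repacking rule (mimetype first and uncompressed) can be followed from Python with zipfile. This is a sketch rather than a validated packager: the function name repack_epub is hypothetical, and epub_path is assumed to lie outside src_dir so the archive does not include itself.

```python
import os
import zipfile


def repack_epub(src_dir, epub_path):
    """Zip src_dir into epub_path with mimetype as the first, uncompressed entry."""
    with zipfile.ZipFile(epub_path, "w") as zf:
        # mimetype must be the first entry and stored without compression
        zf.write(os.path.join(src_dir, "mimetype"), "mimetype",
                 compress_type=zipfile.ZIP_STORED)
        for folder, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full = os.path.join(folder, name)
                # Archive names use forward slashes relative to the EPUB root
                arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
                if arcname != "mimetype":
                    zf.write(full, arcname, compress_type=zipfile.ZIP_DEFLATED)


# Example call (placeholder paths from the shell commands above):
# repack_epub("/home/claude/book_extracted", "/home/claude/modified_book.epub")
```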
| Goal | File to Read | Tool |
|---|---|---|
| Understand structure | META-INF/container.xml → OPF path | cat / xml.etree |
| Table of contents | nav.xhtml or nav.html (EPUB3) | BeautifulSoup |
| Table of contents (old) | toc.ncx (EPUB2) | xml.etree |
| Book metadata | *.opf `<metadata>` block | xml.etree |
| Reading order | *.opf `<spine>` block | xml.etree |
| Chapter text | .xhtml / .html in OEBPS/ | BeautifulSoup |
| Cover image | images/cover.* or OPF | shutil.copy |
pip install beautifulsoup4 lxml --break-system-packages
unzip is available by default on the system. No special epub library is needed.
"No nav file found": Try `find . \( -name "*.xhtml" -o -name "*.html" \) | xargs grep -l "epub:type" 2>/dev/null` to locate the navigation doc.
Encoding errors — Always use encoding="utf-8", errors="ignore" when opening HTML/XML files from epubs.
Namespace issues in XML — EPUB uses multiple XML namespaces. When using xml.etree, always pass the ns dict to find/findall, or use {namespace_uri}tagname syntax directly.
Unusual directory layout — Check META-INF/container.xml first; it always provides the canonical path to the root OPF file, regardless of directory naming conventions.
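When filename searches fail, the OPF manifest itself can identify the navigation file: EPUB3 marks it with the nav property, and the NCX is recognizable by its media type. A minimal sketch along those lines, with nav_href as a hypothetical helper name; the returned href is relative to the OPF file's folder.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def nav_href(opf_xml):
    """Return the href of the EPUB3 navigation document, else the NCX, else None."""
    root = ET.fromstring(opf_xml)
    items = root.findall("opf:manifest/opf:item", OPF_NS)
    # EPUB3: the navigation document carries properties="nav"
    for item in items:
        if "nav" in (item.get("properties") or "").split():
            return item.get("href")
    # EPUB2 fallback: the NCX is identified by its media type
    for item in items:
        if item.get("media-type") == "application/x-dtbncx+xml":
            return item.get("href")
    return None
```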