Store, manipulate and retrieve content of docx files, retaining formatting: answers

【Question title】: Store, manipulate and retrieve content of docx files, retaining formatting
【Posted】: 2022-01-24 17:38:18
【Question】:

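I need a way to retrieve the contents of docx files (text, images, formatting), store them, and then generate a new docx in which the contents of several files are spliced together.

My current approach is to extract the <body> from the underlying document.xml, store it in a Pandas DataFrame, and use the DataFrame's data to modify the contents of a template docx before generating the new docx.

Storing the file bodies in a Pandas DataFrame seems easy enough: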

import os
import re
import zipfile

import pandas as pd
from lxml import etree

def get_word_xml(docx_filename):
    # A docx is a zip archive; the main text lives in word/document.xml
    with zipfile.ZipFile(docx_filename) as docx:
        return docx.read('word/document.xml')

def get_xml_tree(xml_string):
    return etree.fromstring(xml_string)

rows = []
for root, dirs, files in os.walk("./docs", topdown=False):
    for name in files:
        wordxml = get_xml_tree(get_word_xml(os.path.join(root, name)))
        # encoding='unicode' returns str rather than bytes, so the regex sees real XML text
        wordxml = etree.tostring(wordxml, pretty_print=True, encoding='unicode')
        # re.DOTALL lets '.' match the newlines added by pretty_print
        body = re.search(r"(?<=<w:body>)(.*)(?=</w:body>)", wordxml, re.DOTALL).group(1)
        rows.append({'Name': name.split('.')[0], 'Text': body})

# DataFrame.append was removed in pandas 2.0; build the frame from a list of rows instead
df = pd.DataFrame(rows, columns=['Name', 'Text'])

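However, the actual problem I am facing is that generating the docx file produces a corrupted file. I tried opening a file, extracting the contents (without even manipulating the data at this point) and generating a new file with the same contents (basically a copy):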

with open('Test.docx', 'rb') as f:
      zip = zipfile.ZipFile(f)
      xml_content = zip.read('word/document.xml')
      tmp_dir = tempfile.mkdtemp()
      zip.extractall(tmp_dir)

etree.fromstring(xml_content)

with open(os.path.join(tmp_dir,'word/document.xml'), 'w') as f:
    xmlstr = str(xml_content)
    f.write(str(xmlstr))

    filenames = zip.namelist()
    zip_copy_filename = 'output.docx'
    with zipfile.ZipFile(zip_copy_filename, "w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmp_dir,filename), filename)

    
shutil.rmtree(tmp_dir)

I am not even sure this is the right way to accomplish the task, but I used this as a reference.

【Comments】:

Tags: python python-3.x docx python-docx

【Solution 1】:

Your code has several problems:

etree.fromstring(xml_content)

This does not assign the XML Element created from xml_content to anything.
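To make the point concrete, here is a minimal sketch of keeping the return value and then using it. The inline XML is a tiny stand-in for a real word/document.xml, and the helper name `get_body` is hypothetical; only the WordprocessingML namespace URI is fixed by the file format.

```python
import xml.etree.ElementTree as etree

# Namespace URI that WordprocessingML binds to the "w:" prefix
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Tiny stand-in for the bytes returned by zip.read('word/document.xml')
xml_content = (
    b'<w:document xmlns:w="' + W_NS.encode() + b'">'
    b'<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>'
    b'</w:document>'
)

def get_body(xml_bytes):
    """Parse the XML and return the <w:body> Element (hypothetical helper)."""
    root = etree.fromstring(xml_bytes)  # keep the Element that fromstring returns
    return root.find("w:body", {"w": W_NS})

body = get_body(xml_content)
# Collect the text of every <w:t> run below the body
texts = [t.text for t in body.iter("{%s}t" % W_NS)]
print(texts)  # prints ['Hello']
```

Working on the Element rather than on a regex match keeps the content as a tree, which is what tostring() expects when the document is written back.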

xmlstr = str(xml_content)
f.write(str(xmlstr))

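First, you have an extra str conversion. Because xml_content is a bytes object, str(xml_content) produces its repr (text beginning with b'), which is not valid XML, and that is what ends up in the file. Second, the correct way to convert XML back to a string is the etree tostring() method.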
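The effect of the str call is easy to see in isolation. The sketch below uses a short made-up byte string in place of the real document.xml; it shows that str() on bytes yields the repr, while decode() and etree.tostring() both give real XML text.

```python
import xml.etree.ElementTree as etree

# Stand-in for zip.read('word/document.xml'), which returns bytes
xml_content = b"<doc><p>caf\xc3\xa9</p></doc>"

# str() on bytes gives the repr, including the b'' wrapper and escape sequences
broken = str(xml_content)
print(broken.startswith("b'"))  # prints True, so this is not valid XML

# Decoding gives the original text back unchanged
decoded = xml_content.decode("utf-8")

# Parsing and serializing with tostring() also yields valid XML text
root = etree.fromstring(xml_content)
serialized = etree.tostring(root, encoding="unicode")
print(serialized)
```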

Try the following code. On my (Linux) system, the resulting output.docx opens in LibreOffice Writer without problems. (By the way, next time please include complete code, including imports.)

#! /usr/bin/python3

import zipfile
import tempfile
import xml.etree.ElementTree as etree
import os.path
import shutil

# Unpack the docx (a zip archive) into a temporary directory
tmp_dir = tempfile.mkdtemp()
with zipfile.ZipFile('Test.docx') as zip:
    xml_content = zip.read('word/document.xml')
    filenames = zip.namelist()
    zip.extractall(tmp_dir)

# Keep the parsed Element so it can be modified and serialized again
xml = etree.fromstring(xml_content)

# Serialize to UTF-8 bytes and write in binary mode, so the result does not depend on the locale
with open(os.path.join(tmp_dir, 'word/document.xml'), 'wb') as f:
    f.write(etree.tostring(xml, encoding='UTF-8', xml_declaration=True))

# Re-zip every member under its original archive name
zip_copy_filename = 'output.docx'
with zipfile.ZipFile(zip_copy_filename, "w") as docx:
    for filename in filenames:
        docx.write(os.path.join(tmp_dir, filename), filename)

shutil.rmtree(tmp_dir)
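If the temporary directory is not needed, the same round trip can be done entirely through zipfile. The following is only a sketch under some assumptions: the function name `rewrite_document_xml` and its `transform` callback are made up for illustration, and the call to `etree.register_namespace` is an addition that is not part of the answer above. It is included because ElementTree otherwise renames prefixes such as w: to ns0: when serializing.

```python
import xml.etree.ElementTree as etree
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Keep the conventional "w" prefix on output (otherwise ElementTree emits ns0)
etree.register_namespace("w", W_NS)

def rewrite_document_xml(src_path, dst_path, transform=None):
    """Copy a docx, re-serializing word/document.xml (hypothetical helper).

    transform, if given, receives the root Element and may modify it in place.
    """
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                root = etree.fromstring(data)
                if transform is not None:
                    transform(root)
                # Serialize to UTF-8 bytes with an XML declaration
                data = etree.tostring(root, encoding="UTF-8", xml_declaration=True)
            # Every other part (styles, images, relationships) is copied byte for byte
            dst.writestr(item.filename, data)
```

A plain copy would then be `rewrite_document_xml('Test.docx', 'output.docx')`. A real document.xml usually declares more namespaces than w:, so each prefix that has to survive would need its own register_namespace call, and whether Word accepts the result should be checked against real files.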

【Comments】: