在 docx 中替换文本并使用 python-docx 保存更改的文件答案

【问题标题】：Text-Replace in docx and save the changed file with python-docx在 docx 中替换文本并使用 python-docx 保存更改的文件
【发布时间】：2013-07-25 06:03:19
【问题描述】：

我正在尝试使用python-docx module 替换文件中的单词并保存新文件，但需要注意的是新文件的格式必须与旧文件完全相同，但替换了单词。我该怎么做？

docx 模块有一个 savedocx，它接受 7 个输入：

文档
核心道具
应用道具
内容类型
网络设置
文字关系
输出

如何使原始文件中的所有内容保持不变，除了替换的单词？

【问题讨论】：

你试过什么？您还可以获得更多帮助，提供指向模块的链接。
github.com/mikemaccana/python-docx
几乎所有这些参数都有很好的默认值。除了outpout 和document，你几乎可以忽略所有东西。

标签： python ms-word docx python-docx

【解决方案1】：

这对我有用：

def docx_replace(old_file,new_file,rep):
    zin = zipfile.ZipFile (old_file, 'r')
    zout = zipfile.ZipFile (new_file, 'w')
    for item in zin.infolist():
        buffer = zin.read(item.filename)
        if (item.filename == 'word/document.xml'):
            res = buffer.decode("utf-8")
            for r in rep:
                res = res.replace(r,rep[r])
            buffer = res.encode("utf-8")
        zout.writestr(item, buffer)
    zout.close()
    zin.close()

【讨论】：

哇，这太棒了。我一直在努力寻找这样的功能。谢谢！

【解决方案2】：

看起来，Python 的 Docx 并不意味着存储包含图像、标题等的完整 Docx，而仅包含文档的内部内容。所以没有简单的方法可以做到这一点。

但是，您可以这样做：

首先，看看docx tag wiki：

它解释了如何解压缩 docx 文件：这是典型文件的样子：

+--docProps
|  +  app.xml
|  \  core.xml
+  res.log
+--word //this folder contains most of the files that control the content of the document
|  +  document.xml //Is the actual content of the document
|  +  endnotes.xml
|  +  fontTable.xml
|  +  footer1.xml //Containst the elements in the footer of the document
|  +  footnotes.xml
|  +--media //This folder contains all images embedded in the word
|  |  \  image1.jpeg
|  +  settings.xml
|  +  styles.xml
|  +  stylesWithEffects.xml
|  +--theme
|  |  \  theme1.xml
|  +  webSettings.xml
|  \--_rels
|     \  document.xml.rels //this document tells word where the images are situated
+  [Content_Types].xml
\--_rels
   \  .rels

Docx 只获取文档的一部分，在 opendocx

方法中

def opendocx(file):
    '''Open a docx file, return a document XML tree'''
    mydoc = zipfile.ZipFile(file)
    xmlcontent = mydoc.read('word/document.xml')
    document = etree.fromstring(xmlcontent)
    return document

它只获取 document.xml 文件。

我建议你做的是：

用**opendocx*获取文档内容
用 advReplace 方法替换 document.xml
以 zip 格式打开 docx，并将 document.xml 内容替换为新的 xml 内容。
关闭并输出压缩文件（重命名为 output.docx）

如果您安装了 node.js，请注意我已经在 DocxGenJS 上工作，它是 docx 文档的模板引擎，该库正在积极开发中，将很快作为一个节点模块发布。

【讨论】：

谢谢，我再次搜索了如何执行您概述的此过程，并发现了另一个问题，该问题显示了如何执行此操作：stackoverflow.com/questions/16867594/…

【解决方案3】：

你在使用here的docx模块吗？

如果是，那么 docx 模块已经公开了诸如 replace、advReplace 等可以帮助您完成任务的方法。有关公开方法的更多详细信息，请参阅source code。

【讨论】：

是的，我正在使用那里的模块。我使用替换来进行更改，但我遇到的问题是我不知道如何将更改保存到文件中。这就是我所拥有的： 1)document=opendocx('sample.docx') 2)replace(document, "todo", "tada") 3) 那么我应该怎么做才能保存对文档示例的更改。 docx 同时保持文档的原始格式？

【解决方案4】：

from docx import Document
file_path = 'C:/tmp.docx'
document = Document(file_path)

def docx_replace(doc_obj, data: dict):
    """example: data=dict(order_id=123), result: {order_id} -> 123"""
    for paragraph in doc_obj.paragraphs:
        for key, val in data.items():
            key_name = '{{{}}}'.format(key)
            if key_name in paragraph.text:
                paragraph.text = paragraph.text.replace(key_name, str(val))
    for table in doc_obj.tables:
        for row in table.rows:
            for cell in row.cells:
                docx_replace(cell, data)

docx_replace(document, dict(order_id=123, year=2018, payer_fio='payer_fio', payer_fio1='payer_fio1'))
document.save(file_path)

【讨论】：

看起来不需要key_name。只需if key in paragraph.text: 即可。但格式似乎会丢失。

【解决方案5】：

我已经派生了一个 python-docx here 的 repo，它保留了 docx 文件中的所有预先存在的数据，包括格式。希望这就是您正在寻找的。p>

【讨论】：

【解决方案6】：

除了@ramil，您必须先转义一些字符，然后才能将它们作为字符串值放入 XML，所以这对我有用：

def escape(escapee):
  escapee = escapee.replace("&", "&amp;")
  escapee = escapee.replace("<", "&lt;")
  escapee = escapee.replace(">", "&gt;")
  escapee = escapee.replace("\"", "&quot;")
  escapee = escapee.replace("'", "&apos;")
return escapee

【讨论】：

【解决方案7】：

上述方法的问题是它们丢失了现有的格式。请参阅我的answer，它执行替换并保留格式。

还有python-docx-template 允许在 docx 模板中使用 jinja2 样式模板。这是documentation的链接

【讨论】：

谢谢，这对我来说完美无缺，文档相当复杂。

【解决方案8】：

我们可以使用 python-docx 将图像保存在 docx 上。 docx 将图像检测为段落。但是对于这一段，文本是空的。所以你可以这样使用。 paragraphs = document.paragraphs for paragraph in paragraphs: if paragraph.text == '': continue

【讨论】：