How to approach find+replace with RST files using Python or Grep: answers

[Question title]: How to approach find+replace with RST files using Python or Grep
[Posted]: 2019-05-29 14:02:24
[Question description]:

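I'm trying to automate a find+replace for a series of broken image links in .rst files. I have a csv file in which column A holds the "old" link (as it appears in the .rst files) and column B holds the new replacement link for that row.

I can't convert to HTML with pandoc first, because that "breaks" the rst file. I have already done this for a set of HTML files using BeautifulSoup and regular expressions, but that parser doesn't work on my rst files.

A colleague suggested trying Grep, but I can't figure out how to bring in the csv file to do the "match" and the swap.

For the html files, the code below loops through each file, searches for img tags, and replaces the links using the csv file as a dict: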

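import csv
from bs4 import BeautifulSoup

# image_csv (path to the csv) and html (list of HTML file paths) are defined elsewhere
graph_main_nodes, graph_child_nodes = [], []
orphan_links, replaced_links = [], 0

with open(image_csv, newline='') as f:
    reader = csv.reader(f)
    next(reader, None)  # Ignore the header row
    for row in reader:
        graph_main_nodes.append(row[0])
        graph_child_nodes.append(row[1:])
graph = dict(zip(graph_main_nodes, graph_child_nodes))  # Dict with keys in correct location, vals in old locations

# Invert the dict so that each old location maps to its correct location
graph = dict((v, k) for k in graph for v in graph[k])

for fixfile in html:
    try:
        with open(fixfile, 'r', encoding='utf-8') as f:
            soup = BeautifulSoup(f, 'html.parser')
        for tag in soup.find_all('img'):
            print(tag['src'])
            if tag['src'] in graph:
                tag['src'] = graph[tag['src']]
                replaced_links += 1
                print("Match found!")
            else:
                orphan_links.append(tag["src"])
                print("Ignore")
    except OSError as e:  # the posted snippet ends before its except clause; this one is a placeholder
        print(f"Could not process {fixfile}: {e}")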
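
To make the two dict steps above concrete, here is a minimal sketch with made-up rows (the file names are hypothetical, and the header row is assumed to have been skipped already). It shows how a "correct location, then old locations" layout is turned into a lookup keyed by the old link:

```python
# Hypothetical csv rows: first column is the correct location,
# the remaining columns are old locations that should point to it.
rows = [
    ["img/new_a.png", "old/a1.png", "old/a2.png"],
    ["img/new_b.png", "old/b1.png"],
]

# Step 1: correct location -> list of old locations
graph = {row[0]: row[1:] for row in rows}

# Step 2: invert it, so each old location maps to its correct location
old_to_new = {old: new for new, olds in graph.items() for old in olds}

print(old_to_new)
```

After the inversion, a lookup such as old_to_new["old/a2.png"] yields the replacement link directly.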

I'd appreciate some suggestions on how to approach this. I'd love to repurpose my BeautifulSoup code, but I'm not sure that's realistic.

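[Question discussion]:

Do you have to replace all occurrences of the old link with the new link, or only one particular occurrence?
All occurrences. There are about 10k different old links, each used 1-5 times across the whole set of files.

Tags: python restructuredtext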

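[Solution 1]:

This question has information on parsing RST files, but I don't think that's necessary. Your problem boils down to replacing textA with textB. You already have the graph loaded from the csv, so plain string replacement should be enough (credit to this answer):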

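# Read in the file ('fixfile' is a placeholder for the path of the .rst file)
filedata = None
with open('fixfile', 'r', encoding='utf-8') as file:
  filedata = file.read()

# Replace the target strings (graph maps each old link to its new link)
for old, new in graph.items():
  filedata = filedata.replace(old, new)  # str.replace returns a new string, so reassign it

# Write the file out again
with open('fixfile', 'w', encoding='utf-8') as file:
  file.write(filedata)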
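
Putting the pieces together, a minimal sketch of the whole job might look like the following. It assumes the csv layout described in the question (header row, column A is the old link, column B is the new link); the function names, the docs directory, and the image_links.csv file name are hypothetical.

```python
import csv
from pathlib import Path


def load_mapping(csv_path):
    """Return {old_link: new_link} from a two-column csv with a header row."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        return {row[0]: row[1] for row in reader if len(row) >= 2}


def fix_links(text, mapping):
    """Replace every old link in text; return (new_text, replacements_made)."""
    total = 0
    for old, new in mapping.items():
        count = text.count(old)
        if count:
            text = text.replace(old, new)  # str.replace returns a new string
            total += count
    return text, total


def fix_tree(root, mapping):
    """Rewrite every .rst file under root in place; return total replacements."""
    grand_total = 0
    for path in Path(root).rglob('*.rst'):
        original = path.read_text(encoding='utf-8')
        updated, n = fix_links(original, mapping)
        if n:  # only touch files that actually changed
            path.write_text(updated, encoding='utf-8')
        grand_total += n
    return grand_total
```

A call such as fix_tree('docs', load_mapping('image_links.csv')) would then process a whole directory. Since the files are rewritten in place, it is worth running this on a copy or under version control first.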

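This task is also a good fit for sed or perl. The loop below is modeled on this answer, and it also uses this answer to pick a rare delimiter for sed. (After testing, change -n to -i and p to g so that it actually saves the file):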

DELIM=$(printf '\001')  # a control character that is unlikely to appear in a link
# Read the csv line by line, splitting the fields on ","
while IFS=, read -r PATTERN REPLACEMENT
do
   # PATTERN is treated as a regular expression, so "." matches any character
   sed -n "s${DELIM}${PATTERN}${DELIM}${REPLACEMENT}${DELIM}p" fixfile.rst
done < csvFile
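
One caveat with the sed loop is that it treats each old link as a regular expression and applies the replacements one after another, so a new link that happens to equal another row's old link could be replaced twice. A possible Python alternative, sketched here under the assumption that the mapping is a plain {old_link: new_link} dict, escapes every link and substitutes them all in a single pass. The name build_replacer is hypothetical.

```python
import re


def build_replacer(mapping):
    """Return a function that replaces all old links in one pass over the text."""
    if not mapping:
        return lambda text: text  # nothing to replace
    # Longest keys first, so a link that is a prefix of another cannot win.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape makes characters such as "." match literally.
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)


replace_links = build_replacer({
    "img/a.png": "img/b.png",
    "img/b.png": "img/c.png",
})
print(replace_links(".. image:: img/a.png"))
```

Because each position in the text is matched only once, img/a.png becomes img/b.png here and is not rewritten again to img/c.png.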

[Discussion]: