Python: Substitute multiple items in file

【Question title】: Python: Substitute multiple items in file
【Posted】: 2020-01-28 23:59:42
【Question】:

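I was given two text files with data. File A has incorrect data and file B has the correct data. Using the Pandas library, I was able to find the mismatches (~17000!). Now I want to modify file A and replace the incorrect fields with the correct ones. For example: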

File A (Incorrect)
Name = PARAMETER_1
Field_1 = a
Field_2 = b
Field_3 = c
Field_4 = WRONG1!

Name = PARAMETER_2
Field_1 = a
Field_2 = b
Field_3 = c
Field_4 = WRONG2!
etc.

should be replaced with:

File A (Correct)
Name = PARAMETER_1
Field_1 = a
Field_2 = b
Field_3 = c
Field_4 = CORRECT1!

Name = PARAMETER_2
Field_1 = a
Field_2 = b
Field_3 = c
Field_4 = CORRECT2!
etc.

The DataFrame looks like:

   Parameter    Wrong    Correct    Match
0  PARAMETER_1  WRONG1!  CORRECT1!  False
1  PARAMETER_2  WRONG2!  CORRECT2!  False
  etc.

I tried using a for loop:

import re

# read file A
with open(file_A_loc, 'r') as f:
    data_text = f.read()

# apply each correction to the running text so the substitutions accumulate
for row in df.itertuples():
    data_text = re.sub(r'(?<=Name = ' + row[1] + r')([\w\W]+?Field_4 = )([\w]+)', r'\g<1>'+row[3], data_text, flags=re.I)

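As you can imagine, this takes a very long time (file A is about 40-50 MB). Any suggestions for speeding this up? Before posting, I searched Stack Overflow and found references to using a dictionary. I tried that approach but got a KeyError: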

import re

def foo(rep_dict, text):

    # Create a regular expression from the dictionary keys
    regex = re.compile('|'.join(rep_dict.keys()), flags=re.I)

    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda x: rep_dict[x.group(0)], text)

rep_dict = {
            r'(?<=Name = ' + 'PARAMETER_1' + r')([\w\W]+?Field_4 = )([\w]+)':r'\g<1>'+'CORRECT1!',
            r'(?<=Name = ' + 'PARAMETER_2' + r')([\w\W]+?Field_4 = )([\w]+)':r'\g<1>'+'CORRECT2!'
           }
bar = foo(rep_dict, data_text)
print(bar)

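P.S. Please forgive any markdown violations.

Update: I tried implementing the approaches here and here. It still takes a long time, though. At least it resolved the KeyError I was getting earlier.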
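
The KeyError in the dictionary attempt comes from the lookup itself: `x.group(0)` is the text that matched, while the dictionary keys are the pattern strings, so the two never coincide. A minimal sketch of one way around this, assuming every record starts with a `Name` line and contains a `Field_4` line before the next `Name`: key the dictionary by parameter name and let a single compiled pattern capture that name. The names `corrections`, `RECORD`, and `fix_fields` are illustrative and not from the original post, and the pattern uses `\S+` rather than `[\w]+` so that values ending in `!` are replaced whole.

```python
import re

# Hypothetical mapping: parameter name -> corrected Field_4 value.
# With the DataFrame from the question it could be built as
# dict(zip(df["Parameter"], df["Correct"])).
corrections = {"PARAMETER_1": "CORRECT1!", "PARAMETER_2": "CORRECT2!"}

# One pattern for every record:
#   group 1 = everything up to and including "Field_4 = "
#   group 2 = the parameter name
# The trailing \S+ is the old value that gets replaced.
RECORD = re.compile(r"(Name = (\w+)[\s\S]*?Field_4 = )\S+")

def fix_fields(text, corrections):
    def repl(match):
        name = match.group(2)
        if name in corrections:
            return match.group(1) + corrections[name]
        return match.group(0)  # leave records without a correction untouched
    return RECORD.sub(repl, text)

sample = """Name = PARAMETER_1
Field_1 = a
Field_4 = WRONG1!

Name = PARAMETER_2
Field_1 = a
Field_4 = WRONG2!
"""
fixed = fix_fields(sample, corrections)
print(fixed)
```

Because the lookup happens inside the callback, the text is scanned once regardless of how many corrections there are, instead of once per DataFrame row.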

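【Comments】:

Why do you care about file A at all? It is wrong and file B is right, so why not just use file B?
@JohnGordon File A is later used in other scripts to extract the relevant data. In this case, file B only lists information about Field_4. The wording/format of file A and file B also differs.
You could try it without the lookbehind. Capture the Name line, Field_1 through Field_3, and Field_4 up to just after the equals sign. Then match the last word with \w+, and in the replacement use group 1 followed by the replacement you want. \b(Name = PARAMETER_1.*(?:\r?\nField_[1-3].*)*\r?\nField_4 = )\w+ See regex101.com/r/YRrwpw/1
@Thefourthbird Thanks for the great suggestion!
@DannOfThursday Did the speed improve?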
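
The lookbehind-free pattern suggested in the comments above could be applied from Python roughly as follows. This is a sketch under a few assumptions: the helper name `replace_field_4` is made up for illustration, `\S+` is used in place of the comment's `\w+` so that values such as `WRONG1!` are replaced whole, and the `.*` after the name is tightened to `[ \t]*` so that `PARAMETER_1` does not also match `PARAMETER_10`.

```python
import re

def replace_field_4(text, name, new_value):
    # Pattern from the comment, built for one parameter name.
    # re.escape guards against regex metacharacters in the name.
    pattern = re.compile(
        r"\b(Name = " + re.escape(name) + r"[ \t]*"
        r"(?:\r?\nField_[1-3].*)*"   # skip over Field_1 .. Field_3
        r"\r?\nField_4 = )\S+"       # group 1 ends right after "Field_4 = "
    )
    # A function replacement keeps new_value literal (no backslash or
    # group-reference interpretation).
    return pattern.sub(lambda m: m.group(1) + new_value, text, count=1)

sample = (
    "Name = PARAMETER_1\nField_1 = a\nField_2 = b\nField_3 = c\nField_4 = WRONG1!\n"
    "\n"
    "Name = PARAMETER_10\nField_1 = a\nField_2 = b\nField_3 = c\nField_4 = KEEP\n"
)
result = replace_field_4(sample, "PARAMETER_1", "CORRECT1!")
print(result)
```

Called once per row, this still scans the text once per parameter, so it mainly makes each pass cheaper rather than reducing the number of passes; the thread does not report whether that was fast enough.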

Tags: python regex

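【Solution 1】:

I solved my problem with the following basic algorithm:

1. Use re.findall to capture everything in file A as a list of the form: ['Name = PARAMETER1...Field_4 = WRONG1', 'Name = PARAMETER2...Field_4 = WRONG2', ...]
2. Use Pandas to get the differences between file A and file B.
3. Iterate over the rows with df.itertuples. Use the index from the Pandas DataFrame to apply re.sub to the specific element of the list obtained in step 1.

For my use case, this approach takes about 9-10 seconds to run!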
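
The answer does not include code, so the following is only a sketch of how the steps above might look. It assumes that file A starts with a `Name = ` line, that every record contains a `Field_4` line, and that the DataFrame index of each row equals the position of its record in file A. The `Row` namedtuple is a stand-in for what `df.itertuples()` yields, and the patterns are guesses rather than the author's own.

```python
import re
from collections import namedtuple

# Stand-in for df.itertuples(); Index is assumed to be the record's position in file A.
Row = namedtuple("Row", "Index Parameter Wrong Correct")
rows = [
    Row(0, "PARAMETER_1", "WRONG1!", "CORRECT1!"),
    Row(1, "PARAMETER_2", "WRONG2!", "CORRECT2!"),
]

data_text = (
    "Name = PARAMETER_1\nField_1 = a\nField_4 = WRONG1!\n\n"
    "Name = PARAMETER_2\nField_1 = a\nField_4 = WRONG2!\n"
)

# Step 1: a single pass over the file, one list element per record.
# The lookahead ends each record just before the next line starting
# with "Name = ", or at the end of the text.
records = re.findall(r"^Name = [\s\S]*?(?=^Name = |\Z)", data_text, flags=re.M)

# Step 3: patch only the record each row points at, so every substitution
# scans a handful of lines instead of the whole file.
field_4 = re.compile(r"(Field_4 = )\S+")
for row in rows:
    records[row.Index] = field_4.sub(
        lambda m, value=row.Correct: m.group(1) + value,
        records[row.Index],
        count=1,
    )

new_text = "".join(records)
print(new_text)
```

The speed-up reported in the answer is consistent with this shape: the expensive scan of the full text happens once, and each of the per-row substitutions works on a short string.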
