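Use Python to grep lines from one file out of another file

【Question title】: Use python to grep lines from one file out of another file
【Posted】: 2012-05-08 01:30:34
【Question】:

This is similar to the question about an alternative to "grep" in Python, but the complication here is that what is being grepped for is a set of variables (lines) from another file. I can't figure out how to do this with functions like re.findall().

file1: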

1  20  200
1  30  300

file2:

1  20  200  0.1  0.5
1  20  200  0.3  0.1
1  30  300  0.2  0.6
1  40  400  0.9  0.6
2  50  300  0.5  0.7

Each line in file1 is a pattern that I need to search for in file2. The result should then be:

    1  20  200  0.1  0.5
    1  20  200  0.3  0.1
    1  30  300  0.2  0.6

I have been trying to solve this with bash or Python but can't figure it out. Thanks.

【Question comments】:

Tags: python grep

【Solution 1】:

Here is a non-regex-based solution:

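with open('/tmp/file1') as f:
    # Strip each pattern once; skip blank lines, which would match everything.
    lines1 = [line.strip() for line in f if line.strip()]

with open('/tmp/file2') as f:
    for line in f:
        # Keep the line if it starts with any pattern from file1.
        if any(line.startswith(x) for x in lines1):
            print(line, end='')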

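【Comments】:

Have my +1 for using string methods instead of regex. ;) Improvement: I would make lines1 a set, which speeds up membership testing from O(n) to O(1).
Maybe I'm missing something, but the code never does a membership test on lines1; it only iterates over the contents of lines1?
It tests inside the second loop (on the second-to-last line of the code).
@Li-aungYip Can you explain your comment? I agree with srgerg's comment here.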
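
The exchange above turns on the difference between a prefix test and a membership test. The following is a minimal sketch of the prefix test only, assuming the patterns and lines are already in memory; the name `prefix_filter` and the sample line containing `2000` are hypothetical and not part of the answer. It relies on `str.startswith` accepting a tuple of prefixes, which is still a scan over the patterns, so turning `lines1` into a set would not by itself make this test O(1). A set helps only when an exact key can be extracted from each line, as Solution 4 does with the first three fields.

```python
def prefix_filter(patterns, lines):
    """Return the lines that start with any of the patterns."""
    # Strip once up front and drop blank patterns, which would match every line.
    prefixes = tuple(p.strip() for p in patterns if p.strip())
    # str.startswith accepts a tuple; it still tries the prefixes one by one.
    return [line for line in lines if line.startswith(prefixes)]


patterns = ["1  20  200\n", "1  30  300\n"]
lines = [
    "1  20  200  0.1  0.5\n",
    "1  20  2000  0.3  0.1\n",  # shares the text prefix "1  20  200"
    "1  40  400  0.9  0.6\n",
]
matched = prefix_filter(patterns, lines)
for line in matched:
    print(line, end="")
```

Note that a plain text prefix also accepts the line whose third field is `2000`, because the comparison works on characters rather than on whole fields.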

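【Solution 2】:

You can take advantage of the fact that the | character in a regular expression matches either the pattern on its left or the pattern on its right: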

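import re

with open('file1') as file1:
    # Escape each line so it is matched literally; skip blank lines,
    # because an empty alternative would match every line.
    patterns = "|".join(re.escape(line.rstrip()) for line in file1 if line.strip())

regexp = re.compile(patterns)
with open('file2') as file2:
    for line in file2:
        if regexp.search(line):
            print(line.rstrip())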

When I try this on your sample files, it outputs:

1   20  200 0.1 0.5
1   20  200 0.3 0.1
1   30  300 0.2 0.6

By the way, if you want to solve this in bash, this should do it:

grep -f file1 file2
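
Both the regular expression above and `grep -f` accept a pattern anywhere in a line. If the patterns should only match at the start of a line and end on a field boundary, the alternation can be anchored. This is a sketch under that assumption, not part of the original answer; the function name `build_prefix_regex` and the extra sample lines are hypothetical.

```python
import re


def build_prefix_regex(patterns):
    """Combine literal patterns into one regex anchored at the start of a line."""
    alternatives = [re.escape(p.strip()) for p in patterns if p.strip()]
    if not alternatives:
        return None  # nothing to search for
    # (?:...) groups the alternatives so that ^ applies to all of them;
    # (?=\s|$) requires whitespace or the end of the line right after the match.
    return re.compile(r"^(?:" + "|".join(alternatives) + r")(?=\s|$)")


patterns = ["1  20  200\n", "1  30  300\n"]
lines = [
    "1  20  200  0.1  0.5\n",
    "1  20  2000  0.3  0.1\n",   # third field only begins with 200
    "11  20  200  0.3  0.1\n",   # pattern text is not at the line start
    "1  30  300  0.2  0.6\n",
]
regexp = build_prefix_regex(patterns)
matched = [line for line in lines if regexp.search(line)]
for line in matched:
    print(line, end="")
```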

【Comments】:

【Solution 3】:

I think you need your own loop:

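import re

# Each non-blank line of file1 is compiled as a regular expression.
with open('file1') as f1:
    file1patterns = [re.compile(l.strip()) for l in f1 if l.strip()]

lineToMatch = 0
matchedLines = []
with open('file2') as f2:
    for line in f2:
        # Pattern i must match the i-th line of a consecutive run in file2.
        if file1patterns[lineToMatch].match(line):
            matchedLines.append(line)
            lineToMatch += 1
        else:
            # The run is broken: start again from the first pattern.
            lineToMatch = 0
            matchedLines = []
        if len(matchedLines) == len(file1patterns):
            print(''.join(matchedLines), end='')
            lineToMatch = 0
            matchedLines = []

(This is a sketch rather than a tested program, but hopefully it is enough to get you going.)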

【Comments】:

【Solution 4】:

Step 1: Read all the lines in file1, split them, and add them to a set as tuples. This gives us faster lookups in the next step.

with open('file1', 'r') as f:
    # Blank lines are skipped so that an empty tuple never ends up in the set.
    file1_lines = {tuple(line.split()) for line in f if line.strip()}

Step 2: Filter the lines in file2 that meet your criterion, i.e. whether they start with any of the lines in file1:

with open('file2', 'r') as f2:
    # The built-in filter is lazy in Python 3; the key is the first three fields.
    for line in filter(lambda x: tuple(x.split()[:3]) in file1_lines, f2):
        print(line, end='')
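
The two steps above hard-code a key width of three fields. As a possible generalisation, the width can be derived from the patterns themselves. This is a sketch assuming whitespace-separated fields; the name `filter_by_leading_fields` is hypothetical and not part of the original answer.

```python
def filter_by_leading_fields(pattern_lines, data_lines):
    """Yield data lines whose leading fields equal one of the patterns."""
    # Each pattern becomes a tuple of fields, so lookups are hash-based.
    keys = {tuple(p.split()) for p in pattern_lines if p.strip()}
    # Patterns may differ in field count, so remember every width in use.
    widths = {len(k) for k in keys}
    for line in data_lines:
        fields = line.split()
        if any(tuple(fields[:w]) in keys for w in widths):
            yield line


patterns = ["1  20  200\n", "1  30  300\n"]
data = [
    "1  20  200  0.1  0.5\n",
    "1  20  200  0.3  0.1\n",
    "1  30  300  0.2  0.6\n",
    "1  40  400  0.9  0.6\n",
    "2  50  300  0.5  0.7\n",
]
for line in filter_by_leading_fields(patterns, data):
    print(line, end="")
```

Both arguments can be any iterables of strings, so two open file objects can be passed in place of the lists. Because the comparison is done on split fields, differences in the amount of whitespace between fields are ignored.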

【Comments】: