Read lines in one file and find all strings starting with 4-letter strings listed in another txt file: answers

【Question title】: Read lines in one file and find all strings starting with 4-letter strings listed in another txt file
【Posted】: 2016-05-30 10:35:35
【Question】:

I have 2 txt files (file_a and file_b).

file_a.txt contains a long list of 4-letter combinations (one combination per line):

aaaa
bcsg
aacd
gdee
aadw
hwer
etc.

file_b.txt contains a list of letter combinations of various lengths (some with spaces):

aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
etc.

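I am looking for a python script that would let me do the following:

Read file_a.txt line by line
Take each 4-letter combination (e.g. aaai)
Read file_b.txt and find all letter combinations of any length that start with that 4-letter combination (e.g. aaaibjkes, aaailoiersaaageehikjaaa, aaailoiuwegoiglkjaaaike, etc.)
Print the results of each search to a separate txt file named after the 4-letter combination.

File aaai.txt: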

aaaibjkes 
aaailoiersaaageehikjaaa
aaailoiuwegoiglkjaaake
etc.

File bcsi.txt:

bcspwiopiejowih
bcsiweyoieotpwe
etc.

Sorry, I am a newbie. Could someone please point me in the right direction? So far I only have:

#I presume I will have to use regex at some point
import re

file1 = open('file_a.txt', 'r').readlines()
file2 = open('file_b.txt', 'r').readlines()

#Should I look into findall()?

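【Comments】:

I don't think this question is about combinations. When we talk about combinations, we mean the different ways of forming a string from a set of items. For example, the combinations of a, b, c of length 2 look like ab, bc, ca.
Thanks. So should we call them "strings" instead? All the entries in file_a.txt and file_b.txt?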
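
For reference, the distinction drawn in the first comment can be seen with the standard library's itertools.combinations. This is only a small illustration, not something the question itself needs:

```python
from itertools import combinations

# All length-2 combinations of the letters a, b, c.
# Order inside a pair does not matter, so "ac" and "ca" count as the same combination.
pairs = ["".join(p) for p in combinations("abc", 2)]
print(pairs)  # ['ab', 'ac', 'bc']
```

The entries in file_a.txt and file_b.txt are fixed strings read from disk, so "prefix" and "string" describe them better than "combination".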

Tags: python regex find substring combinations

【Answer 1】:

Hope this helps:

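file1 = open('file_a.txt', 'r')
file2 = open('file_b.txt', 'r')

# get every item in your second file into a list
mylist = file2.readlines()
file2.close()

# read each line in the first file (iterating avoids skipping every other line)
for line in file1:
    # strip the newline so it is not part of the search string or the file name
    searchStr = line.strip()
    if not searchStr:
        continue
    # find the lines in your second file that start with this string
    exists = [s for s in mylist if s.startswith(searchStr)]
    if exists:
        # if there are matches in your second file then create a file for them
        fileNew = open(searchStr + '.txt', 'w')
        for match in exists:
            fileNew.write(match)
        fileNew.close()

# close the first file only after the loop has finished
file1.close()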

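【Comments】:

Cheers! However, it doesn't work. It seems to give the following error: mylist = file2.readlines() AttributeError: 'list' object has no attribute 'deadlines'. Is there a fix?
@jigitjigit2 Is that a typo, or did you write "deadlines" with a d in your code? If you pasted erolkaya84's code, it won't work unless you open the files first. I'll try to edit his post.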

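【Answer 2】:

What you can do is open both files and run through them line by line using for loops.

You can have two for loops. The first reads file_a.txt, since you only go through it once. The second reads through file_b.txt and looks for the string at the start of each line.

To do this, you use .find() to search for the string. Since the match has to be at the start, the returned index should be 0.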

file_a = open("file_a.txt", "r")
file_b = open("file_b.txt", "r")

for a_line in file_a:
    # This result value will be written into your new file
    result = ""
    # This is what we will search with
    search_val = a_line.strip("\n")
    print("---- Using " + search_val + " from file_a to search. ----")
    for b_line in file_b:
        print("Searching file_b using " + b_line.strip("\n"))
        if b_line.strip("\n").find(search_val) == 0:
            result += (b_line)
    print "---- Search ended ----"
    # Set the read pointer to the start of the file again
    file_b.seek(0, 0)

    if result:
        # Write the contents of "results" into a file with the name of "search_val"
        with open(search_val + ".txt", "a") as f:
            f.write(result)

file_a.close()
file_b.close()
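
As a side note on the `.find(...) == 0` test used above, here is a minimal sketch comparing it with `str.startswith`, which expresses the same check directly. The helper names `starts_with_find` and `starts_with` are made up for this illustration:

```python
def starts_with_find(line, prefix):
    # Same test as in the loop above: the prefix is first found at index 0.
    return line.find(prefix) == 0


def starts_with(line, prefix):
    # Equivalent check that states the intent directly.
    return line.startswith(prefix)


for line, prefix in [("bcsgiweyoieotpwe", "bcsg"), ("abaaaalkjel", "aaaa")]:
    # The second pair contains "aaaa" at index 2, so neither check accepts it.
    print(line, prefix, starts_with_find(line, prefix), starts_with(line, prefix))
```

Both give the same answer. `startswith` may be slightly cheaper on long lines because it does not scan the rest of the line when the prefix is absent.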

Test case:

I am using the test case from your question:

file_a.txt

aaaa
bcsg
aacd
gdee
aadw
hwer

file_b.txt

aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake

The program produces one output file, bcsg.txt, which contains bcsgiweyoieotpwe, as it should.

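【Comments】:

That's right. I apologize that my earlier comment was inaccurate. "bcsgiweyoieotpwe" is the only matching string in this example. But why does the program create zero-byte txt files for strings in file_a.txt that have no match in file_b.txt? Thanks again!
@jigitjigit2 That's what my previous code did: it wrote a new file even when there was no match. I've edited it so that it no longer does that.
Brilliant! It works! That's it. Is there an easy way to make the program show the strings it is currently reading and matching? (The original file_a and file_b are very long, and it would be nice to know where we are in the process. Right now, when I run the script on these large txt files, I only know it is running by watching Python's CPU usage in the system monitor. I can't see any new text files being created, perhaps because there would be too many of them.) Cheers!
@jigitjigit2 As in print out what it is currently doing?
When I try it on a shorter version of the original text files, it works. But when I use the large txt files, it doesn't seem to do anything. There are no new txt files in the directory. Is there a limit on the number of lines in the text files being processed?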
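
On the progress question raised in these comments, one possible approach is a small generator that passes lines through unchanged and reports a running count now and then. This is only a sketch: `with_progress` and its `every` parameter are hypothetical names, not part of the answer's code:

```python
import sys


def with_progress(lines, every=100000, out=sys.stderr):
    """Yield each line unchanged, writing a running count every `every` lines."""
    count = 0
    for count, line in enumerate(lines, 1):
        if count % every == 0:
            out.write("processed %d lines\n" % count)
            out.flush()
        yield line
    out.write("done, %d lines in total\n" % count)


# Small in-memory example; a real run would wrap an open file object instead.
sample = ["aaaibjkes\n", "bcsgiweyoieotpwe\n", "gdeaes\n"]
matches = [s for s in with_progress(sample, every=2) if s.startswith("bcsg")]
print(matches)  # ['bcsgiweyoieotpwe\n']
```

In the nested-loop script it would probably make most sense to wrap the outer loop, for example `for a_line in with_progress(file_a, every=100):`, since every line of file_a triggers a full pass over file_b. Writing to stderr keeps the progress messages separate from any normal output.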

【Answer 3】:

Try this:

f1 = open("a.txt","r").readlines()
f2 = open("b.txt","r").readlines()
file1 = [word.replace("\n","") for word in f1]
file2 = [word.replace("\n","") for word in f2]

data = []
data_dict ={}
for short_word in file1:
    data += ([[short_word,w] for w in file2 if w.startswith(short_word)])

for single_data in data:
    if single_data[0] in data_dict:
        data_dict[single_data[0]].append(single_data[1])
    else:
        data_dict[single_data[0]]=[single_data[1]]

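# items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
for key, val in data_dict.items():
    # the with block makes sure each output file is flushed and closed
    with open(key + ".txt", "w") as out:
        out.write("\n".join(val))
    print(key + ".txt created")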

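【Comments】:

Thanks. The script runs and doesn't produce any error messages, but it doesn't work, i.e. it doesn't create any new txt files. Any idea how to fix it?
In my case it created aaai.txt and bcsg.txt.
Apologies. It works! It probably just took a while because my original text files are very large. Many thanks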
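
Since the slowness with large files comes up in the comments, here is a sketch of one way to avoid comparing every entry of the first file against every line of the second. It assumes every entry in the first file is exactly 4 characters long, as described in the question; `group_by_prefix`, `write_groups` and `width` are names invented for this example:

```python
from collections import defaultdict


def group_by_prefix(prefixes, lines, width=4):
    """Group lines by their first `width` characters, keeping only wanted prefixes."""
    # Assumption: every wanted prefix is exactly `width` characters long.
    wanted = set(p.strip() for p in prefixes if p.strip())
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        key = line[:width]
        if key in wanted:  # set lookup instead of scanning the whole prefix list
            groups[key].append(line)
    return groups


def write_groups(groups):
    """Write each group to '<prefix>.txt' in the current directory."""
    for key, matched in groups.items():
        with open(key + ".txt", "w") as out:
            out.write("\n".join(matched) + "\n")


prefixes = ["aaaa\n", "bcsg\n", "aacd\n", "gdee\n", "aadw\n", "hwer\n"]
lines = ["aaaibjkes\n", "abaaaalkjel\n", "bcsgiweyoieotpwe\n",
         "bcspwiopiejowih\n", "gdeaes\n"]
groups = group_by_prefix(prefixes, lines)
print(dict(groups))  # {'bcsg': ['bcsgiweyoieotpwe']}
```

With real data the two lists would be replaced by open file objects, followed by a call to `write_groups(groups)`. Each line of the second file is then examined once, so the running time grows with the combined size of the two files rather than with their product.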