【Question title】: How to efficiently search for a list of strings in another list of strings using Python?
【Posted】: 2019-06-26 18:40:14
【Question】:

I have two lists of names (strings) that look like this:

executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']

analysts = ['Justin Post', 'Some Dude', 'Some Chick']

I need to find where these names occur in a list of strings like the following:

str = ['Justin Post - Bank of America',
 "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.", 
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
 'Brian Olsavsky - Amazon.com',
 "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.", 
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
 "I'll just remind you that the units  those do not count",
 "In-stock is very strong, especially as we head into the holiday period.",
 'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]

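The reason I need this is so that I can concatenate the conversation strings together (delimited by the names). How would I do this efficiently?

I looked at some similar questions and tried their solutions to no avail, for example: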

if any(x in str for x in executives):
    print('yes')

And this one...

match = next((x for x in executives if x in str), False)
match

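【Comments】:

  • Why are there two lists of names in the first place, when your code only iterates over one of them?
  • This is a Q&A transcript, so the names are the easiest way to delimit the questions and answers. The names also tell me who is asking the questions (analysts) and who is answering (executives). I could also put the names into a dictionary if that makes things easier or more efficient.
  • What is the desired output?
  • Check this answer, I hope it helps. stackoverflow.com/questions/4843158/…
  • Ultimately, the desired output would be a new list of strings that looks like this: str = ['Question Asker', 'blah blah blah. blah blah blah.', 'Answerer', 'blah blah blah blah blah blah. blah blah blah.', 'Question Asker', 'blah blah blah. blah blah blah.', 'Answerer', 'blah blah blah']. The key difference is that there is only one string after each name instead of several.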

Tags: python python-3.x string performance search


【Answer 1】:

I'm not sure if this is what you are looking for:

executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
text = ['Justin Post - Bank of America',
 "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.", 
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
 'Brian Olsavsky - Amazon.com',
 "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.", 
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
 "I'll just remind you that the units  those do not count",
 "In-stock is very strong, especially as we head into the holiday period.",
 'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]

result = [s for s in text if any(ex in s for ex in executives)]
print(result)

Output: ['Brian Olsavsky - Amazon.com']
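The asker's stated goal is one string per speaker turn. A minimal sketch of that merging step follows, assuming every speaker header starts with a known name followed by " - ". The helper `merge_by_speaker`, the shortened transcript, and the addition of 'Dave Fildes' to the name list are illustrative assumptions, not part of the original answer.

```python
def merge_by_speaker(lines, names):
    """Collapse a transcript into [header, text, header, text, ...]."""
    merged = []
    for line in lines:
        # Assumption: a header line starts with a known name followed by " - ".
        if any(line.startswith(name + " - ") for name in names):
            merged.append(line)
            merged.append("")  # placeholder for this speaker's joined text
        elif merged:
            # Append the line to the current speaker's text.
            merged[-1] = (merged[-1] + " " + line).strip()
        # Lines before the first header are dropped in this sketch.
    return merged


executives = ['Brian Olsavsky', 'Dave Fildes']  # 'Dave Fildes' added for illustration
analysts = ['Justin Post']
transcript = [
    'Justin Post - Bank of America',
    'Great. Thank you.',
    'Could you comment?',
    'Brian Olsavsky - Amazon.com',
    'Thank you, Justin.',
    'In-stock is very strong.',
]
merged = merge_by_speaker(transcript, executives + analysts)
print(merged)
```

Matching on the start of the line avoids treating "Thank you, Justin." as a header, which is the flaw the asker mentions in the comments.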

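【Comments】:

  • Perfect! Thank you! It exposed a flaw in my reasoning, namely that some of the names also appear inside the questions, but it does solve my problem of finding the names in the text. I should be able to make a few small modifications so that it solves the new problem as well.
  • @RagnarLothbrok, I'm glad it worked for you. Please take another look at the code; I changed it slightly, with the same result.

【Answer 2】: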
str = ['Justin Post - Bank of America',
 "Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.", 
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
 'Brian Olsavsky - Amazon.com',
 "Thank you, Justin. Yeah, let me just remind you a couple of things from last year.", 
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
 "I'll just remind you that the units  those do not count",
 "In-stock is very strong, especially as we head into the holiday period.",
 'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores"]

executives = ['Brian Olsavsky', 'Justin', 'Some Guy', 'Some Lady']

Also, if you need the exact positions, you can use this:

print([[i, str.index(q), q.index(i)] for i in executives for q in str if i in q ])

This outputs

[['Brian Olsavsky', 3, 0], ['Justin', 0, 0], ['Justin', 4, 11], ['Justin', 9, 5]]
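One caveat: `list.index` returns the first matching element, so two identical lines would report the same line index, and `str.index` only reports the first occurrence within a line. A hedged variant using `enumerate` and `str.find` is sketched below. The function name `find_positions` and the sample lines are illustrative, and the results are ordered by line rather than by name.

```python
def find_positions(names, lines):
    """Return [name, line_index, char_offset] for every occurrence of a name."""
    hits = []
    for line_index, line in enumerate(lines):
        for name in names:
            offset = line.find(name)
            while offset != -1:
                hits.append([name, line_index, offset])
                # Continue searching after the current match.
                offset = line.find(name, offset + 1)
    return hits


sample_lines = [
    'Justin Post - Bank of America',
    'Thank you, Justin.',
    'Thank you, Justin.',  # duplicate line on purpose
]
positions = find_positions(['Justin'], sample_lines)
print(positions)
```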

【Comments】:

  • This is also useful. Thank you! It might help with some of the operations I need to perform.
  • @RagnarLothbrok I'm glad I could help.

【Answer 3】:

TLDR

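The focus of this answer is efficiency. If that is not a key concern, use the other answers. If it is, build a dict from the corpus you are searching in, then use that dict to look up what you are searching for.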


#import stuff we need later

import string
import random
import numpy as np
import time
import matplotlib.pyplot as plt

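Creating the example corpus

First, we create a list of strings to search in.

Create random words, by which I mean random sequences of characters, with lengths drawn from a Poisson distribution, using this function: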

def poissonlength_words(lam_word): #generating words, length chosen from a Poisson distrib
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(np.random.poisson(lam_word))])

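(lam_word is the parameter of the Poisson distribution.)

Let's create number_of_sentences variable-length sentences from these words (by sentence I mean a list of randomly generated words separated by spaces).

The length of each sentence can also be drawn from a Poisson distribution.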

lam_word=5
lam_sentence=1000
number_of_sentences = 10000

sentences = [' '.join([poissonlength_words(lam_word) for _ in range(np.random.poisson(lam_sentence))])
             for x in range(number_of_sentences)]

sentences[0] will now start like this:

tptt lxnwf iem fedg wbfdq qaa aqrys szwx zkmukc...

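Let's also create the names we will search for. Let these names be bigrams. The first name (i.e., the first element of the bigram) will be n characters long, the last name (the second element) will be m characters long, and both will consist of random characters: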

def bigramgen(n,m):
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(n)])+' '+\
           ''.join([random.choice(string.ascii_lowercase) for _ in range(m)])

The task

Suppose we want to find the sentences in which a bigram such as ab c appears. We don't want to match dab c or ab cd, only places where ab c stands on its own.
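One way to express "stands on its own" is a regular expression that requires whitespace or a string boundary on both sides of the bigram. This is a sketch, not the method used in this answer; the helper name `contains_standalone` is an assumption made for illustration.

```python
import re


def contains_standalone(bigram, sentence):
    """True if `bigram` occurs delimited by whitespace or the ends of the string."""
    # (?<!\S) : not preceded by a non-space character
    # (?!\S)  : not followed by a non-space character
    pattern = r'(?<!\S)' + re.escape(bigram) + r'(?!\S)'
    return re.search(pattern, sentence) is not None


print(contains_standalone('ab c', 'xx ab c yy'))  # standalone match
print(contains_standalone('ab c', 'xx dab c yy'))  # part of a longer word
```

Unlike padding the bigram with a space on each side, this also matches when the bigram sits at the very start or end of a sentence.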

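To test the speed of a method, let's search for an ever-increasing number of bigrams and measure the elapsed time. The number of bigrams we search for could be, for example: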

number_of_bigrams_we_search_for = [10,30,50,100,300,500,1000,3000,5000,10000]
  • Brute force method

Just loop through each bigram, loop through each sentence, and use in to find matches. Meanwhile, measure the elapsed time with time.time().

bruteforcetime=[]
for number_of_bigrams in number_of_bigrams_we_search_for:
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    start = time.time()
    for bigram in bigrams:
        #the core of the brute force method starts here
        reslist=[]
        for sentencei, sentence in enumerate(sentences):
            if ' '+bigram+' ' in sentence:
                reslist.append([bigram,sentencei])
        #and ends here
    end = time.time()
    bruteforcetime.append(end-start)

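bruteforcetime will hold the number of seconds needed to find 10, 30, 50 ... bigrams.

Warning: this may take a long time for a large number of bigrams.

  • The sort-your-stuff-to-make-it-faster method

Let's create an empty set for every word that appears in any of the sentences (using a dict comprehension):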

worddict={word:set() for sentence in sentences for word in sentence.split(' ')}

To each set, add the index of every sentence in which the word appears:

for sentencei, sentence in enumerate(sentences):
    for wordi, word in enumerate(sentence.split(' ')):
        worddict[word].add(sentencei)

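Note that we only do this once, no matter how many bigrams we search for later.

Using this dictionary, we can look up the sentences in which each part of the bigram appears. This is very fast, because looking up a dict value is very fast. We then take the intersection of these sets. When we search for ab c, we get the set of sentence indices in which both ab and c appear.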

for bigram in bigrams:
    reslist=[]
    setlist = [worddict[gram] for gram in bigram.split(' ')]
    intersection = set.intersection(*setlist)
    for candidate in intersection:
        if bigram in sentences[candidate]:
            reslist.append([bigram, candidate])
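To make the lookup concrete, here is a minimal sketch of the same idea on a four-sentence toy corpus. The names `build_index`, `search`, and `toy` are illustrative assumptions. Unlike the snippet above, it uses `.get` so that a word missing from the corpus gives an empty result instead of a KeyError, and it pads with spaces so that only standalone bigrams are accepted.

```python
from collections import defaultdict


def build_index(sentences):
    """Map each word to the set of indices of the sentences containing it."""
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in sentence.split():
            index[word].add(i)
    return index


def search(bigram, sentences, index):
    """Return sorted indices of sentences containing `bigram` as a standalone unit."""
    # .get avoids creating new keys in the defaultdict for unseen words.
    sets = [index.get(word, set()) for word in bigram.split()]
    candidates = set.intersection(*sets) if sets else set()
    padded = ' ' + bigram + ' '
    # Only the candidate sentences are scanned, not the whole corpus.
    return sorted(i for i in candidates if padded in ' ' + sentences[i] + ' ')


toy = ['ab c de', 'c ab', 'dab c', 'x ab c']
toy_index = build_index(toy)
print(search('ab c', toy, toy_index))
print(search('zz c', toy, toy_index))
```

Sentence 1 ('c ab') contains both words, so it survives the intersection, but it is rejected by the final substring check because the words are in the wrong order.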

Let's put the whole thing together and measure the elapsed time:

logtime=[]
for number_of_bigrams in number_of_bigrams_we_search_for:
    
    bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
    
    start_time=time.time()
    
    worddict={word:set() for sentence in sentences for word in sentence.split(' ')}

    for sentencei, sentence in enumerate(sentences):
        for wordi, word in enumerate(sentence.split(' ')):
            worddict[word].add(sentencei)

    for bigram in bigrams:
        reslist=[]
        setlist = [worddict[gram] for gram in bigram.split(' ')]
        intersection = set.intersection(*setlist)
        for candidate in intersection:
            if bigram in sentences[candidate]:
                reslist.append([bigram, candidate])

    end_time=time.time()
    
    logtime.append(end_time-start_time)

Warning: this may still take a long time for a large number of bigrams, but less than the brute force method.
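The timed loop above rebuilds worddict for every batch, so each measurement includes the build cost. If the corpus is fixed, the build can be timed once and the lookups timed separately. A sketch under that assumption follows; the function name `time_index_and_queries` and the toy inputs are illustrative, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short intervals.

```python
import time


def time_index_and_queries(sentences, bigram_batches):
    """Time the index build once, then time each batch of lookups separately."""
    start = time.perf_counter()
    worddict = {}
    for sentencei, sentence in enumerate(sentences):
        for word in sentence.split(' '):
            worddict.setdefault(word, set()).add(sentencei)
    build_seconds = time.perf_counter() - start

    query_seconds = []
    for bigrams in bigram_batches:
        start = time.perf_counter()
        for bigram in bigrams:
            # Words missing from the corpus yield an empty set, not a KeyError.
            setlist = [worddict.get(gram, set()) for gram in bigram.split(' ')]
            reslist = [[bigram, i] for i in set.intersection(*setlist)
                       if bigram in sentences[i]]
        query_seconds.append(time.perf_counter() - start)
    return build_seconds, query_seconds


build_seconds, query_seconds = time_index_and_queries(
    ['ab c de', 'c ab', 'x ab c'],
    [['ab c'], ['ab c', 'zz q']],
)
print(build_seconds, query_seconds)
```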


Results

We can plot how much time each method took.

plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')

Or, plotting the y axis on a log scale:

plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.yscale('log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')

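This gives us the plots:

Building the worddict dictionary takes a lot of time, which is a disadvantage when searching for only a small number of names. However, there is a point where the corpus is large enough, and the number of names we search for is high enough, that this build time is outweighed by the speed of the lookups compared to the brute force method. So, if these conditions are met, I recommend using this method.


(The notebook is available here.)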

【Comments】:
