Python - counting frequency of a non-ngram sequence in a list of strings efficiently

【Question title】: Python - counting frequency of a non-ngram sequence in a list of strings efficiently
【Posted】: 2021-06-08 14:58:13
【Question】:

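As stated in the title, I am trying to count how often each phrase from a given list of sequences occurs in a list of strings. The catch is that the words of a phrase do not have to be adjacent: one or more other words may appear between them.

Example: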

Sequence: ('able', 'help', 'number') in a sentence "Please call us, we may be able to help, our phone number is 1234"

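I removed stop words (NLTK stop words), stripped punctuation, lowercased everything, and tokenized the sentences, so the processed sentence looks like ['please', 'call', 'us', 'able', 'help', 'phone', 'number', '1234']. I have about 30,000 sequences, ranging in length from 1 (a single word) to 3, and I am searching nearly 6,000 short sentences. My current approach is as follows: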

from collections import Counter
from tqdm import tqdm
import itertools
import nltk

# Get term frequency per sentence
def get_bow(sen, vocab):

    vector = [0] * len(vocab)
    tokenized_sentence = nltk.word_tokenize(sen)
    combined_sentence = list(itertools.chain.from_iterable([itertools.combinations(tokenized_sentence, 1),
                                                   itertools.combinations(tokenized_sentence, 2),
                                                   itertools.combinations(tokenized_sentence, 3)]))
    for el in combined_sentence:
        if el in vocab:
            cnt = combined_sentence.count(el)
            idx = vocab.index(el)
            vector[idx] = cnt
    return vector

sentence_vectors = []
for sentence in tqdm(text_list):
    sentence_vectors.append(get_bow(sentence, phrase_list))

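phrase_list is a list of tuples containing the sequences, and text_list is a list of strings. Currently, computing the frequencies takes more than 1 hour, and I am trying to find a more efficient way to get the list of frequencies for the given terms. I also tried sklearn's CountVectorizer, but it had trouble with sequences containing gaps and did not compute at all.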

I would appreciate any insight into how to make my script more efficient. Thanks in advance!

Edit:

Example of phrase_list: [('able',), ('able', 'us', 'software'), ('able', 'back'), ('printer', 'holidays'), ('printer', 'information')]

Example of text_list: ['able add printer mac still working advise calling support team mon fri excluding bank holidays would able look', 'absolutely one cat coyote peterson', 'accurate customs checks cause delays also causing issues expected delivery dates changing', 'add super mario flair fridge desk supermario dworld bowsersfury magnet set available platinum points shipping costs mynintendo reward get', 'additional information though pass comments team thanks']

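Expected output: [2, 0, 0, 1, 0] - a vector with the number of occurrences of each phrase, with values in the same order as in phrase_list. My code returns a vector of phrase occurrences for each sentence, because I was trying to implement something like a bag of words.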

【Comments】:

Could you provide sample data for "phrase_list" and "text_list", along with the expected output?
Hi, I have edited the question and added some examples.

Tags: python pandas

【Solution 1】:

Many things could be made faster, but here is the main problem:

combined_sentence = list(itertools.chain.from_iterable([itertools.combinations(tokenized_sentence, 1),
                                               itertools.combinations(tokenized_sentence, 2),
                                               itertools.combinations(tokenized_sentence, 3)]))

You generate every possible combination of 1, 2, or 3 words from the sentence. Whatever you are trying to do, this is always a bad idea.
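To get a feel for how fast this grows, here is a rough sketch that counts the tuples generated for a sentence of a given length. The helper name `n_combinations` is made up for illustration, and `math.comb` assumes Python 3.8 or newer.

```python
from math import comb

def n_combinations(n_tokens, max_len=3):
    """Number of tuples produced by itertools.combinations for sizes 1..max_len."""
    return sum(comb(n_tokens, k) for k in range(1, max_len + 1))

# The growth is cubic in the sentence length.
for n in (8, 18, 40):
    print(n, n_combinations(n))
# 8 92
# 18 987
# 40 10700
```

On top of that, the question's loop calls `combined_sentence.count(el)` and `vocab.index(el)` for each matching tuple, and tests `el in vocab` against a list for every tuple. Each of these is a linear scan, so the cost per sentence is considerably higher than the tuple count alone suggests.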

Take the sentence: "Master Yoda about sentence structure does not care."

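1. If you do want to treat this sentence as containing "Yoda does not", you still should not generate all combinations. There are faster ways, but I will only spend time on them if that is really your goal.
2. If you want to treat this sentence as not containing "Yoda does", then I think you can figure out how to speed up your code yourself. Maybe take a look here.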

I hope this helps. Let me know if you need option 1.
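For readers who do need the gapped matching described in the question, one possible approach is sketched below. It is not necessarily what the answerer had in mind. It assumes the sentences are already preprocessed so that `str.split()` is enough for tokenizing, and the helper names `count_embeddings` and `phrase_frequencies` are invented for this example. The idea is to index phrases by their first word, skip phrases whose words are not all present in the sentence, and count ordered, gapped matches with a small dynamic program instead of materializing combinations.

```python
from collections import defaultdict

def count_embeddings(phrase, tokens):
    """Count the ways `phrase` occurs in `tokens` in order, gaps allowed."""
    # ways[k] = number of ways to match the first k phrase words so far
    ways = [1] + [0] * len(phrase)
    for tok in tokens:
        # Walk right to left so one token is never used twice in a match.
        for k in range(len(phrase), 0, -1):
            if phrase[k - 1] == tok:
                ways[k] += ways[k - 1]
    return ways[-1]

def phrase_frequencies(phrase_list, text_list):
    """Total gapped occurrences of each phrase, in phrase_list order."""
    # Index phrases by first word so each sentence only visits candidates.
    by_first = defaultdict(list)
    for idx, phrase in enumerate(phrase_list):
        by_first[phrase[0]].append(idx)

    totals = [0] * len(phrase_list)
    for sentence in text_list:
        tokens = sentence.split()
        present = set(tokens)
        for word in present:
            for idx in by_first.get(word, ()):
                phrase = phrase_list[idx]
                # Cheap set test before running the dynamic program.
                if present.issuperset(phrase):
                    totals[idx] += count_embeddings(phrase, tokens)
    return totals

phrase_list = [('able',), ('able', 'us', 'software'), ('able', 'back'),
               ('printer', 'holidays'), ('printer', 'information')]
text_list = [
    'able add printer mac still working advise calling support team '
    'mon fri excluding bank holidays would able look',
    'additional information though pass comments team thanks',
]
print(phrase_frequencies(phrase_list, text_list))  # [2, 0, 0, 1, 0]
```

The counts follow the same convention as the question's `combinations`-based code: a phrase is counted once per distinct way it can be picked out of the sentence, so ('able',) scores 2 for a sentence that contains "able" twice.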

【Comments】: