Convert list of string representations of sentences into vocabulary set

【Question title】: Convert list of string representations of sentences into vocabulary set
【Posted】: 2018-10-17 16:34:54
【Question description】:

I have a list of string representations of sentences that looks something like this:

original_format = ["This is a question", "This is another question", "And one more too"]

I want to convert this list into the set of unique words in my corpus. Given the list above, the output would look like this:

{'And', 'This', 'a', 'another', 'is', 'more', 'one', 'question', 'too'}

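I've come up with a way to do this, but it takes a very long time to run. I'm interested in a more efficient way of converting from one format to the other (especially since my actual dataset contains more than 200,000 sentences).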

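FYI, what I'm doing right now is creating an empty set for the vocabulary, then looping over each sentence (split on spaces) and taking the union with the vocabulary set. Using the original_format variable defined above, it looks like this: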

vocab = set()
for q in original_format:
    vocab = vocab.union(set(q.split(' ')))

Can you help me run this conversion more efficiently?

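【Comments】:

How is the dataset stored? What format is it in originally?
They are full sentence strings in a SQL database. So I have a "question" column, and a cell in that column might look like "Is this a question?". I pull it into Python via pandas and then convert the dataframe of questions into this format.
Oh, then there is definitely a faster way to find the unique words. Probably the best way is to select all the distinct words from the SQL column.
Try this SQL query: sqlfiddle.com/#!9/5d8a55/1. It avoids converting your data at all.
This is great, Chris. Thanks for showing me that query!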

Tags: python string python-3.x list nlp

【Solution 1】:

You can use itertools.chain with set. This avoids nested for loops and building intermediate lists.

from itertools import chain

original_format = ["This is a question", "This is another question", "And one more too"]

res = set(chain.from_iterable(i.split() for i in original_format))

print(res)

{'And', 'This', 'a', 'another', 'is', 'more', 'one', 'question', 'too'}

Or, for a truly functional approach:

from itertools import chain
from operator import methodcaller

res = set(chain.from_iterable(map(methodcaller('split'), original_format)))
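
If importing itertools is not wanted, the question's loop can also be kept with only the union step changed. `vocab.union(...)` returns a new set on every iteration, copying everything collected so far, whereas `set.update` adds to the existing set in place. A minimal sketch, reusing the variable names from the question:

```python
original_format = ["This is a question", "This is another question", "And one more too"]

vocab = set()
for q in original_format:
    # update() mutates vocab in place instead of building a new set each time.
    # split() with no argument splits on any run of whitespace, so repeated
    # spaces do not produce empty-string "words".
    vocab.update(q.split())

print(vocab)
```

The printed order of the set's elements may differ from run to run; the contents are the same nine words as above.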

【Comments】:

Using itertools.chain() ran about 11% faster than mapping the split function. Both methods ran faster than the set comprehension. Thanks, jpp!
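
A comparison like the one in this comment can be reproduced roughly with timeit. The sketch below is only an illustration: the sentence list is synthetic (the three example sentences repeated to about 200,000 entries, which is an assumption standing in for the real data), the helper function names are made up for this example, and the relative timings will vary with the Python version and the data.

```python
from itertools import chain
from operator import methodcaller
from timeit import timeit

# Synthetic stand-in for the real dataset (assumption: ~200k short sentences).
sentences = ["This is a question", "This is another question", "And one more too"] * 70_000

def with_chain(data):
    # Solution 1: chain over a generator of split sentences
    return set(chain.from_iterable(s.split() for s in data))

def with_map(data):
    # Solution 1, functional variant: map str.split over the sentences
    return set(chain.from_iterable(map(methodcaller('split'), data)))

def with_comprehension(data):
    # Solution 2: nested set comprehension
    return {w for s in data for w in s.split()}

for fn in (with_chain, with_map, with_comprehension):
    seconds = timeit(lambda: fn(sentences), number=5)
    print(f"{fn.__name__}: {seconds:.3f}s for 5 runs")
```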

【Solution 2】:

Use a simple set comprehension:

{j for i in original_format for j in i.split()}

Output:

{'too', 'is', 'This', 'And', 'question', 'another', 'more', 'one', 'a'}
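
The elements print in a different order here than in the question because sets are unordered; the contents are identical. If a stable order is needed for display or comparison, one option is to sort the set, as in this small sketch:

```python
original_format = ["This is a question", "This is another question", "And one more too"]

vocab = {j for i in original_format for j in i.split()}

# sorted() returns a list; uppercase letters sort before lowercase ones
print(sorted(vocab))
# ['And', 'This', 'a', 'another', 'is', 'more', 'one', 'question', 'too']
```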

【Comments】: