Binary search in a pandas dataframe? Answers

【Question title】: Binary search in a pandas dataframe?
【Posted】: 2017-08-12 07:01:24
【Question】:

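I am searching for a large number of words in a large pandas dataframe, but I am running into performance problems. Is there a way to do a binary search over the strings in a column of a pandas dataframe?

Right now my code looks like this: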

import pandas as pd
from nltk.tokenize import word_tokenize  # assumed: word_tokenize is NLTK's tokenizer

names = pd.DataFrame(data=['one', 'two', 'three', 'four'], index=range(0, 4), columns=['Name'])
sentence = 'There are two trees in the street.'

for word in word_tokenize(sentence):
    # Search for each word in all the names
    new_names = names[names['Name'].str.startswith(word)]
    # then do some operations on the names

But I need better performance for names[names['Name'].str.startswith(word)], and I think I should find a way to do a binary search on the 'Name' column.

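【Comments】:

What exactly have you tried? You need to provide more detail. A sample DataFrame with some of the code you have tried would help a lot.
@TedPetrou Thanks! I changed the question a bit.
There is still not enough detail to give an answer. What is happening under the iterrows? You should generally avoid iterrows at all costs. A sample dataframe with more information would help a lot.
@TedPetrou I added sample data at the beginning. The iterrows is not important; I can use other methods for the next steps. The main problem is searching the dataframe when it gets too large.
@AmirAhmad, you may want to look at this approach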

Tags: python string pandas search dataframe

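【Solution 1】:

There are at least two problems with this approach. First, names['Name'].str.startswith(word) is recomputed for every word, although the part that does not depend on the word (the first word of each name) can be computed once and cached. Second, startswith() is a prefix match, so the word "the" would also match a name that begins with "there". In code, it could be changed like this: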

# calculate startword only once: the first space-separated word of each name
startword = names['Name'].str.split(" ", n=1).str[0]

for word in word_tokenize(sentence):
    # also, match by the full word only
    new_names = names[startword == word]
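
To make the second problem concrete, here is a small sketch comparing the prefix match with the full first-word match. The sample names are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data, chosen so that one name merely starts with "the"
names = pd.DataFrame({'Name': ['there is', 'the end', 'two trees']})
word = 'the'

# Prefix match: also picks up 'there is'
prefix_hits = names[names['Name'].str.startswith(word)]

# Full-word match on the first word of each name: only 'the end'
first_word = names['Name'].str.split(' ', n=1).str[0]
exact_hits = names[first_word == word]
```

Both comparisons are case-sensitive, so lowercasing the names and the words first may be needed, depending on the data.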

It will be faster if startword is the index:

names.index = startword
for word in word_tokenize(sentence):
    # also, match by the full word only; .loc raises KeyError for a missing label
    if word in names.index:
        new_names = names.loc[word]
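
If prefix matching (as in the original startswith() call) is really what is wanted, a binary search is possible once the names are sorted. The following is only a minimal sketch and not part of the answer above. It uses Python's standard bisect module on a sorted copy of the column. The helper rows_with_prefix and the extra name 'twelve' are hypothetical additions for illustration, and the upper bound assumes that no name contains the code point U+10FFFF.

```python
import bisect
import pandas as pd

names = pd.DataFrame({'Name': ['one', 'two', 'three', 'four', 'twelve']})

# Sort once up front; binary search is only valid on sorted data.
sorted_names = names.sort_values('Name').reset_index(drop=True)
keys = sorted_names['Name'].tolist()

def rows_with_prefix(prefix):
    """Return the rows whose Name starts with `prefix` (hypothetical helper)."""
    # First position whose key is >= prefix
    lo = bisect.bisect_left(keys, prefix)
    # chr(0x10FFFF) is the largest code point, so this is an upper bound
    # for every key that begins with `prefix`
    hi = bisect.bisect_right(keys, prefix + chr(0x10FFFF))
    return sorted_names.iloc[lo:hi]

matches = rows_with_prefix('tw')
no_matches = rows_with_prefix('There')
```

Each lookup costs two binary searches over the sorted keys instead of a scan of the whole column, at the price of sorting once up front.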

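【Comments】:

Thanks! When there are multiple rows for that start word, this .loc[] returns a dataframe, but when there is only one row it returns some other data type, which I have to handle differently. Is there a workaround?
@AmirAhmad Just use names.loc[[word]] instead of names.loc[word]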
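
The difference between the two selectors can be checked with a short sketch. The sample names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: 'one' occurs once as a start word, 'two' twice
names = pd.DataFrame({'Name': ['one tree', 'two trees', 'two birds']})
names.index = names['Name'].str.split(' ', n=1).str[0]

single = names.loc['one']        # label occurs once: a Series (one row)
multi = names.loc['two']         # label occurs twice: a DataFrame
always_df = names.loc[['one']]   # list selector: a DataFrame, even for one row
```

In recent pandas versions the list form still raises KeyError when the word is not in the index, so a membership check such as `word in names.index` remains necessary.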