Find an EXACT substring in a pandas text column gives ValueError: cannot reindex from a duplicate axis

【Question title】: Find an EXACT substring in a pandas text column gives ValueError: cannot reindex from a duplicate axis
【Posted】: 2018-07-14 22:46:45
【Question】:

I need to match an EXACT substring in a pandas text column. However, when that text column contains duplicate entries, I get: ValueError: cannot reindex from a duplicate axis.

I looked at the following post to work out how to query rows, but it is mainly about matching whole entries rather than substrings: Select rows from a DataFrame based on values in a column in pandas

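The following post shows how to find a substring with a regex pattern, which is what I need once regex word boundaries are added, and it is the approach I use below: How to filter rows containing a string pattern from a Pandas dataframe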

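I was able to get the code from the second SO post above to work, except when my Comments column contains duplicates. Note that entries 600 and 700 in the debug.txt file below are duplicates, which is fine. These duplicates are expected, so how do I accommodate them?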

The data file "debug.txt", and therefore the data frame, has 2 unique columns, so this is not the duplicate-column-name problem described in this post: ValueError: cannot reindex from a duplicate axis using isin with pandas

--debug.txt -----

PKey, Comments
100,Bad damaged product need refund.
200,second item
300,a third item goes here
400,Outlier text
500,second item
600,item
700,item

My code is below. Any help you can provide in resolving the above ValueError would be greatly appreciated.

import re
import pandas as pd

# Define params used below
fileHeader = True

dictB = {}

inputFile = open("debug.txt", 'r')

if fileHeader == True:
    inputFile.readline()

for line in inputFile:

    inputText = line.split(",")

    primaryKey = inputText[0]
    inputTexttoAnalyze = inputText[1]

    # Clean inputTexttoAnalyze and do other things...

    # NOTE: Very inefficient to add 1 row at a time to a Pandas DF. 
    # They suggest combining the data in some other variable (like my dictionary)
    # then copy it to the DF. 
    # Source: https://stackoverflow.com/questions/10715965/add-one-row-in-a-pandas-dataframe

    dictB[primaryKey] = inputTexttoAnalyze

inputFile.close()

# Below is a List of words that must produce an EXACT match to a *substring* within 
# the data frame Comments column. 
findList = ["damaged product", "item"]

print("\nResults should ONLY have", findList, "\n")


dfB = pd.DataFrame.from_dict(dictB, orient='index').reset_index()
dfB.rename(columns={'index': 'PKey', 0: 'Comments'}, inplace=True)

for entry in findList:
    rgx = '({})'.format("".join(r'(\b%s\b)' % entry))

    # The following line gives the error: ValueError: cannot reindex from a duplicate axis. 
    # I DO have expected duplicate values in my input file.
    resultDFb = dfB.set_index('Comments').filter(regex=rgx, axis=0)
    for key in resultDFb['PKey']:
        print(entry, key)

# This SO post says to run .index.duplicated() to see duplicated results, but I
# don't see any, which is odd since there ARE duplicate results.
# https://stackoverflow.com/questions/38250626/valueerror-cannot-reindex-from-a-duplicate-axis-pandas

print(dfB.index.duplicated())

【Question comments】:

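The error is most likely caused by duplicate index entries, i.e. the Comments column that you set as the index contains duplicates. I would say filter is safe to use when the labels being filtered (column names or index keys) are guaranteed to be unique. Otherwise, df[series.str.contains(pattern)] is the safer option.
The code below gave me the results I need, but I am also curious about your suggestion. I tried to apply it to my data frame in the code above --> dfB[series.str.contains(pattern)] but got the error "NameError: name 'series' is not defined". What is supposed to go in "series"?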
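
On the question of what "series" stands for: in the suggestion above it is presumably the text column itself, i.e. dfB['Comments']. The following is a minimal sketch under that assumption. The sample rows are built inline instead of being read from debug.txt, and re.escape and the matches dictionary are additions that are not in the original code. It also shows why dfB.index.duplicated() reports nothing: the duplicates are in the Comments column and only become index labels after set_index('Comments').

```python
import re
import pandas as pd

# Sample rows from debug.txt, built inline so the example is self-contained.
dfB = pd.DataFrame({
    "PKey": [100, 200, 300, 400, 500, 600, 700],
    "Comments": [
        "Bad damaged product need refund.",
        "second item",
        "a third item goes here",
        "Outlier text",
        "second item",
        "item",
        "item",
    ],
})

findList = ["damaged product", "item"]

# The default index (0..6) is unique; the duplicates only show up once
# the Comments column is used as the index.
print(dfB.index.duplicated().any())
print(dfB.set_index("Comments").index.duplicated().any())

matches = {}  # hypothetical container: search term -> list of matching PKey values
for entry in findList:
    # \b anchors the term on word boundaries; re.escape keeps it literal.
    pattern = r"\b" + re.escape(entry) + r"\b"
    # "series" is the column: the boolean mask selects rows without reindexing.
    mask = dfB["Comments"].str.contains(pattern, regex=True)
    matches[entry] = dfB.loc[mask, "PKey"].tolist()
    print(entry, matches[entry])
```

Because the mask is aligned on the default row index rather than on the comment text, repeated comments such as "item" do not get in the way.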

Tags: python pandas dataframe

【Solution 1】:

One problem I see is that the Comments header has a leading space (", Comments"), which can cause problems in the DataFrame.
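
As a sketch of the header issue, assuming the file content is pasted into a string instead of being read from debug.txt: by default read_csv keeps the space as part of the column name, so df.Comments would not be found. Either of the two options shown should normalize it without editing the file; the variable names are illustrative only.

```python
import io
import pandas as pd

# Same content as debug.txt, including the space after the comma in the header.
raw = """PKey, Comments
100,Bad damaged product need refund.
200,second item
300,a third item goes here
400,Outlier text
500,second item
600,item
700,item
"""

# Read as-is: the second column name keeps its leading space.
df_raw = pd.read_csv(io.StringIO(raw))
print(list(df_raw.columns))

# Option 1: let the parser skip spaces that follow the delimiter.
df_skip = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Option 2: strip whitespace from the column names after reading.
df_strip = df_raw.rename(columns=str.strip)

print(list(df_skip.columns), list(df_strip.columns))
```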

If I understand you correctly, you are trying to identify all rows in the DataFrame where Comments contains one of the values in findList.

The following might work for you (after you remove the leading space from the Comments header).

import pandas as pd
import re

def check(s):
    for item in findList:
        if re.search(r'\b' + item + r'\b', s):
            return True
    return False


findList = ["damaged prod", "item"]

df = pd.read_csv("debug.txt")

df[df.Comments.apply(check)]

Out[9]: 
   PKey                          Comments
1   200                       second item
2   300            a third item goes here
4   500                       second item
5   600                              item
6   700                              item

Hope this helps.

【Discussion】:

Using a regex will let you do an exact match. I have edited the original suggestion to include the regex.
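
To illustrate what "exact" means here, the sketch below compares a plain substring test with a word-boundary regex. The helper name has_exact is hypothetical, and re.escape is an addition to the answer's pattern so that search terms containing regex metacharacters are treated literally.

```python
import re

text = "Bad damaged product need refund."

def has_exact(term, s):
    """Return True if term occurs in s delimited by word boundaries."""
    return re.search(r"\b" + re.escape(term) + r"\b", s) is not None

print(has_exact("damaged product", text))  # True
print(has_exact("damaged prod", text))     # False: "prod" runs on into "product"
print("damaged prod" in text)              # True: a plain substring test still matches
print(has_exact("item", "second items"))   # False: "item" is followed by "s"
```

This is why the row with PKey 100 does not appear in the output above when findList contains "damaged prod" instead of "damaged product".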