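【Posted】: 2018-07-14 22:46:45
【Question】:
I need to match an EXACT substring in a pandas text column. However, when that DataFrame text column has duplicate entries, I get: ValueError: cannot reindex from a duplicate axis.
I looked at the following post to work out how to query rows, but it is mostly about matching whole entries rather than substrings: Select rows from a DataFrame based on values in a column in pandas
The following post shows how to find substrings with a regex pattern. That is what I need, because I want to use regex word boundaries, and it is what I use below: How to filter rows containing a string pattern from a Pandas dataframe
I was able to get the code from the second SO post above to work, except when there are duplicates in my Comments column. Note that entries 600 and 700 in the debug.txt file below are dupes, which is fine. These dupes are expected, so how do I accommodate them?
The data file "debug.txt", and therefore the DataFrame, has 2 unique columns, so this is not the duplicate-column-name problem described in this post: ValueError: cannot reindex from a duplicate axis using isin with pandas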
--debug.txt -----
PKey, Comments
100,Bad damaged product need refund.
200,second item
300,a third item goes here
400,Outlier text
500,second item
600,item
700,item
My code is below. Any help you can provide in resolving the ValueError above would be greatly appreciated.
import re
import pandas as pd
# Define params used below
fileHeader = True
dictB = {}
inputFile = open("debug.txt", 'r')
if fileHeader == True:
    inputFile.readline()
for line in inputFile:
    inputText = line.split(",")
    primaryKey = inputText[0]
    inputTexttoAnalyze = inputText[1]
    # Clean inputTexttoAnalyze and do other things...
    # NOTE: Very inefficient to add 1 row at a time to a Pandas DF.
    # They suggest combining the data in some other variable (like my dictionary)
    # then copy it to the DF.
    # Source: https://stackoverflow.com/questions/10715965/add-one-row-in-a-pandas-dataframe
    dictB[primaryKey] = inputTexttoAnalyze
inputFile.close()
# Below is a List of words that must produce an EXACT match to a *substring* within
# the data frame Comments column.
findList = ["damaged product", "item"]
print("\nResults should ONLY have", findList, "\n")
dfB = pd.DataFrame.from_dict(dictB, orient='index').reset_index()
dfB.rename(columns={'index': 'PKey', 0: 'Comments'}, inplace=True)
for entry in findList:
    rgx = '({})'.format("".join(r'(\b%s\b)' % entry))
    # The following line gives the error: ValueError: cannot reindex from a duplicate axis.
    # I DO have expected duplicate values in my input file.
    resultDFb = dfB.set_index('Comments').filter(regex=rgx, axis=0)
    for key in resultDFb['PKey']:
        print(entry, key)
# This SO post says to run .index.duplicated() to see duplicated results, but I
# don't see any, which is odd since there ARE duplicate results.
# https://stackoverflow.com/questions/38250626/valueerror-cannot-reindex-from-a-duplicate-axis-pandas
print(dfB.index.duplicated())
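A note on that last check: dfB.index.duplicated() inspects the default integer index that reset_index() creates. That index is unique, so no duplicates are reported. The repeated labels only appear once Comments becomes the index. A minimal sketch of the difference, assuming the debug.txt rows are inlined in a hand-built DataFrame (df_demo is an illustrative name, not part of the code above):

```python
import pandas as pd

# Hypothetical stand-in for dfB, built inline from the debug.txt rows.
df_demo = pd.DataFrame({
    "PKey": ["100", "200", "300", "400", "500", "600", "700"],
    "Comments": [
        "Bad damaged product need refund.",
        "second item",
        "a third item goes here",
        "Outlier text",
        "second item",
        "item",
        "item",
    ],
})

# The default integer index (0..6) contains no repeated labels.
default_dupes = df_demo.index.duplicated()
print(default_dupes.any())

# Once Comments is the index, repeated comment texts become repeated labels.
comment_dupes = df_demo.set_index("Comments").index.duplicated()
print(comment_dupes.tolist())
```

With the default keep='first' behaviour, only the second and later occurrences of a label are flagged.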
【Comments】:
-
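The error is most likely caused by duplicate index entries: the Comments column you set as the index contains duplicates, hence the error. I would say filter is only safe to use when the index (column labels or row labels) is guaranteed to be unique. Otherwise, df[series.str.contains(pattern)] is the safer option.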
-
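The code below gives me the results I need, but I am also curious about your suggestion. I tried applying it to the DataFrame in my code above --> dfB[series.str.contains(pattern)] but got the error "NameError: name 'series' is not defined". What is "series" supposed to be?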
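
On the "series" question: in the suggestion above it most likely stands for the text column itself, i.e. dfB['Comments'], rather than a predefined name. A minimal sketch under that assumption, using a hand-built copy of the debug.txt rows (df_demo, find_list and matches are illustrative names) and re.escape so each search term is treated literally:

```python
import re
import pandas as pd

# Hypothetical stand-in for dfB, built inline from the debug.txt rows.
df_demo = pd.DataFrame({
    "PKey": ["100", "200", "300", "400", "500", "600", "700"],
    "Comments": [
        "Bad damaged product need refund.",
        "second item",
        "a third item goes here",
        "Outlier text",
        "second item",
        "item",
        "item",
    ],
})

find_list = ["damaged product", "item"]

matches = {}
for entry in find_list:
    # Word boundaries on both sides give a whole-word substring match.
    pattern = r"\b{}\b".format(re.escape(entry))
    # "series" here is simply the Comments column; the result is a boolean mask.
    mask = df_demo["Comments"].str.contains(pattern, regex=True)
    matches[entry] = df_demo.loc[mask, "PKey"].tolist()
    print(entry, matches[entry])
```

Because the boolean mask is aligned on the default integer index and Comments is never used as the index, repeated comment texts do not trigger a reindex.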