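This is a bit messy, and there are surely more direct ways to do some of the steps, but it works for your data.
Step 1: I simply call reset_index() (assuming the index holds the row numbers) to put the row number into a column.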
df.reset_index(inplace=True)
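To make the effect of this step concrete, here is a minimal sketch on a made-up two-row frame (the name demo and its contents are assumptions for illustration, not your real data). With the default integer index, the old row labels end up in a new column called 'index', which is the column the later merge relies on.

```python
import pandas as pd

# Hypothetical two-row frame; the column names mirror the ones used in this answer.
demo = pd.DataFrame({
    'Module': ['Module 1', 'Module 1'],
    'Line Item': ['Line Item 1', 'Line Item 2'],
    'Formula': ['hello[SUM: hello2]', 'goodbye[LOOKUP: blue123] + hello[SUM: hello2]'],
})

# With the default RangeIndex, reset_index() adds a column named 'index'
# holding the old row labels (0, 1, ...) and keeps a fresh RangeIndex.
demo.reset_index(inplace=True)
print(list(demo.columns))
```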
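Next I wrote a for loop that checks, for each value, whether it occurs anywhere else in the Formula column (using .str.contains()) and, if so, where. That information is stored in a dictionary. Note that I split each formula on + to get the individual values to search for, since that looks like the delimiter in your dataset, but you can adjust it as needed.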
# the dictionary will have a key containing row number and the value we searched for
# the value will contain the module and line item values
result = {}
# create a rownumber variable so we know where in the dataset we are
rownumber = -1
# now we just iterate over every row of the Formula series
for row in df['Formula']:
    rownumber += 1
    # and also every relevant value within that cell
    for value in row.split('+'):
        # we clean the value from trailing/preceding whitespace
        value = value.strip()
        # and then we build our key and value and update our dictionary
        key = 'row:|:' + str(rownumber) + ':|:' + value
        value = df.loc[df.Formula.str.contains(value, regex=False) & (df.index != rownumber), ['Module', 'Line Item']]
        result.update({key: value})
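The regex=False argument matters here because the searched values contain square brackets, which a regular expression would read as a character class. A small sketch with made-up formulas (the names formulas and needle are assumptions for illustration) shows the difference:

```python
import pandas as pd

formulas = pd.Series([
    'hello[SUM: hello2]',
    'goodbye[LOOKUP: blue123] + hello[SUM: hello2]',
])
needle = 'goodbye[LOOKUP: blue123]'

# Literal substring search: the brackets are matched as plain characters.
literal = formulas.str.contains(needle, regex=False)

# Regex search: '[LOOKUP: blue123]' is read as a character class,
# so the literal text is no longer found in either formula.
as_regex = formulas.str.contains(needle, regex=True)

print(literal.tolist(), as_regex.tolist())
```

If you ever need regex features together with literal values, escaping the value with re.escape() first is one option.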
We can now unpack the dictionary into lists, keeping only the entries where a match was found:
where_raw = []
what_raw = []
rows_raw = []
for key, value in result.items():
    # skip searches that matched nothing (an empty DataFrame)
    if value.empty:
        continue
    where_raw.append(list(value['Module'] + ' ' + value['Line Item']))
    what_raw.append(key.split(':|:')[2])
    rows_raw.append(int(key.split(':|:')[1]))
tempdf = pd.DataFrame({'row': rows_raw, 'where': where_raw, 'what': what_raw})
tempdf now holds one row per match. However, we want one row per original row of df, so we combine all matches that belong to the same original row into one:
where = []
what = []
rows = []
for row in tempdf.row.unique():
    where.append(list(tempdf.loc[tempdf.row == row, 'where']))
    what.append(list(tempdf.loc[tempdf.row == row, 'what']))
    rows.append(row)
result = df.merge(pd.DataFrame({'index':rows,'where':where,'what':what}))
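As one of the "more direct" options mentioned at the top, the per-row collapsing can probably also be done with groupby instead of an explicit loop. This is only a sketch on a made-up tempdf (its contents and the name collapsed are assumptions for illustration), not a drop-in replacement tested on your data:

```python
import pandas as pd

# Hypothetical stand-in for tempdf: one row per (source row, searched value) match.
tempdf = pd.DataFrame({
    'row': [0, 1, 1],
    'where': [['Module 1 Line Item 2'], ['Module 2 Line Item 1'], ['Module 1 Line Item 1']],
    'what': ['hello[SUM: hello2]', 'goodbye[LOOKUP: blue123]', 'hello[SUM: hello2]'],
})

# sort=False keeps groups in order of first appearance, like .unique() does.
grouped = tempdf.groupby('row', sort=False)
where_s = grouped['where'].apply(list)  # one list of matches per source row
what_s = grouped['what'].apply(list)

# Same shape as the frame built from rows/where/what in the loop version.
collapsed = pd.DataFrame({
    'index': list(where_s.index),
    'where': list(where_s),
    'what': list(what_s),
})
print(collapsed)
```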
Finally, we get the result by left-merging these lists back onto the original dataframe:
result = df.merge(pd.DataFrame({'index':rows,'where':where,'what':what}),how='left',on='index').drop('index',axis=1)
And last, we can add the repeated column like this:
# rows without any match have NaN in 'what' after the left merge
result['repeated'] = result['what'].notna()
print(result)
Module    Line Item    Formula                                          what                                                where
Module 1  Line Item 1  hello[SUM: hello2]                               ['hello[SUM: hello2]']                              [['Module 1 Line Item 2']]
Module 1  Line Item 2  goodbye[LOOKUP: blue123] + hello[SUM: hello2]    ['goodbye[LOOKUP: blue123]', 'hello[SUM: hello2]']  [['Module 2 Line Item 1'], ['Module 1 Line Item 1']]
Module 2  Line Item 1  goodbye[LOOKUP: blue123] + some other line item  ['goodbye[LOOKUP: blue123]']                        [['Module 1 Line Item 2']]
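A caveat on the repeated flag: after a left merge, rows with no match hold NaN in 'what', and NaN compared with '' via != evaluates to True. The sketch below uses two made-up frames (base and matches are hypothetical names) to show how a missing-value test separates matched from unmatched rows, while a comparison against the empty string does not:

```python
import pandas as pd

# Hypothetical frames: row 1 has no match, so it is absent from `matches`.
base = pd.DataFrame({'index': [0, 1], 'Formula': ['a + b', 'c']})
matches = pd.DataFrame({'index': [0], 'what': [['a']]})

merged = base.merge(matches, how='left', on='index')

# NaN != '' evaluates to True, so this comparison flags every row.
flag_by_comparison = merged['what'] != ''

# notna() is False only where the left merge found nothing.
flag_by_notna = merged['what'].notna()

print(flag_by_comparison.tolist(), flag_by_notna.tolist())
```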