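This solution iterates over each group, takes the matching subset of each DataFrame, and forms the Cartesian product of their rows so that every row can be compared with every other row.
A few nested zip/filter/reverse tricks keep only the columns that are filled in. Collecting the results in a list together with the group and the dates from both DataFrames, we can build a DataFrame, sort it, group it, and keep the highest score for each.
Merging that into the second DataFrame should give you the desired output.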
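To make the zip/filter/reverse step concrete, here is a minimal sketch on a single pair of rows taken from the sample data. It restates the alignment in a slightly more explicit form (reversing both rows before zipping) rather than copying the one-liner used in the full solution; the names `b_filled`, `pairs`, `a_vals`, `b_vals`, and `score` are illustrative only.

```python
import numpy as np

# Columns '1'..'7' of one row from each DataFrame (same group, A).
a = ['100%', '75%', '60%', '50%', '40%', '30%', '0%']      # df,  Series 01/08/2021
b = [np.nan, np.nan, np.nan, np.nan, '39%', '28%', '0%']   # df2, Series 01/09/2021

# NaN is the only value that is not equal to itself, so v == v drops the NaNs.
b_filled = [v for v in b if v == v]

# Walk both rows from the last column backwards. zip stops as soon as
# b_filled is exhausted, so only the trailing columns of a are kept.
# This assumes the filled values of b sit in the trailing columns.
pairs = list(zip(reversed(a), reversed(b_filled)))

# Convert '39%' -> 0.39 and so on.
a_vals = [int(s.replace('%', '')) / 100 for s, _ in pairs]
b_vals = [int(s.replace('%', '')) / 100 for _, s in pairs]

# Pearson correlation of the aligned values (off-diagonal of the 2x2 matrix).
score = np.corrcoef(a_vals, b_vals)[1, 0]
print(a_vals, b_vals, score)
```

For this pair the score comes out at roughly 0.9994, which is the value reported for the first row of the output. The full solution below applies the same alignment to every pair of rows within a group.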
import pandas as pd
import numpy as np
from itertools import product
df = pd.DataFrame({'Series': {0: '01/08/2021',
1: '08/08/2021',
2: '15/08/2021',
3: '01/08/2021',
4: '08/08/2021',
5: '15/08/2021'},
'1': {0: '100%', 1: '100%', 2: '100%', 3: '100%', 4: '100%', 5: '100%'},
'2': {0: '75%', 1: '95%', 2: '85%', 3: '70%', 4: '90%', 5: '95%'},
'3': {0: '60%', 1: '80%', 2: '60%', 3: '65%', 4: '80%', 5: '60%'},
'4': {0: '50%', 1: '60%', 2: '40%', 3: '55%', 4: '60%', 5: '40%'},
'5': {0: '40%', 1: '30%', 2: '20%', 3: '45%', 4: '30%', 5: '30%'},
'6': {0: '30%', 1: '10%', 2: '10%', 3: '35%', 4: '10%', 5: '20%'},
'7': {0: '0%', 1: '0%', 2: '5%', 3: '0%', 4: '0%', 5: '5%'},
'GROUP': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}})
df2 = pd.DataFrame({'Series': {0: '01/09/2021',
1: '08/09/2021',
2: '15/09/2021',
3: '15/09/2021'},
'1': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'2': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan},
'3': {0: np.nan, 1: '62%', 2: np.nan, 3: np.nan},
'4': {0: np.nan, 1: '44%', 2: np.nan, 3: np.nan},
'5': {0: '39%', 1: '21%', 2: np.nan, 3: '30%'},
'6': {0: '28%', 1: '12%', 2: '8%', 3: '19%'},
'7': {0: '0%', 1: '7%', 2: '0%', 3: '0%'},
'Similarity_Score': {0: 0.99, 1: 0.99, 2: 1.0, 3: 1.0},
'Similarity_Week': {0: '01/08/2021',
1: '15/08/2021',
2: '08/08/2021',
3: '15/08/2021'},
'Group': {0: 'A', 1: 'A', 2: 'A', 3: 'B'}}
)
df2.drop(columns=['Similarity_Score','Similarity_Week'], inplace=True)
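l = []
for g, data in df.groupby('GROUP'):
    # Rows of df2 that belong to the same group
    x = df2.loc[df2['Group']==g]
    # Compare every row of df with every row of df2 within the group
    for c in product(data.values, x.values):
        # Drop the first (date) and last (group) elements, keeping columns '1'..'7'
        a = c[0][1:-1]
        b = c[1][1:-1]
        # v==v is False only for NaN; zip truncates reversed(a) to the number of filled values in b
        a,b = list(zip(*(zip(reversed(a),list(filter(lambda v: v==v, b))))))
        a = [int(s.replace('%',''))/100 for s in a]
        # Reverse b so it is in the same (reversed) column order as a
        b = list(reversed([int(s.replace('%',''))/100 for s in b]))
        l.append([g,c[0][0],c[1][0], np.corrcoef(a,b)[1,0]])
# Keep the best-scoring Similarity_Week per (Group, Series) and merge it back onto df2
scores = pd.DataFrame(l, columns=['Group','Similarity_Week','Series','Similarity_Score'])
best = scores.sort_values(by=['Similarity_Score', 'Similarity_Week'], ascending=False).groupby(['Group','Series']).head(1)
out = df2.merge(best, on=['Group','Series'])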
Output
Series 1 2 3 4 5 6 7 Group Similarity_Week \
0 01/09/2021 NaN NaN NaN NaN 39% 28% 0% A 01/08/2021
1 08/09/2021 NaN NaN 62% 44% 21% 12% 7% A 15/08/2021
2 15/09/2021 NaN NaN NaN NaN NaN 8% 0% A 01/08/2021
3 15/09/2021 NaN NaN NaN NaN 30% 19% 0% B 15/08/2021
Similarity_Score
0 0.999405
1 0.999005
2 1.000000
3 0.999286
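I believe the scores for 15/09/2021 in group A are all very close to each other, so if you round the scores you will get a different nearest date. You can verify this by checking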
[x for x in l if x[2]=='15/09/2021' and x[0]=='A']
which yields
[['A', '01/08/2021', '15/09/2021', 1.0],
['A', '08/08/2021', '15/09/2021', 0.9999999999999998],
['A', '15/08/2021', '15/09/2021', 0.9999999999999998]]
So in theory 15/08/2021 would be the date if you rounded to a few decimal places, which you can do by wrapping round() around the np.corrcoef call.
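The effect of rounding can be sketched on just those three candidate rows. The snippet below is an illustration, not part of the original solution: the helper `best_week` and the choice of 4 decimal places are assumptions, and the ordering is the same sort/groupby/head(1) used above. Note that Similarity_Week is compared as a string here, not as a date.

```python
import pandas as pd

# Candidate matches for Series 15/09/2021 in group A, copied from the check above.
rows = [['A', '01/08/2021', '15/09/2021', 1.0],
        ['A', '08/08/2021', '15/09/2021', 0.9999999999999998],
        ['A', '15/08/2021', '15/09/2021', 0.9999999999999998]]
cands = pd.DataFrame(rows, columns=['Group', 'Similarity_Week', 'Series', 'Similarity_Score'])

def best_week(frame):
    # Hypothetical helper: highest score first, ties broken by
    # Similarity_Week in descending (string) order, then keep the top row.
    ordered = frame.sort_values(by=['Similarity_Score', 'Similarity_Week'], ascending=False)
    return ordered.groupby(['Group', 'Series']).head(1)['Similarity_Week'].iloc[0]

# Without rounding, the exact 1.0 wins by a floating-point hair.
unrounded = best_week(cands)

# Rounding to 4 decimals (an assumed precision) turns all three scores into
# 1.0, so the tie-break on Similarity_Week decides instead.
rounded = best_week(cands.assign(Similarity_Score=cands['Similarity_Score'].round(4)))

print(unrounded, rounded)
```

In the full solution the equivalent change would be something like round(np.corrcoef(a,b)[1,0], 4) inside l.append.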