Counting same entries per group in Python: answers

【Question title】: Counting same entries per group in Python
【Posted】: 2020-08-19 12:59:53
【Question description】:

I have a dataframe of the following form:

group base height weight size
0      A     10     5     M
0      A     20     5     M
1      A     10     10    S
2      A      5      5    L

How can I get a matrix that counts matching entries for each pair of groups? The output should look like this:

compare  base height weight size
0,1        3/3  2/3   2/3   2/3
0,2        3/3  0/3   3/3   2/3
1,2        2/2  0/2   0/2   0/2

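【Comments on the question】:

You may get better answers if you also post your code :)
The only solution I can think of at the moment is to work column by column and group by group, which would probably fill a page here and cause some confusion.
How big is your data?
The shape is (117764, 39), but I would probably only want to compare some of the columns.

Tags: python pandas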

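【Solution 1】:

The core of the solution is finding the most frequent values.
Use itertools.combinations to get the valid pairs of groups.
Compare the most frequent values with every row in the group combination, then sum() the truth matrix to get the number of matches.
The rest is formatting.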
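
As a rough illustration of the first steps (not the answerer's code), the sketch below lists the group pairs with itertools.combinations and counts, per column, how many rows of one pair equal a reference row. The frame is rebuilt by hand from the question, and using the first row of the pair as the reference is an assumption made only for this example; the answer takes the most frequent row instead.

```python
import itertools

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group": [0, 0, 1, 2],
    "base": ["A", "A", "A", "A"],
    "height": [10, 20, 10, 5],
    "weight": [5, 5, 10, 5],
    "size": ["M", "M", "S", "L"],
})
cols = [c for c in df.columns if c != "group"]

# Every unordered pair of group labels: (0, 1), (0, 2), (1, 2).
pairs = list(itertools.combinations(df["group"].unique(), 2))

# Rows belonging to the first pair of groups, value columns only.
sub = df.loc[df["group"].isin(pairs[0]), cols]

# Assumption for illustration: use the first row as the reference.
reference = sub.iloc[0]

# Comparing a frame with a row gives a boolean "truth matrix";
# summing it counts the matching rows in each column.
matches = (sub == reference).sum()
print(pairs)
print(matches)
```

For groups 0 and 1 the reference row is (A, 10, 5, M), so all three rows match on base and two rows match on each of the other columns.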

import io
import itertools

import pandas as pd

df = pd.read_csv(io.StringIO("""group base height weight size
0      A     10     5     M
0      A     20     5     M
1      A     10     10    S
2      A      5      5    L"""), sep=r"\s+")

# columns we're working with
cols = [c for c in df.columns if c!= "group"]

# iterate over combinations of groups
dfx = pd.DataFrame()
for gp in itertools.combinations(df.group.unique(), 2):
    dfg = df.loc[df.group.isin(gp),cols]
    dfx = pd.concat([dfx, 
                     (dfg == dfg.value_counts().index[0])
                     .sum().to_frame().T.assign(gs=len(dfg), compare=",".join(str(e) for e in gp))
                    ])
# a count of 1 means only the reference row itself matched, so report it as 0
dfx = dfx.reset_index(drop=True).replace(1,0).astype(str)
# format as required
dfx.loc[:,cols] = dfx[cols].apply(lambda x: x+" / " +dfx["gs"])

dfx.drop(columns="gs")

	base	height	weight	size	compare
0	3 / 3	2 / 3	2 / 3	2 / 3	0,1
1	3 / 3	0 / 3	3 / 3	0 / 3	0,2
2	2 / 2	0 / 2	0 / 2	0 / 2	1,2
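
Note that the size column for the pair 0,2 shows 0 / 3 above, while the question asks for 2/3: the answer compares every row against one complete most-frequent row rather than against the most common value of each column. A minimal sketch of a per-column variant follows, assuming that "matching entries" means the size of the largest set of equal values in a column, with a value that occurs only once counted as 0. The helper name count_shared_entries and its group_col and cols parameters are hypothetical and do not come from the answer.

```python
import itertools

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "group": [0, 0, 1, 2],
    "base": ["A", "A", "A", "A"],
    "height": [10, 20, 10, 5],
    "weight": [5, 5, 10, 5],
    "size": ["M", "M", "S", "L"],
})


def count_shared_entries(frame, group_col="group", cols=None):
    """Hypothetical helper: per pair of groups and per column, count
    the rows that share the most common value of that column."""
    if cols is None:
        cols = [c for c in frame.columns if c != group_col]
    rows = {}
    for pair in itertools.combinations(frame[group_col].unique(), 2):
        sub = frame.loc[frame[group_col].isin(pair), cols]
        # Size of the largest set of equal values in each column.
        top = sub.apply(lambda s: s.value_counts().iloc[0])
        # A value that occurs only once matches nothing, so report 0.
        top = top.where(top > 1, 0)
        label = ",".join(str(g) for g in pair)
        rows[label] = top.astype(str) + "/" + str(len(sub))
    # One row per pair, one column per compared column.
    return pd.DataFrame(rows).T.rename_axis("compare")


result = count_shared_entries(df)
print(result)
```

Passing a subset such as cols=["height", "size"] restricts the comparison to those columns, which may help with a wide frame like the (117764, 39) one mentioned in the comments. The loop still visits every pair of groups, so the run time grows quadratically with the number of groups.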

【Comments】: