Indexing based on multiple columns: answers

【Question title】: Indexing based on multiple columns
【Posted】: 2021-09-06 09:01:50
【Question description】:

I'm new to Python, and the following is an ongoing data engineering problem that I'm currently trying to solve.

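Table structure

Data:

Index 1:

It is sequential and increases by 1 with each new row.

Index 2: the problem

It depends on the values stored in columns [A, B, C, D, E]. If the value stays the same, those rows need to be assigned the same index.

For example: rows 1, 2, and 3 have 567 as the value in A, B, and C respectively. So index 2 is 100 for those 3 rows.

Record type:

1 - A
2 - B
3 - C
4 - D
5 - E

Code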

代码

import pandas as pd

data = [(100, 100, 1, 567, '', '', '', ''),
        (101, 100, 2, '', 567, '', '', ''),
        (102, 100, 3, '', '', 567, '', ''),
        (103, 101, 3, '', '', 568, '', ''),
        (104, 101, 4, '', '', '', 568, ''),
        (105, 101, 5, '', '', '', '', 568)]

# Create the data frame; dtype=str stores every value as a string
df = pd.DataFrame(data, columns=['index1', 'index2', 'record_type', 'A', 'B', 'C', 'D', 'E'], dtype=str)

# Join columns A-E into one string per row, using '$' as the separator
df['combined'] = df[['A', 'B', 'C', 'D', 'E']].stack().groupby(level=0).agg('$'.join)

# Clean the 'combined' column by stripping the '$' separators
df['combined_cleaned'] = df['combined'].replace({r'\$': ''}, regex=True)

I'm trying to use the combined_cleaned column to compute index2. I'm not sure whether this is the right approach, so suggestions are welcome.

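【Question discussion】:

A couple of clarifying questions. (1) Is the number of columns used in the calculation always the same, and are they always labelled "A to E"? (2) What happens if the value 567 shows up later in the dataframe (e.g. index1 = 109, column D = 567)?
df[list('ABCDE')].T.agg(''.join).factorize()[0] + 100 should be enough
@itprorh66 To answer your questions: 1) Total number of columns = 6: this is fixed. 2) If column D or E holds 567, index2 would be 100. Basically, a single row can only have one value across columns A to E. So if 567 appears in column D, that would be a new row with an index1 value of 106 and an index2 value of 100.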

Tags: python-3.x pandas indexing

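【Solution 1】:

This makes a few assumptions, but it seems to fit your problem.

If each row holds only 1 value across these columns, you can take the max along each row and then find the consecutive groups by checking where the Series differs from itself shifted by one row.

We add 99 because the cumulative count starts from 1 by definition, but you seem to want 100.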

val_cols = ['A', 'B', 'C', 'D', 'E']
s = df[val_cols].apply(pd.to_numeric).max(1)
#0    567.0
#1    567.0
#2    567.0
#3    568.0
#4    568.0
#5    568.0
#dtype: float64

df['index2'] = s.ne(s.shift()).cumsum() + 99

print(df)
  index1 record_type    A    B    C    D    E  index2
0    100           1  567                         100
1    101           2       567                    100
2    102           3            567               100
3    103           3            568               101
4    104           4                 568          101
5    105           5                      568     101

If, rather than each row holding only a single value, 'record_type' points to the appropriate column, you can use numpy indexing.

import numpy as np

arr = df[val_cols].to_numpy()
idx = df['record_type'].astype(int).to_numpy()

vals = arr[np.arange(len(arr)), idx-1]
#array(['567', '567', '567', '568', '568', '568'], dtype=object)
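The snippet above stops at `vals`. A minimal sketch of how it might be turned into index2 follows, assuming a hypothetical seventh row in which 567 reappears in column D (the case raised in the question comments). It contrasts the consecutive-run approach from this answer with `pd.factorize`; the column names `index2_runs` and `index2_unique` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the last row repeats 567 in column D.
df = pd.DataFrame(
    {
        "record_type": ["1", "2", "3", "3", "4", "5", "4"],
        "A": ["567", "", "", "", "", "", ""],
        "B": ["", "567", "", "", "", "", ""],
        "C": ["", "", "567", "568", "", "", ""],
        "D": ["", "", "", "", "568", "", "567"],
        "E": ["", "", "", "", "", "568", ""],
    }
)
val_cols = ["A", "B", "C", "D", "E"]

# Pick, for each row, the column that record_type points to (1 -> A, 2 -> B, ...).
arr = df[val_cols].to_numpy()
idx = df["record_type"].astype(int).to_numpy()
vals = pd.Series(arr[np.arange(len(arr)), idx - 1], index=df.index)

# Consecutive runs: a new id starts whenever the value changes from the row above.
df["index2_runs"] = vals.ne(vals.shift()).cumsum() + 99

# Distinct values: the same value always maps to the same id, wherever it appears.
df["index2_unique"] = pd.factorize(vals)[0] + 100

print(df[["record_type", "index2_runs", "index2_unique"]])
```

With this sample the two columns agree on the first six rows and differ only on the last one: the run-based id moves on to 102, while the factorize-based id returns to 100, which appears to be what the OP describes in the comments.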

【Discussion】:

【Solution 2】:

The combined_cleaned column can be generated directly with:

import numpy as np

cols = ['A', 'B', 'C', 'D', 'E']
df[cols].replace('', np.nan).apply(lambda x: x.dropna().item(), axis=1)
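This answer only produces the collapsed column. As a rough sketch of one possible next step, assuming each row holds exactly one non-empty value and that index2 should start at 100 as in the question's sample, the result could be passed to `pd.factorize`. The variable name `combined` is chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same shape as the question's data, reduced to the value columns.
df = pd.DataFrame(
    {
        "A": ["567", "", "", "", "", ""],
        "B": ["", "567", "", "", "", ""],
        "C": ["", "", "567", "568", "", ""],
        "D": ["", "", "", "", "568", ""],
        "E": ["", "", "", "", "", "568"],
    }
)
cols = ["A", "B", "C", "D", "E"]

# Collapse A-E to the single non-empty value per row.
# .item() raises ValueError if a row does not hold exactly one value.
combined = df[cols].replace("", np.nan).apply(lambda x: x.dropna().item(), axis=1)

# Assumption: ids start at 100, matching the sample data in the question.
df["index2"] = pd.factorize(combined)[0] + 100

print(df)
```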

【Discussion】:

The OP didn't mention that part of the code, so I assumed they already had it.

【Solution 3】:

You can also try stack followed by factorize:

cols = ['A', 'B', 'C','D','E']
s = pd.factorize(df[cols].replace('',np.nan).stack())[0]
df['index2_new'] = int(df['index1'].iat[0]) + s

print(df)

  index1 index2 record_type    A    B    C    D    E  index2_new
0    100    100           1  567                             100
1    101    100           2       567                        100
2    102    100           3            567                   100
3    103    101           3            568                   101
4    104    101           4                 568              101
5    105    101           5                      568         101

【Discussion】:

What does factorize do? @anky
@Vishnudev It works like a string indexer for each unique value
Oh, that's cool. Learned something new. Thanks @anky
@anky Index1 and Index2 are independent
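To illustrate the factorize question from the comments above, here is a small standalone sketch. The sample values are made up; the fixed offset of 100 is an assumption taken from the question's sample data, used instead of reading the start value from index1 because the OP notes that the two indexes are independent.

```python
import pandas as pd

# Made-up values; note that "567" appears again after "568".
values = pd.Series(["567", "567", "568", "567", "569"])

# factorize returns one integer code per element, plus the unique values
# in order of first appearance.
codes, uniques = pd.factorize(values)

print(codes.tolist())   # [0, 0, 1, 0, 2]
print(list(uniques))    # ['567', '568', '569']

# Assumption: index2 starts at a fixed offset of 100.
index2 = codes + 100
print(index2.tolist())  # [100, 100, 101, 100, 102]
```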