【Question title】: Complicated refer to another table
【Posted】: 2016-08-28 12:05:43
【Question description】:

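My DataFrame looks like the one below. The 'Types' column shows the type of each row.

I want to add another column named 'number', defined as follows.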

import numpy as np
import pandas as pd

df=pd.DataFrame({'Sex':['M','F','F','M'],'Age':[30,31,33,32],'Types':['A','C','B','D']})

Out[8]: 

    Age Sex  Types
0   30   M      A
1   31   F      C
2   33   F      B
3   32   M      D

Below is the table for males; each column represents a type.

(Creating the table was hard for me. Is there a simpler way to create it?)

table_M = pd.DataFrame(np.arange(20).reshape(4,5),index=[30,31,32,33],columns=["A","B","C","D","E"])
table_M.index.name="Age(male)"

         A      B      C      D      E
Age(male)                                   
30       0      1      2      3      4
31       5      6      7      8      9
32      10     11     12     13     14
33      15     16     17     18     19
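
On the side question about a simpler way to build such a table: one option, sketched here on the assumption that the values are typed in by hand, is to pass a dict that maps each column label to its list of values. The name table_M_alt is made up for this illustration.

```python
import numpy as np
import pandas as pd

ages = [30, 31, 32, 33]

# Hypothetical alternative construction: one dict entry per type column.
table_M_alt = pd.DataFrame(
    {
        "A": [0, 5, 10, 15],
        "B": [1, 6, 11, 16],
        "C": [2, 7, 12, 17],
        "D": [3, 8, 13, 18],
        "E": [4, 9, 14, 19],
    },
    index=pd.Index(ages, name="Age(male)"),
)

# The construction used in the question, for comparison.
table_M = pd.DataFrame(np.arange(20).reshape(4, 5), index=ages, columns=list("ABCDE"))
table_M.index.name = "Age(male)"

# Both constructions hold the same values.
same_values = bool((table_M_alt.to_numpy() == table_M.to_numpy()).all())
print(same_values)
```

For real data the table would more likely be loaded from a file (for example with pd.read_csv and index_col) than typed in.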

And below is the table for females:

table_F = pd.DataFrame(np.arange(20,40).reshape(4,5),index=[30,31,32,33],columns=["A","B","C","D","E"])
table_F.index.name="Age(female)"

        A      B      C      D      E
Age(female)                                   
30      20     21     22     23     24
31      25     26     27     28     29
32      30     31     32     33     34
33      35     36     37     38     39

So I want to add a 'number' column as shown below:

    Age Sex  Types   number
0   30   M      A      0 
1   31   F      C     27
2   33   F      B     36
3   32   M      D     13

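The 'number' column is looked up in the female and male tables, using each row's age, type, and sex. This is too complicated for me. How can I add the 'number' column?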

【Question comments】:

Tags: python pandas dataframe

【Solution 1】:

I suggest reshaping your male and female tables:

males = (table_M.stack().to_frame('number').assign(Sex='M').reset_index()
                .rename(columns={'Age(male)': 'Age', 'level_1': 'Types'}))

females = (table_F.stack().to_frame('number').assign(Sex='F').reset_index()
                  .rename(columns={'Age(female)': 'Age', 'level_1': 'Types'}))

reshaped = pd.concat([males, females], ignore_index=True)

Then merge:

df.merge(reshaped)
Out: 
   Age Sex Types  number
0   30   M     A       0
1   31   F     C      27
2   33   F     B      36
3   32   M     D      13

What this does is stack the columns of the male and female tables and assign an indicator column showing the sex ('M' or 'F'). females.head() looks like this:

females.head()
Out: 
   Age Types  number Sex
0   30     A      20   F
1   30     B      21   F
2   30     C      22   F
3   30     D      23   F
4   30     E      24   F

And males.head():

males.head()
Out: 
   Age Types  number Sex
0   30     A       0   M
1   30     B       1   M
2   30     C       2   M
3   30     D       3   M
4   30     E       4   M

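pd.concat combines these two into a single DataFrame. By default, merge joins on the columns the two frames have in common, so it looks for matches in the 'Age', 'Sex', and 'Types' columns and merges the two DataFrames on that basis.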
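
The same reshape-and-merge idea can be written with the join keys spelled out. This is only a sketch: the helper to_long is a hypothetical name, and the choice of how='left' is an assumption made here so that every row of df survives even when a combination is missing from the tables.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Sex': ['M', 'F', 'F', 'M'],
                   'Age': [30, 31, 33, 32],
                   'Types': ['A', 'C', 'B', 'D']})
ages, types = [30, 31, 32, 33], list('ABCDE')
table_M = pd.DataFrame(np.arange(20).reshape(4, 5), index=ages, columns=types)
table_F = pd.DataFrame(np.arange(20, 40).reshape(4, 5), index=ages, columns=types)

def to_long(table, sex):
    # Hypothetical helper: turn a wide (Age x Types) table into long rows
    # with columns Age, Types, number, Sex.
    long_form = table.rename_axis(index='Age', columns='Types').stack()
    return long_form.rename('number').reset_index().assign(Sex=sex)

reshaped = pd.concat([to_long(table_M, 'M'), to_long(table_F, 'F')],
                     ignore_index=True)

# Explicit keys plus a left join: rows of df keep their order, and a row
# with no match would get NaN instead of being dropped.
result = df.merge(reshaped, on=['Age', 'Sex', 'Types'], how='left')
print(result)
```

Naming the axes before stacking avoids relying on the auto-generated 'level_1' column name.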

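Another possibility is to use df.lookup (available in older pandas versions; DataFrame.lookup was deprecated in 1.2.0 and removed in 2.0.0):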

df.loc[df['Sex']=='M', 'number'] = table_M.lookup(*df.loc[df['Sex']=='M', ['Age', 'Types']].values.T)
df.loc[df['Sex']=='F', 'number'] = table_F.lookup(*df.loc[df['Sex']=='F', ['Age', 'Types']].values.T)

df
Out: 
   Age Sex Types  number
0   30   M     A     0.0
1   31   F     C    27.0
2   33   F     B    36.0
3   32   M     D    13.0

This looks up the males in table_M and the females in table_F.
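
On pandas versions without DataFrame.lookup, a comparable label-based lookup can be sketched with Index.get_indexer and NumPy integer indexing. This is a substitute technique rather than the code from the answer, and the tables dict is a name introduced only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Sex': ['M', 'F', 'F', 'M'],
                   'Age': [30, 31, 33, 32],
                   'Types': ['A', 'C', 'B', 'D']})
ages, types = [30, 31, 32, 33], list('ABCDE')
table_M = pd.DataFrame(np.arange(20).reshape(4, 5), index=ages, columns=types)
table_F = pd.DataFrame(np.arange(20, 40).reshape(4, 5), index=ages, columns=types)

tables = {'M': table_M, 'F': table_F}  # hypothetical mapping: sex -> table

df['number'] = 0  # placeholder, overwritten below for every row
for sex, table in tables.items():
    mask = df['Sex'] == sex
    # Translate labels into integer positions inside the table.
    rows = table.index.get_indexer(df.loc[mask, 'Age'])
    cols = table.columns.get_indexer(df.loc[mask, 'Types'])
    # get_indexer marks unknown labels with -1, so fail loudly instead of
    # silently picking the last row or column.
    if (rows < 0).any() or (cols < 0).any():
        raise KeyError(f'unknown Age or Types label for sex {sex!r}')
    df.loc[mask, 'number'] = table.to_numpy()[rows, cols]

print(df)
```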

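【Comments】:

There is no need to create a new DataFrame. See my answer.
@NehalJWani apply with axis=1 is very inefficient and should be avoided whenever a vectorized alternative exists. With only four rows it is probably fine, but with millions of records it will take forever to run. Beyond that, in my opinion tidying a dataset is never wasted effort. It makes the subsequent steps of your analysis much easier.
With 10,000 rows, apply takes 41.1 seconds, whereas stack/concat/merge finishes in 482 milliseconds, most of which is overhead (increasing it to 1 million rows takes 672 milliseconds).
Ah, I see. Thanks for the information :) You learn something new every day!
@MiyashitaHikaru lookup takes two arguments: row_labels and col_labels. df.loc[df['Sex']=='M', ['Age', 'Types']].values.T, on the other hand, is a single NumPy array that contains two other NumPy arrays. The * operator unpacks that outer array, so the first inner array becomes row_labels and the second becomes col_labels. Here is a more detailed explanation.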
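
The unpacking described in the last comment can be seen in isolation with a small sketch. The function show_args is a hypothetical stand-in that only mimics a two-argument signature like lookup(row_labels, col_labels).

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['M', 'F', 'F', 'M'],
                   'Age': [30, 31, 33, 32],
                   'Types': ['A', 'C', 'B', 'D']})

# One (Age, Types) row per male: a 2-D array of shape (2, 2).
pairs = df.loc[df['Sex'] == 'M', ['Age', 'Types']].values

# After transposing, the first row holds all ages and the second all types.
transposed = pairs.T

def show_args(row_labels, col_labels):
    # Hypothetical stand-in for a function taking two label sequences.
    return list(row_labels), list(col_labels)

# The * operator passes each row of the transposed array as its own argument.
unpacked = show_args(*transposed)
print(unpacked)
```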

【Solution 2】:

It becomes easier if you combine the two tables into one, so that 'Sex' can be used as part of the lookup inside apply.

table = pd.concat([table_F, table_M], axis=1, keys=['F', 'M'])

accessor = lambda row: table.loc[row.Age, (row.Sex, row.Types)]
df['number'] = df.apply(accessor, axis=1)
df
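
A combined table also allows a lookup without apply. The following is a sketch of one such variant, not the code from this answer: it concatenates along the rows instead of the columns, stacks the result into a Series keyed by (Sex, Age, Types), and reindexes it with one key per row of df. The names flat and keys are introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Sex': ['M', 'F', 'F', 'M'],
                   'Age': [30, 31, 33, 32],
                   'Types': ['A', 'C', 'B', 'D']})
ages, types = [30, 31, 32, 33], list('ABCDE')
table_M = pd.DataFrame(np.arange(20).reshape(4, 5), index=ages, columns=types)
table_F = pd.DataFrame(np.arange(20, 40).reshape(4, 5), index=ages, columns=types)

# Tag each block of rows with its sex, then flatten the columns so that the
# Series index has three levels: (Sex, Age, Types).
flat = pd.concat([table_F, table_M], keys=['F', 'M']).stack()

# One (Sex, Age, Types) key per row of df; reindex fetches all values at once.
keys = pd.MultiIndex.from_arrays([df['Sex'], df['Age'], df['Types']])
df['number'] = flat.reindex(keys).to_numpy()
print(df)
```

A key that is absent from the tables would come back as NaN here rather than raising an error.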

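【Comments】:

I spent quite a lot of time googling how to add an indicator column with pd.concat. It never occurred to me to look at the parameters. :)
This is a nice solution!
@piRSquared - congratulations on 20k. ;)

【Solution 3】:

Another approach: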

In [60]: df['numbers'] = df.apply(lambda x: table_F.loc[[x.Age]][x.Types].iloc[0] if x.Sex == 'F' else table_M.loc[[x.Age]][x.Types].iloc[0], axis = 1)

In [60]: df
Out[60]: 
   Age Sex Types  numbers
0   30   M     A        0
1   31   F     C       27
2   33   F     B       36
3   32   M     D       13
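
The same row-by-row idea can be written a little more compactly with a dict of tables and the scalar accessor .at. This is a sketch under the assumption that every (Age, Types) pair exists in the matching table; the tables dict is a name made up for the example, and the loop is still not vectorized.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Sex': ['M', 'F', 'F', 'M'],
                   'Age': [30, 31, 33, 32],
                   'Types': ['A', 'C', 'B', 'D']})
ages, types = [30, 31, 32, 33], list('ABCDE')
table_M = pd.DataFrame(np.arange(20).reshape(4, 5), index=ages, columns=types)
table_F = pd.DataFrame(np.arange(20, 40).reshape(4, 5), index=ages, columns=types)

tables = {'M': table_M, 'F': table_F}  # hypothetical mapping: sex -> table

# .at fetches a single cell by row label and column label.
df['numbers'] = [tables[sex].at[age, typ]
                 for sex, age, typ in zip(df['Sex'], df['Age'], df['Types'])]
print(df)
```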

【Comments】: