Pandas Groupby / List to Multiple Rows

【Question title】: Pandas Groupby / List to Multiple Rows
【Posted】: 2021-09-07 13:32:47
【Question】:

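In this example I have 7 columns per row. I group by AccountID and Last Name, which together identify the same person. Different row values for Contract, Address, City, and State represent a new location for that AccountID/Last Name.

I want each AccountID/Last Name on a single row, followed by one or more sets of Contract, Address, City, and State.

The current data looks like this: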

Contract	AccountID	Last Name	First Name	Address	City	State
622	1234	Pitt	Brad	466 7th Ave	Park Slope	NY
28974	1234	Pitt	Brad	1901 Vine Street	Philadelphia	PA
54122	4321	Ford	Henry	93 Booth Dr	Nutley	NJ
622	2345	Rhodes	Dusty	1 Public Library Plaze	Stamford	CT
28974	2345	Rhodes	Dusty	1001 Kings Highway	Cherry Hill	NJ
54122	2345	Rhodes	Dusty	444 Amsterdamn Ave	Upper West Side	NY

I would like the data displayed like this:

AccountID	Last Name	First Name	Contract.1	Address_1	City_1	State_1	Contract_2	Address_2	City_2	State_2	Contract_3	Address_3	City_3	State_3
1234	Pitt	Brad	622	466 7th Ave	Park Slope	NY	28974.0	1901 Vine Street	Philadelphia	PA
4321	Ford	Henry	54122	93 Booth Dr	Nutley	NJ
2345	Rhodes	Dusty	622	1 Public Library Plaze	Stamford	CT	28974.0	1001 Kings Highway	Cherry Hill	NJ	54122.0	444 Amsterdamn Ave	Upper West Side	NY

Here is what I have done so far. I have been reworking step 5 onward for a week. Any suggestions?

# Step 1
import pandas as pd
import numpy as np
# read from "my clipboard"
df = pd.read_clipboard()
df

#Step 2
df['Contract_State'] = (df['Contract'].astype(str) + '|' + df['Address']  + '|' + df['City']  + '|' + df['State']).str.split()
df['Contract'] = df['Contract'].astype(str)
df['AccountID'] = df['AccountID'].astype(str)

# Step 3 - groupby
df2 = pd.DataFrame(df.groupby(['AccountID', 'Last Name']).Contract_State.apply(list)).reset_index()
df2

# Step 4 - flatten the lists
df2['Contract_State'] = df2['Contract_State'].apply(lambda x: np.array(x).flatten())
df2

# Step 5 - the number of elements in each list is always even => /2
num_columns = df2['Contract_State'].apply(len).max()
num_columns 

# Step 6
# NOTE: `columns` is never defined in the steps above, so this line raises NameError
df3 = pd.DataFrame(list(df2['Contract_State']), columns=columns)
df3

# Step 7 - concatenate df2 with contracts, then drop the column "Contract_State"
df4 = pd.concat([df2, df3], join='inner', axis='columns').drop('Contract_State', axis='columns')
df4
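
For readers who do not have the table on their clipboard, the sample data can also be built directly. This is a minimal sketch that only reuses the rows shown in the question; the variable names `rows` and `column_names` are illustrative and not part of the original post.

```python
import pandas as pd

# Rows copied from the question's sample table (spelling kept as posted).
rows = [
    (622, 1234, "Pitt", "Brad", "466 7th Ave", "Park Slope", "NY"),
    (28974, 1234, "Pitt", "Brad", "1901 Vine Street", "Philadelphia", "PA"),
    (54122, 4321, "Ford", "Henry", "93 Booth Dr", "Nutley", "NJ"),
    (622, 2345, "Rhodes", "Dusty", "1 Public Library Plaze", "Stamford", "CT"),
    (28974, 2345, "Rhodes", "Dusty", "1001 Kings Highway", "Cherry Hill", "NJ"),
    (54122, 2345, "Rhodes", "Dusty", "444 Amsterdamn Ave", "Upper West Side", "NY"),
]
column_names = [
    "Contract", "AccountID", "Last Name", "First Name",
    "Address", "City", "State",
]

# Equivalent to what pd.read_clipboard() would return for the table above.
df = pd.DataFrame(rows, columns=column_names)
print(df)
```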

【Comments on the question】:

As you can see, there are many ways to reshape your table, but the most obvious trick is to use groupby and cumcount.
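
To make the groupby and cumcount trick concrete, here is a small sketch on a cut-down version of the sample data (only three of the seven columns are kept, and the helper column name `n` is an assumption for illustration). `cumcount` numbers the rows inside each group, and that number is what later becomes the column suffix.

```python
import pandas as pd

# Cut-down copy of the question's data: just the keys and one value column.
df = pd.DataFrame({
    "AccountID": [1234, 1234, 4321, 2345, 2345, 2345],
    "Last Name": ["Pitt", "Pitt", "Ford", "Rhodes", "Rhodes", "Rhodes"],
    "Contract": [622, 28974, 54122, 622, 28974, 54122],
})

# cumcount() numbers rows within each group starting at 0; +1 makes it start at 1.
df["n"] = df.groupby(["AccountID", "Last Name"]).cumcount() + 1

# Moving that counter from the index into the columns gives one row per person.
wide = df.set_index(["AccountID", "Last Name", "n"])["Contract"].unstack("n")
print(df)
print(wide)
```

People with fewer rows than the largest group get NaN in the unused slots, which is why the integer Contract values turn into floats in the wide result.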

Tags: python pandas pandas-groupby

【Solution 1】:

IIUC, I think you can do it like this:

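# Number the rows within each (AccountID, Last Name) group and pivot that number into the columns
dfg = df.groupby(['AccountID', 'Last Name', df.groupby(['AccountID', 'Last Name']).cumcount() + 1]).first().unstack()
# Flatten the (column, number) MultiIndex into names such as 'Contract1'
dfg.columns = [f'{i}{j}' for i, j in dfg.columns]
# Order the columns by their trailing digit (assumes fewer than 10 rows per group)
df_out = dfg.sort_index(axis=1, key=lambda x: x.str[-1])
df_out.reset_index()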

Output:

   AccountID Last Name  Contract1 First Name1                Address1       City1 State1  Contract2 First Name2            Address2         City2 State2  Contract3 First Name3            Address3            City3 State3
0       1234      Pitt      622.0        Brad             466 7th Ave  Park Slope     NY    28974.0        Brad    1901 Vine Street  Philadelphia     PA        NaN         NaN                 NaN              NaN    NaN
1       2345    Rhodes      622.0       Dusty  1 Public Library Plaze    Stamford     CT    28974.0       Dusty  1001 Kings Highway   Cherry Hill     NJ    54122.0       Dusty  444 Amsterdamn Ave  Upper West Side     NY
2       4321      Ford    54122.0       Henry             93 Booth Dr      Nutley     NJ        NaN         NaN                 NaN           NaN    NaN        NaN         NaN                 NaN              NaN    NaN
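
The Contract columns print as `622.0` because the empty slots are NaN, which forces a float dtype. If integer display matters, one possible follow-up (an addition here, not part of the original answer) is to cast those columns to pandas' nullable `Int64` dtype. The sketch below uses a small stand-in frame with only the Contract columns rather than the full result.

```python
import numpy as np
import pandas as pd

# Stand-in for the wide result above, reduced to the Contract columns.
df_out = pd.DataFrame({
    "AccountID": [1234, 2345, 4321],
    "Contract1": [622.0, 622.0, 54122.0],
    "Contract2": [28974.0, 28974.0, np.nan],
    "Contract3": [np.nan, 54122.0, np.nan],
})

# Cast every Contract* column to the nullable integer dtype;
# missing values become <NA> instead of forcing floats.
contract_cols = [c for c in df_out.columns if c.startswith("Contract")]
df_out = df_out.astype({c: "Int64" for c in contract_cols})
print(df_out)
```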

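【Discussion】:

This is so clean!
@not_speshal I like yours and Pygirl's too! +1 from me on both!
Like you said, similar tricks with groupby and cumcount :). I do think your answer should be the accepted one!
Thank you very much! This code works like a charm! I will keep studying it.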

【Solution 2】:

Try groupby and pivot_table:

# Number the rows within each person's group; this becomes the column suffix
df["group"] = df.groupby(["AccountID", "Last Name", "First Name"]).cumcount() + 1
# "Contract" is included in values because the desired output asks for it
output = df.pivot_table(index=["AccountID", "Last Name", "First Name"],
                        columns='group',
                        values=['Contract', 'Address', 'City', 'State'],
                        aggfunc='first')
output = output.sort_index(axis=1, level=1)
output.columns = [f"{i}_{j}" for i, j in output.columns]
output = output.reset_index()

【Discussion】:

【Solution 3】:

You can try groupby and unstack:

grp_col = ['AccountID', 'Last Name', 'First Name']
df['num'] = df.groupby(grp_col).cumcount()+1
res = df.set_index([*grp_col, 'num']).unstack('num').sort_index(axis=1, level=1).reset_index()
res.columns = res.columns.map(lambda x: f"{x[0]}{x[1]}")

res:

	AccountID	Last Name	First Name	Contract1	Address1	City1	State1	Contract2	Address2	City2	State2	Contract3	Address3	City3	State3
0	1234	Pitt	Brad	622.0	466 7th Ave	Park Slope	NY	28974.0	1901 Vine Street	Philadelphia	PA	NaN	NaN	NaN	NaN
1	2345	Rhodes	Dusty	622.0	1 Public Library Plaze	Stamford	CT	28974.0	1001 Kings Highway	Cherry Hill	NJ	54122.0	444 Amsterdamn Ave	Upper West Side	NY
2	4321	Ford	Henry	54122.0	93 Booth Dr	Nutley	NJ	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

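【Discussion】:

【Solution 4】:

We can pass a Series directly to pivot_table as `columns`, with `first` as the aggfunc. Use groupby cumcount to enumerate the rows within each group; these numbers become the new column suffixes: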

cols = ['AccountID', 'Last Name', 'First Name']
dfp = (
    df.pivot_table(
        index=cols,
        columns=df.groupby(cols).cumcount() + 1,
        aggfunc='first'
    ).sort_index(axis=1, level=1, sort_remaining=False)
)
# Collapse Multi-Index
dfp.columns = dfp.columns.map(lambda t: '_'.join(map(str, t)))
dfp = dfp.reset_index()

Or use set_index + unstack without the groupby `first` aggregation, since cumcount ensures column uniqueness:

cols = ['AccountID', 'Last Name', 'First Name']
dfw = df.set_index(
    [*cols, df.groupby(cols).cumcount() + 1]
).unstack().sort_index(axis=1, level=1, sort_remaining=False)
# Collapse Multi-Index
dfw.columns = dfw.columns.map(lambda t: '_'.join(map(str, t)))
dfw = dfw.reset_index()

Either option produces:

   AccountID Last Name First Name               Address_1      City_1  Contract_1 State_1           Address_2        City_2  Contract_2 State_2           Address_3           City_3  Contract_3 State_3
0       1234      Pitt       Brad             466 7th Ave  Park Slope       622.0      NY    1901 Vine Street  Philadelphia     28974.0      PA                 NaN              NaN         NaN     NaN
1       2345    Rhodes      Dusty  1 Public Library Plaze    Stamford       622.0      CT  1001 Kings Highway   Cherry Hill     28974.0      NJ  444 Amsterdamn Ave  Upper West Side     54122.0      NY
2       4321      Ford      Henry             93 Booth Dr      Nutley     54122.0      NJ                 NaN           NaN         NaN     NaN                 NaN              NaN         NaN     NaN
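
The "Collapse Multi-Index" step in both snippets can be looked at in isolation. After the pivot, each column label is a tuple such as `('Address', 1)`, and joining its parts with `_` gives the flat names shown above. This is a minimal sketch on a hand-built one-row frame (the frame itself is made up for illustration from values in the question).

```python
import pandas as pd

# Hand-built two-level columns, shaped like the result of the pivot.
cols = pd.MultiIndex.from_tuples(
    [("Address", 1), ("City", 1), ("Address", 2), ("City", 2)]
)
wide = pd.DataFrame(
    [["466 7th Ave", "Park Slope", "1901 Vine Street", "Philadelphia"]],
    columns=cols,
)

# Each label is a tuple; convert every part to str and join with "_".
wide.columns = wide.columns.map(lambda t: "_".join(map(str, t)))
print(wide.columns.tolist())
# ['Address_1', 'City_1', 'Address_2', 'City_2']
```

The `map(str, t)` is needed because the second level holds integers, and `str.join` only accepts strings.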

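The pyjanitor module has an abstraction for this operation called pivot_wider, which hides the collapsing of the MultiIndex and the restoration of the index columns: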

# pip install pyjanitor
# conda install pyjanitor -c conda-forge
import janitor
import pandas as pd


cols = ['AccountID', 'Last Name', 'First Name']
dfw = (
    df.add_column(
        'group', df.groupby(cols).cumcount() + 1
    ).pivot_wider(
        index=cols,
        names_from='group'
    )
)

   AccountID Last Name First Name  Contract_1  Contract_2  Contract_3               Address_1           Address_2           Address_3      City_1        City_2           City_3 State_1 State_2 State_3
0       1234      Pitt       Brad       622.0     28974.0         NaN             466 7th Ave    1901 Vine Street                 NaN  Park Slope  Philadelphia              NaN      NY      PA     NaN
1       2345    Rhodes      Dusty       622.0     28974.0     54122.0  1 Public Library Plaze  1001 Kings Highway  444 Amsterdamn Ave    Stamford   Cherry Hill  Upper West Side      CT      NJ      NY
2       4321      Ford      Henry     54122.0         NaN         NaN             93 Booth Dr                 NaN                 NaN      Nutley           NaN              NaN      NJ     NaN     NaN

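There is also an abstraction for collapsing the MultiIndex, janitor.collapse_levels, which can be combined with plain pandas operations to get cleaner-looking code without giving up the flexibility that pivot_table and sort_index provide: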

cols = ['AccountID', 'Last Name', 'First Name']
dfp = (
    df.pivot_table(
        index=cols,
        columns=df.groupby(cols).cumcount() + 1,
        aggfunc='first'
    ).sort_index(
        axis=1, level=1, sort_remaining=False
    ).collapse_levels(sep='_').reset_index()
)

dfp:

   AccountID Last Name First Name               Address_1      City_1  Contract_1 State_1           Address_2        City_2  Contract_2 State_2           Address_3           City_3  Contract_3 State_3
0       1234      Pitt       Brad             466 7th Ave  Park Slope       622.0      NY    1901 Vine Street  Philadelphia     28974.0      PA                 NaN              NaN         NaN     NaN
1       2345    Rhodes      Dusty  1 Public Library Plaze    Stamford       622.0      CT  1001 Kings Highway   Cherry Hill     28974.0      NJ  444 Amsterdamn Ave  Upper West Side     54122.0      NY
2       4321      Ford      Henry             93 Booth Dr      Nutley     54122.0      NJ                 NaN           NaN         NaN     NaN                 NaN              NaN         NaN     NaN
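
The outputs above list each group's columns alphabetically (Address, City, Contract, State), while the question asks for Contract, Address, City, State. One way to get that order with plain pandas is to select the columns explicitly before flattening. This is a sketch under the assumption that the four value columns are known up front; the names `keys`, `value_cols` and `ordered` are illustrative and do not come from any of the answers.

```python
import pandas as pd

rows = [
    (622, 1234, "Pitt", "Brad", "466 7th Ave", "Park Slope", "NY"),
    (28974, 1234, "Pitt", "Brad", "1901 Vine Street", "Philadelphia", "PA"),
    (54122, 4321, "Ford", "Henry", "93 Booth Dr", "Nutley", "NJ"),
    (622, 2345, "Rhodes", "Dusty", "1 Public Library Plaze", "Stamford", "CT"),
    (28974, 2345, "Rhodes", "Dusty", "1001 Kings Highway", "Cherry Hill", "NJ"),
    (54122, 2345, "Rhodes", "Dusty", "444 Amsterdamn Ave", "Upper West Side", "NY"),
]
df = pd.DataFrame(rows, columns=["Contract", "AccountID", "Last Name",
                                 "First Name", "Address", "City", "State"])

keys = ["AccountID", "Last Name", "First Name"]
value_cols = ["Contract", "Address", "City", "State"]  # desired order per group

# Row number within each person's group (1, 2, 3, ...).
n = df.groupby(keys).cumcount() + 1

# Pivot the row number into the columns: labels become (value_col, number).
wide = df.set_index([*keys, n])[value_cols].unstack()

# Pick the columns group by group, in the order the question asks for.
ordered = [(col, i) for i in range(1, int(n.max()) + 1) for col in value_cols]
wide = wide[ordered]

# Flatten the labels to 'Contract_1', 'Address_1', ... and restore the keys.
wide.columns = [f"{col}_{i}" for col, i in wide.columns]
wide = wide.reset_index()
print(wide)
```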

【Discussion】: