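【Question title】: Pandas Groupby / List to Multiple Rows
【Posted】: 2021-09-07 13:32:47
【Question description】:

In this example I have a total of 7 columns per row. I am grouping by AccountID and Last Name, which together identify the same person; different row values for Contract, Address, City, and State represent a new location for that AccountID/Last Name.

I would like to put each AccountID/Last Name on a single row, together with one or more sets of Contract, Address, City, and State.

The current data looks like this: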

Contract  AccountID  Last Name  First Name  Address                 City             State
622       1234       Pitt       Brad        466 7th Ave             Park Slope       NY
28974     1234       Pitt       Brad        1901 Vine Street        Philadelphia     PA
54122     4321       Ford       Henry       93 Booth Dr             Nutley           NJ
622       2345       Rhodes     Dusty       1 Public Library Plaze  Stamford         CT
28974     2345       Rhodes     Dusty       1001 Kings Highway      Cherry Hill      NJ
54122     2345       Rhodes     Dusty       444 Amsterdamn Ave      Upper West Side  NY
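
For anyone who wants to reproduce the problem without the clipboard, the sample can be built directly. This is only a sketch: the literal values are copied from the table as posted (including the spelling of the addresses), and the name `df` is chosen to match the code further down.

```python
import pandas as pd

# Sample rows copied from the question; spellings are kept exactly as posted.
df = pd.DataFrame({
    "Contract": [622, 28974, 54122, 622, 28974, 54122],
    "AccountID": [1234, 1234, 4321, 2345, 2345, 2345],
    "Last Name": ["Pitt", "Pitt", "Ford", "Rhodes", "Rhodes", "Rhodes"],
    "First Name": ["Brad", "Brad", "Henry", "Dusty", "Dusty", "Dusty"],
    "Address": [
        "466 7th Ave", "1901 Vine Street", "93 Booth Dr",
        "1 Public Library Plaze", "1001 Kings Highway", "444 Amsterdamn Ave",
    ],
    "City": [
        "Park Slope", "Philadelphia", "Nutley",
        "Stamford", "Cherry Hill", "Upper West Side",
    ],
    "State": ["NY", "PA", "NJ", "CT", "NJ", "NY"],
})
print(df)
```

Building the frame in code avoids the ambiguity of whitespace-separated clipboard data, where multi-word fields such as the address can be split into several columns.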

I would like the data to be displayed like this:

AccountID  Last Name  First Name  Contract_1  Address_1               City_1      State_1  Contract_2  Address_2           City_2        State_2  Contract_3  Address_3           City_3           State_3
1234       Pitt       Brad        622         466 7th Ave             Park Slope  NY       28974.0     1901 Vine Street    Philadelphia  PA
4321       Ford       Henry       54122       93 Booth Dr             Nutley      NJ
2345       Rhodes     Dusty       622         1 Public Library Plaze  Stamford    CT       28974.0     1001 Kings Highway  Cherry Hill   NJ       54122.0     444 Amsterdamn Ave  Upper West Side  NY

This is what I have done so far. I have been reworking Step 5 and everything after it for a week. Any suggestions?

# Step 1
import pandas as pd
import numpy as np
# read from "my clipboard"
df = pd.read_clipboard()
df

#Step 2
df['Contract_State'] = (df['Contract'].astype(str) + '|' + df['Address']  + '|' + df['City']  + '|' + df['State']).str.split()
df['Contract'] = df['Contract'].astype(str)
df['AccountID'] = df['AccountID'].astype(str)

# Step 3 - groupby
df2 = pd.DataFrame(df.groupby(['AccountID', 'Last Name']).Contract_State.apply(list)).reset_index()
df2

# Step 4 - flatten the lists
df2['Contract_State'] = df2['Contract_State'].apply(lambda x: np.array(x).flatten())
df2

# Step 5 - The number of elements in each list is always even => /2
num_columns = df2['Contract_State'].apply(len).max()
num_columns 

# Step 6
df3 = pd.DataFrame(list(df2['Contract_State']), columns=columns)
df3

# Step 7 - concatenate df2 with contracts, then drop the column "Contract_State"
df4 = pd.concat([df2, df3], join='inner', axis='columns').drop('Contract_State', axis='columns')
df4

【Question comments】:

  • As you can see, there are many ways to reshape your table, but the most obvious trick is to use groupby with cumcount.
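
The trick mentioned in this comment can be seen in isolation. The following is a minimal sketch on a cut-down version of the question's data; the column name `num` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "AccountID": [1234, 1234, 4321, 2345, 2345, 2345],
    "Last Name": ["Pitt", "Pitt", "Ford", "Rhodes", "Rhodes", "Rhodes"],
    "Contract": [622, 28974, 54122, 622, 28974, 54122],
})

# cumcount() numbers the rows inside each group starting at 0;
# adding 1 turns that into the 1-based suffix used for the wide columns.
df["num"] = df.groupby(["AccountID", "Last Name"]).cumcount() + 1
print(df)
```

Each person's rows are now labelled 1, 2, 3, and that label is what the answers spread into columns.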

Tags: python pandas pandas-groupby


【Solution 1】:

IIUC, I think you can do it like this:

dfg = df.groupby(['AccountID', 'Last Name', df.groupby(['AccountID', 'Last Name']).cumcount() + 1]).first().unstack()
dfg.columns = [f'{i}{j}' for i, j in dfg.columns]
df_out = dfg.sort_index(axis=1, key=lambda x: x.str[-1])
df_out.reset_index()

Output:

   AccountID Last Name  Contract1 First Name1                Address1       City1 State1  Contract2 First Name2            Address2         City2 State2  Contract3 First Name3            Address3            City3 State3
0       1234      Pitt      622.0        Brad             466 7th Ave  Park Slope     NY    28974.0        Brad    1901 Vine Street  Philadelphia     PA        NaN         NaN                 NaN              NaN    NaN
1       2345    Rhodes      622.0       Dusty  1 Public Library Plaze    Stamford     CT    28974.0       Dusty  1001 Kings Highway   Cherry Hill     NJ    54122.0       Dusty  444 Amsterdamn Ave  Upper West Side     NY
2       4321      Ford    54122.0       Henry             93 Booth Dr      Nutley     NJ        NaN         NaN                 NaN           NaN    NaN        NaN         NaN                 NaN              NaN    NaN
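
The Contract columns print as floats (622.0) because the empty slots are filled with NaN, which forces a float dtype. If integer display matters, one option is pandas' nullable `Int64` dtype. The sketch below uses a cut-down frame and `set_index` + `unstack` rather than the exact code above; the names `num` and `wide` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "AccountID": [1234, 1234, 4321, 2345, 2345, 2345],
    "Contract": [622, 28974, 54122, 622, 28974, 54122],
})

# Number each account's rows, then spread the contracts into columns.
num = df.groupby("AccountID").cumcount() + 1
wide = df.set_index(["AccountID", num])["Contract"].unstack()
wide.columns = [f"Contract{n}" for n in wide.columns]

# The missing slots are NaN, so the columns are float64 at this point.
# The nullable Int64 dtype keeps whole numbers and shows gaps as <NA>.
wide = wide.astype("Int64")
print(wide)
```

The cast only succeeds when every non-missing value is a whole number, which holds for contract IDs like these.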

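【Comments】:

  • This is so clean!
  • @not_speshal I like yours and Pygirl's too! +1 from me on both!
  • As you said, a similar trick with groupby and cumcount :). I do think your answer should be the accepted one!
  • Thank you very much! This code works like a charm! I will keep studying it.
【Solution 2】:

Try groupby with pivot_table: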

df["group"] = df.groupby(["AccountID", "Last Name", "First Name"]).cumcount()+1
output = df.pivot_table(index=["AccountID", "Last Name", "First Name"], 
                        columns='group', 
                        values=['Address', 'City', "State"], 
                        aggfunc='first')
output = output.sort_index(axis=1, level=1)
output.columns = [f"{i}_{j}" for i, j in output.columns]
output = output.reset_index()
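
The snippet above pivots only Address, City, and State. A sketch of the same approach with Contract carried along as well, assuming that is wanted as in the question's desired output; the names `keys` and `values` are illustrative, and `assign` is used so the original frame is not modified.

```python
import pandas as pd

df = pd.DataFrame({
    "Contract": [622, 28974, 54122, 622, 28974, 54122],
    "AccountID": [1234, 1234, 4321, 2345, 2345, 2345],
    "Last Name": ["Pitt", "Pitt", "Ford", "Rhodes", "Rhodes", "Rhodes"],
    "First Name": ["Brad", "Brad", "Henry", "Dusty", "Dusty", "Dusty"],
    "Address": ["466 7th Ave", "1901 Vine Street", "93 Booth Dr",
                "1 Public Library Plaze", "1001 Kings Highway", "444 Amsterdamn Ave"],
    "City": ["Park Slope", "Philadelphia", "Nutley",
             "Stamford", "Cherry Hill", "Upper West Side"],
    "State": ["NY", "PA", "NJ", "CT", "NJ", "NY"],
})

keys = ["AccountID", "Last Name", "First Name"]
values = ["Contract", "Address", "City", "State"]

output = (
    df.assign(group=df.groupby(keys).cumcount() + 1)  # 1-based row number per person
      .pivot_table(index=keys, columns="group", values=values, aggfunc="first")
      .sort_index(axis=1, level=1)                    # keep each group's columns together
)
# Flatten the (field, group) column pairs into names such as "Contract_1".
output.columns = [f"{name}_{num}" for name, num in output.columns]
output = output.reset_index()
print(output)
```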

【Comments】:

    【Solution 3】:

    You can try using groupby with unstack:

    grp_col = ['AccountID', 'Last Name', 'First Name']
    df['num'] = df.groupby(grp_col).cumcount()+1
    res = df.set_index([*grp_col, 'num']).unstack('num').sort_index(axis=1, level=1).reset_index()
    res.columns = res.columns.map(lambda x: f"{x[0]}{x[1]}")
    

    res:

       AccountID  Last Name  First Name  Contract1  Address1                City1       State1  Contract2  Address2            City2         State2  Contract3  Address3            City3            State3
    0  1234       Pitt       Brad        622.0      466 7th Ave             Park Slope  NY      28974.0    1901 Vine Street    Philadelphia  PA      NaN        NaN                 NaN              NaN
    1  2345       Rhodes     Dusty       622.0      1 Public Library Plaze  Stamford    CT      28974.0    1001 Kings Highway  Cherry Hill   NJ      54122.0    444 Amsterdamn Ave  Upper West Side  NY
    2  4321       Ford       Henry       54122.0    93 Booth Dr             Nutley      NJ      NaN        NaN                 NaN           NaN     NaN        NaN                 NaN              NaN

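    【Comments】:

      【Solution 4】:

      We can pass a Series directly to pivot_table as the columns argument, with first as the aggfunc. Use groupby cumcount to enumerate the rows within each group; these numbers become the new column suffixes: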

      cols = ['AccountID', 'Last Name', 'First Name']
      dfp = (
          df.pivot_table(
              index=cols,
              columns=df.groupby(cols).cumcount() + 1,
              aggfunc='first'
          ).sort_index(axis=1, level=1, sort_remaining=False)
      )
      # Collapse Multi-Index
      dfp.columns = dfp.columns.map(lambda t: '_'.join(map(str, t)))
      dfp = dfp.reset_index()
      

      Or use set_index + unstack, with no groupby first needed, because cumcount guarantees uniqueness:

      cols = ['AccountID', 'Last Name', 'First Name']
      dfw = df.set_index(
          [*cols, df.groupby(cols).cumcount() + 1]
      ).unstack().sort_index(axis=1, level=1, sort_remaining=False)
      # Collapse Multi-Index
      dfw.columns = dfw.columns.map(lambda t: '_'.join(map(str, t)))
      dfw = dfw.reset_index()
      

      Either option produces:

         AccountID Last Name First Name               Address_1      City_1  Contract_1 State_1           Address_2        City_2  Contract_2 State_2           Address_3           City_3  Contract_3 State_3
      0       1234      Pitt       Brad             466 7th Ave  Park Slope       622.0      NY    1901 Vine Street  Philadelphia     28974.0      PA                 NaN              NaN         NaN     NaN
      1       2345    Rhodes      Dusty  1 Public Library Plaze    Stamford       622.0      CT  1001 Kings Highway   Cherry Hill     28974.0      NJ  444 Amsterdamn Ave  Upper West Side     54122.0      NY
      2       4321      Ford      Henry             93 Booth Dr      Nutley     54122.0      NJ                 NaN           NaN         NaN     NaN                 NaN              NaN         NaN     NaN
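
The remark about uniqueness can be checked directly. A minimal sketch on a cut-down frame: a constant index level stands in for "no counter", so the two rows of AccountID 1234 collide and `unstack` refuses to reshape, while the cumcount level makes every pair unique. The names `failed`, `num`, and `wide` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "AccountID": [1234, 1234, 4321],
    "State": ["NY", "PA", "NJ"],
})

# A constant second level gives (1234, 1) twice, so unstack cannot reshape.
constant = pd.Series(1, index=df.index)
try:
    df.set_index(["AccountID", constant]).unstack()
    failed = False
except ValueError as err:
    failed = True
    print("unstack failed:", err)

# With cumcount every (AccountID, counter) pair is unique, so unstack works.
num = df.groupby("AccountID").cumcount() + 1
wide = df.set_index(["AccountID", num]).unstack()
print(wide)
```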
      

      The pyjanitor module has an abstraction for this operation called pivot_wider, which hides the collapsing of the MultiIndex and the restoring of the index columns:

      # pip install pyjanitor
      # conda install pyjanitor -c conda-forge
      import janitor
      import pandas as pd
      
      
      cols = ['AccountID', 'Last Name', 'First Name']
      dfw = (
          df.add_column(
              'group', df.groupby(cols).cumcount() + 1
          ).pivot_wider(
              index=cols,
              names_from='group'
          )
      )
      
         AccountID Last Name First Name  Contract_1  Contract_2  Contract_3               Address_1           Address_2           Address_3      City_1        City_2           City_3 State_1 State_2 State_3
      0       1234      Pitt       Brad       622.0     28974.0         NaN             466 7th Ave    1901 Vine Street                 NaN  Park Slope  Philadelphia              NaN      NY      PA     NaN
      1       2345    Rhodes      Dusty       622.0     28974.0     54122.0  1 Public Library Plaze  1001 Kings Highway  444 Amsterdamn Ave    Stamford   Cherry Hill  Upper West Side      CT      NJ      NY
      2       4321      Ford      Henry     54122.0         NaN         NaN             93 Booth Dr                 NaN                 NaN      Nutley           NaN              NaN      NJ     NaN     NaN
      

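      There is also an abstraction for collapsing the MultiIndex, janitor.collapse_levels, which can be combined with pandas operations to get a cleaner look without giving up the flexibility offered by pivot_table and sort_index: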

      cols = ['AccountID', 'Last Name', 'First Name']
      dfp = (
          df.pivot_table(
              index=cols,
              columns=df.groupby(cols).cumcount() + 1,
              aggfunc='first'
          ).sort_index(
              axis=1, level=1, sort_remaining=False
          ).collapse_levels(sep='_').reset_index()
      )
      

      dfp:

         AccountID Last Name First Name               Address_1      City_1  Contract_1 State_1           Address_2        City_2  Contract_2 State_2           Address_3           City_3  Contract_3 State_3
      0       1234      Pitt       Brad             466 7th Ave  Park Slope       622.0      NY    1901 Vine Street  Philadelphia     28974.0      PA                 NaN              NaN         NaN     NaN
      1       2345    Rhodes      Dusty  1 Public Library Plaze    Stamford       622.0      CT  1001 Kings Highway   Cherry Hill     28974.0      NJ  444 Amsterdamn Ave  Upper West Side     54122.0      NY
      2       4321      Ford      Henry             93 Booth Dr      Nutley     54122.0      NJ                 NaN           NaN         NaN     NaN                 NaN              NaN         NaN     NaN
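
In this output the fields inside each numbered group appear alphabetically (Address, City, Contract, State), whereas the question asked for Contract first. One way to get a specific layout is to build the column list explicitly and reindex. The sketch below uses a cut-down stand-in for the wide frame; `id_cols`, `fields`, and `n_slots` are illustrative names.

```python
import pandas as pd

# Cut-down stand-in for a wide result: two numbered groups, two fields each.
dfp = pd.DataFrame({
    "AccountID": [1234, 4321],
    "Address_1": ["466 7th Ave", "93 Booth Dr"],
    "Contract_1": [622, 54122],
    "Address_2": ["1901 Vine Street", None],
    "Contract_2": [28974, None],
})

id_cols = ["AccountID"]
fields = ["Contract", "Address"]  # desired order inside each numbered group

# The highest numeric suffix tells us how many groups there are.
n_slots = max(int(c.rsplit("_", 1)[1]) for c in dfp.columns if c not in id_cols)

ordered = id_cols + [f"{field}_{n}" for n in range(1, n_slots + 1) for field in fields]
dfp = dfp[ordered]
print(dfp.columns.tolist())
```

Spelling out the order this way does not depend on how any particular pandas version sorts the column levels.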
      

      【Comments】:
