【Question title】: Fill missing data and transform rows to columns in Python Pandas
【Posted】: 2020-10-30 15:01:03
【Question】:

I have a DataFrame like this:

import numpy as np
import pandas as pd

# col1 holds the field labels (two are missing after every 'name'), col2 holds the values
df_nba = pd.DataFrame({'col1': ['name', np.nan, np.nan, 'course', 'eca', 'pages',
                                'name', np.nan, np.nan, 'course', 'pages',
                                'name', np.nan, np.nan, 'course', 'eca', 'pages',
                                'name', np.nan, np.nan, 'course', 'eca', 'pages',
                                'name', np.nan, np.nan, 'course', 'pages',
                                'name', np.nan, np.nan, 'course', 'eca', 'pages',
                               ],
                       'col2': ['jim', 'California', 'M', 'Biology', 'Biology Club', 1,
                                'jim', 'California', 'M', 'Physics', 2,
                                'greg', 'Arizona', 'M', 'Geography', 'Jazz Band', 3,
                                'greg', 'Arizona', 'M', 'Physics', 'Photography', 4,
                                'jesse', 'Washington', 'F', 'Economics', 5,
                                'jesse', 'Washington', 'F', 'Literature', 'Photography', 6,
                               ]})

    col1    col2
0   name    jim
1   NaN     California
2   NaN     M
3   course  Biology
4   eca     Biology Club
5   pages   1
6   name    jim
7   NaN     California
8   NaN     M
9   course  Physics
10  pages   2
11  name    greg
12  NaN     Arizona
13  NaN     M
14  course  Geography
15  eca     Jazz Band
16  pages   3
17  name    greg
18  NaN     Arizona
19  NaN     M
20  course  Physics
21  eca     Photography
22  pages   4
23  name    jesse
24  NaN     Washington
25  NaN     F
26  course  Economics
27  pages   5
28  name    jesse
29  NaN     Washington
30  NaN     F
31  course  Literature
32  eca     Photography
33  pages   6

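After each person's name row, the next two consecutive rows always have a missing label in col1. Can I first fill those labels with states and gender, and then transpose the data into a column-wise view?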

The expected output would look like this:

        name      states     gender   course           eca           pages
                                      
0       jim      California    M       Biology       Biology Club     1
1       jim      California    M       Physics       NaN              2
2       greg     Arizona       M       Geography     Jazz Band        3
3       greg     Arizona       M       Physics       Photography      4
4      jesse     Washington    F       Economics     NaN              5
5      jesse     Washington    F       Literature    Photography      6

【Comments on the question】:

  • Are exactly two fields always missing after every name? No other variations?

Tags: python pandas numpy dataframe


【Solution 1】:

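You can build a mask marking the rows where col1 equals 'name', then shift that mask to fill in the correct labels in col1. After that, reshape with unstack: call set_index with two levels, the cumsum of the mask (which increments at every 'name' row, so it gives one id per record) and col1 itself, then unstack col2.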

# mask marking the rows where col1 is 'name'
mask = df_nba['col1'].eq('name')

# label the two NaN rows that follow each 'name' row
df_nba.loc[mask.shift(1, fill_value=False), 'col1'] = 'states'
df_nba.loc[mask.shift(2, fill_value=False), 'col1'] = 'gender'

# reshape: one row per record, one column per label
df_ = (df_nba.set_index([mask.cumsum(),
                         df_nba['col1'].to_numpy()])
             ['col2'].unstack()
             .rename_axis(None)  # cosmetic: drop the index name
             [['name', 'states', 'gender', 'course', 'eca', 'pages']]  # reorder the columns
      )

print(df_)
    name      states gender      course           eca pages
1    jim  California      M     Biology  Biology Club     1
2    jim  California      M     Physics           NaN     2
3   greg     Arizona      M   Geography     Jazz Band     3
4   greg     Arizona      M     Physics   Photography     4
5  jesse  Washington      F   Economics           NaN     5
6  jesse  Washington      F  Literature   Photography     6
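
To see what the shifted masks and the cumulative sum actually contain, here is a minimal sketch on a toy Series. The Series labels and the variable names shifted1, shifted2 and record_id are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical toy labels: two records, each with two unlabeled rows after 'name'
labels = pd.Series(["name", None, None, "course", "name", None, None, "pages"])

# True only on the 'name' rows (missing labels compare as False)
mask = labels.eq("name")

# Shifting the mask down by 1 and 2 marks the first and second row after each 'name'
shifted1 = mask.shift(1, fill_value=False)
shifted2 = mask.shift(2, fill_value=False)

# The running count of 'name' rows gives every row the id of the record it belongs to
record_id = mask.cumsum()

print(shifted1.tolist())
print(shifted2.tolist())
print(record_id.tolist())
```

Passing fill_value=False keeps the shifted masks boolean, so they can be used directly as row selectors in loc.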

【Comments】:

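  • You can also do the reshaping step with pivot: df_nba.assign(group=mask.cumsum()).pivot(index="group", columns="col1", values="col2").
  • Hi @HenryYik and Ben. Thanks for your answers! Do you know how to avoid duplicates when using the pivot method? I re-ran both of your solutions and got the error ValueError: Index contains duplicate entries, cannot reshape. It seems that when I group the values, it does not take the index from pages into account. Ben's solution works fine on this test dataset, but when reshaping the DataFrame from my own file it raises the same error about duplicates.
  • Thank you very much! I finally found out why it wasn't working... some rows in my file were missing name, which caused the unstack function to fail. Thanks again, the code is perfect!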
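
The duplicate-entries error discussed in the comments above can be reproduced and diagnosed with a small sketch. It assumes a made-up frame df in which the second record has lost its name row, so both records end up with the same group id. The names df, tmp, dupes and pivot_error are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: the second record is missing its 'name' row
df = pd.DataFrame({
    "col1": ["name", "states", "gender", "course", "pages",
             "states", "gender", "course", "pages"],
    "col2": ["jim", "California", "M", "Biology", 1,
             "Arizona", "M", "Physics", 2],
})

# Same grouping idea as above: the group id increments at every 'name' row
tmp = df.assign(group=df["col1"].eq("name").cumsum())

# Rows whose (group, col1) pair occurs more than once are what break pivot/unstack
dupes = tmp[tmp.duplicated(["group", "col1"], keep=False)]
print(dupes)

# pivot refuses to reshape when an (index, column) pair is not unique
pivot_error = None
try:
    tmp.pivot(index="group", columns="col1", values="col2")
except ValueError as err:
    pivot_error = err
    print("pivot failed:", err)
```

On a real file, inspecting dupes shows which labels repeat inside a group, which points to the record whose name row is missing.
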
【Solution 2】:

This is not an efficient solution, but it does what you want, provided you supply col1 and col2 as lists:

# col1 and col2 as plain Python lists
col1 = df_nba['col1'].tolist()
col2 = df_nba['col2'].tolist()

# fill the missing labels in col1
for i in range(1, len(col1)):
    if col1[i-1] == "name":
        col1[i] = "states"
    elif col1[i-1] == "states":
        col1[i] = "gender"

# build one dictionary per record; a 'pages' row closes the record
data = []
temp = {}
for i in range(len(col1)):
    temp[col1[i]] = col2[i]
    if col1[i] == "pages":
        data.append(temp)
        temp = {}

pd.DataFrame(data)

【Comments】:

【Solution 3】:

You can do the following:

    name_index = df_nba.loc[df_nba['col1']=='name'].index
    for i in name_index:
        df_nba.loc[i+1:i+2, 'col1'] = ['states', 'gender']
    

Now build the transposed table:

    pivot = df_nba.pivot(columns = 'col1')
    pivot_nba = pd.DataFrame()
    for col in pivot['col2']:
        pivot_nba[col] = pivot['col2'][col].dropna().reset_index(drop = True)
    pivot_nba
    
        course        eca               gender  name    pages   states
    0   Biology       Biology Club      M       jim     1       California
    1   Physics       Jazz Band         M       jim     2       California
    2   Geography     Photography       M       greg    3       Arizona
    3   Physics       Photography       M       greg    4       Arizona
    4   Economics     NaN               F       jesse   5       Washington
    5   Literature    NaN               F       jesse   6       Washington
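
Note that in the output above the eca values no longer line up with their records: row 1 (jim, Physics) shows Jazz Band, although that record has no eca in the question's data. This happens because each column is compacted on its own with dropna and reset_index, so a record without eca pulls the later values up. A possible variant that keeps the alignment is sketched below. It assumes the labels in col1 are already filled in and uses a shortened, made-up copy of the data. The names df, record_id, wide and aligned are illustrative and not part of the original answer.

```python
import pandas as pd

# Shortened hypothetical copy of the question's data, labels already filled in
df = pd.DataFrame({
    "col1": ["name", "states", "gender", "course", "eca", "pages",
             "name", "states", "gender", "course", "pages",
             "name", "states", "gender", "course", "eca", "pages"],
    "col2": ["jim", "California", "M", "Biology", "Biology Club", 1,
             "jim", "California", "M", "Physics", 2,
             "greg", "Arizona", "M", "Geography", "Jazz Band", 3],
})

# One id per record: the counter increments at every 'name' row
record_id = df["col1"].eq("name").cumsum()

# Wide frame with one column per label; each original row fills exactly one cell
wide = df.pivot(columns="col1", values="col2")

# Collapse the rows of each record, taking the first non-missing value per column
aligned = wide.groupby(record_id).first()
print(aligned[["name", "states", "gender", "course", "eca", "pages"]])
```

Because the rows are collapsed per record rather than per column, a record without eca simply keeps a missing value in that column.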
    

【Comments】:
