【Question Title】:How to replace certain values in a pandas column with the mean column value of similar rows?
【Posted】:2018-06-17 02:37:11
【Question】:

The problem

I currently have a pandas dataframe containing property information from this kaggle dataset. Below is a sample dataframe from that set:

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Annadale      | 5       | 5425  | 2015       | ... |
| Woodside      | 4       | 2327  | 1966       | ... |
| Alphabet City | 1       | 396   | 1985       | ... |
| Alphabet City | 1       | 405   | 1996       | ... |
| Alphabet City | 1       | 396   | 1986       | ... |
| Alphabet City | 1       | 396   | 1992       | ... |
| Alphabet City | 1       | 396   | 0          | ... |
| Alphabet City | 1       | 396   | 1990       | ... |
| Alphabet City | 1       | 396   | 1984       | ... |
| Alphabet City | 1       | 396   | 0          | ... |

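What I want to do is take every row where the value in the "year built" column equals zero, and replace the "year built" value in those rows with the mean of the "year built" values in the rows that share the same neighborhood, borough, and block. In some cases, several rows within one {neighborhood, borough, block} set have zero in the "year built" column. This is shown in the sample dataframe above.

To illustrate the problem, here are those two rows from the sample dataframe.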

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1       | 396   | 0          | ... |
| Alphabet City | 1       | 396   | 0          | ... |

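To solve this, I want to fill the "year built" value in each row that has a zero with the mean of the "year built" values of all other rows that have the same neighborhood, borough, and block. For the example rows, the neighborhood is Alphabet City, the borough is 1, and the block is 396, so I would use the following matching rows from the sample dataframe to compute the mean: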

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1       | 396   | 1985       | ... |
| Alphabet City | 1       | 396   | 1986       | ... |
| Alphabet City | 1       | 396   | 1992       | ... |
| Alphabet City | 1       | 396   | 1990       | ... |
| Alphabet City | 1       | 396   | 1984       | ... |

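I would take the mean of the "year built" column over those rows (which is 1987.4) and replace the zeros with that mean. The rows that originally had zeros would end up looking like this: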

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1       | 396   | 1987.4     | ... |
| Alphabet City | 1       | 396   | 1987.4     | ... |
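
As a quick check of the arithmetic above, here is a minimal sketch that recomputes the fill value from the five matching rows. The variable names are made up for illustration; the median is printed only for comparison, since it gives a different number.

```python
import pandas as pd

# "year built" values of the non-zero rows for Alphabet City / borough 1 / block 396
known_years = pd.Series([1985, 1986, 1992, 1990, 1984], name="year built")

group_mean = known_years.mean()      # (1985 + 1986 + 1992 + 1990 + 1984) / 5
group_median = known_years.median()  # middle value once the years are sorted

print(group_mean)    # 1987.4
print(group_median)  # 1986.0
```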

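My current code

So far, all I have done is drop the rows with zero in the "year built" column and find the mean year for each {neighborhood, borough, block} set. The original dataframe is stored in raw_data, and it looks like the sample dataframe at the top of this post. The code looks like this: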

# create a copy of the data
temp_data = raw_data.copy()

# remove all rows with zero in the "year built" column
mean_year_by_location = temp_data[temp_data["YEAR BUILT"] > 0]

# group the rows into {neighborhood, borough, block} sets and take the mean of the "year built" column in those sets
mean_year_by_location = mean_year_by_location.groupby(["NEIGHBORHOOD","BOROUGH","BLOCK"], as_index = False)["YEAR BUILT"].mean()

The output looks like this:

| neighborhood  | borough | block | year built | 
------------------------------------------------
| ....          | ...     | ...   | ...        |
| Alphabet City | 1       | 390   | 1985.342   | 
| Alphabet City | 1       | 391   | 1986.76    | 
| Alphabet City | 1       | 392   | 1992.8473  | 
| Alphabet City | 1       | 393   | 1990.096   | 
| Alphabet City | 1       | 394   | 1984.45    | 

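So how can I take these mean "year built" values from the mean_year_by_location dataframe and use them to replace the zeros in the original raw_data dataframe?

I apologize for the lengthy post. I just wanted to be clear.

【Comments】:

    Tags: python pandas dataframe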


    【Solution 1】:

    Use set_index + replace, then fillna with the group mean.

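    import numpy as np

    v = df.set_index(
        ['neighborhood', 'borough', 'block']
    )['year built'].replace(0, np.nan)
    
    # Series.mean(level=...) was removed in pandas 2.0; group on the index levels instead
    df = v.fillna(v.groupby(level=[0, 1, 2], sort=False).mean()).reset_index()
    df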
    
        neighborhood  borough  block  year built
    0       Annadale        5   5425      2015.0
    1       Woodside        4   2327      1966.0
    2  Alphabet City        1    396      1985.0
    3  Alphabet City        1    405      1996.0
    4  Alphabet City        1    396      1986.0
    5  Alphabet City        1    396      1992.0
    6  Alphabet City        1    396      1987.4
    7  Alphabet City        1    396      1990.0
    8  Alphabet City        1    396      1984.0
    9  Alphabet City        1    396      1987.4
    

    Details

    First, set the index and replace 0 with NaN, so that the mean computed next is not affected by those values:

    v = df.set_index(
        ['neighborhood', 'borough', 'block']
    )['year built'].replace(0, np.nan)   
    
    v 
    
    neighborhood   borough  block
    Annadale       5        5425     2015.0
    Woodside       4        2327     1966.0
    Alphabet City  1        396      1985.0
                            405      1996.0
                            396      1986.0
                            396      1992.0
                            396         NaN
                            396      1990.0
                            396      1984.0
                            396         NaN
    Name: year built, dtype: float64
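
To see why the zeros have to become NaN first, here is a small sketch using the block 396 values from the sample. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Block 396 values from the sample, including the two zero placeholders
years = pd.Series([1985, 1986, 1992, 0, 1990, 1984, 0])

mean_with_zeros = years.mean()                        # zeros drag the mean down
mean_without_zeros = years.replace(0, np.nan).mean()  # NaN is skipped by default

print(round(mean_with_zeros, 2))  # 1419.57
print(mean_without_zeros)         # 1987.4
```

Leaving the zeros in would give a "year built" of roughly 1420, which is clearly not what the question wants.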
    

    Next, compute the mean per group:

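    # Series.mean(level=...) was removed in pandas 2.0; sort=False keeps first-seen group order
    m = v.groupby(level=[0, 1, 2], sort=False).mean()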
    m
    
    neighborhood   borough  block
    Annadale       5        5425     2015.0
    Woodside       4        2327     1966.0
    Alphabet City  1        396      1987.4
                            405      1996.0
    Name: year built, dtype: float64
    

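    This serves as a mapping, which we pass to fillna. fillna aligns the mapping on the index and replaces each NaN introduced earlier with the mean of its group. Once that is done, just reset the index to restore the original structure.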

    v.fillna(m).reset_index()
    
        neighborhood  borough  block  year built
    0       Annadale        5   5425      2015.0
    1       Woodside        4   2327      1966.0
    2  Alphabet City        1    396      1985.0
    3  Alphabet City        1    405      1996.0
    4  Alphabet City        1    396      1986.0
    5  Alphabet City        1    396      1992.0
    6  Alphabet City        1    396      1987.4
    7  Alphabet City        1    396      1990.0
    8  Alphabet City        1    396      1984.0
    9  Alphabet City        1    396      1987.4
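
For anyone who wants to run the steps end to end, here is a self-contained sketch. It rebuilds the sample rows by hand (the real data would come from the Kaggle file) and assumes a recent pandas release, so it groups on the index levels rather than calling mean with a level argument. The names keys and filled are illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question's table
df = pd.DataFrame({
    "neighborhood": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

keys = ["neighborhood", "borough", "block"]

# Index by the grouping keys and turn the zero placeholders into NaN
v = df.set_index(keys)["year built"].replace(0, np.nan)

# Per-group mean; NaN entries are skipped
m = v.groupby(level=keys, sort=False).mean()

# fillna looks up each NaN's index label in m, then the keys become columns again
filled = v.fillna(m).reset_index()
print(filled)
```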
    

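    【Comments】:

    • @coldspeed, shouldn't the means at index 6 and index 9 be different? Once the mean has been placed in row 6, we would need to recompute the mean for row 9, because the value in row 6 has changed by the time we iterate to row 9.
    • @Anil_M If you read OP's question carefully, you'll see that is not what they asked for. They only want to fill the NaNs with the mean of that group.
    【Solution 2】: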

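    I would use mask inside groupby.apply. I do it this way only because I like how it flows. I make no claim that it is particularly fast. Still, this answer may offer some perspective on possible alternatives.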

    gidx = ['neighborhood', 'borough', 'block']
    
    def fill_with_mask(s):
        mean = s.loc[lambda x: x != 0].mean()
        return s.mask(s.eq(0), mean)
    
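    # group_keys=False keeps the original row index (newer pandas would otherwise prepend the group keys)
    df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)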
    
    0    2015.0
    1    1966.0
    2    1985.0
    3    1996.0
    4    1986.0
    5    1992.0
    6    1987.4
    7    1990.0
    8    1984.0
    9    1987.4
    Name: year built, dtype: float64
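
As a possible alternative to apply, and not the approach this answer itself uses, groupby.transform can broadcast each group's mean back onto the original rows. The sketch below rebuilds the sample data by hand; years, group_mean and filled are illustrative names.

```python
import pandas as pd

# Sample rows copied from the question's table
df = pd.DataFrame({
    "neighborhood": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})
gidx = ["neighborhood", "borough", "block"]

# mask with no replacement value inserts NaN, so zeros are ignored by the mean
years = df["year built"].mask(df["year built"].eq(0))

# transform("mean") returns one value per original row: the mean of that row's group
group_mean = years.groupby([df[k] for k in gidx]).transform("mean")

# Fill only the missing entries and return a new dataframe
filled = df.assign(**{"year built": years.fillna(group_mean)})
print(filled)
```

Because transform keeps the original row index, the result lines up with df without any index handling.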
    

    We can then use pd.DataFrame.assign to create a copy of the dataframe with the filled column:

    df.assign(**{'year built': df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)})
    
        neighborhood  borough  block  year built
    0       Annadale        5   5425      2015.0
    1       Woodside        4   2327      1966.0
    2  Alphabet City        1    396      1985.0
    3  Alphabet City        1    405      1996.0
    4  Alphabet City        1    396      1986.0
    5  Alphabet City        1    396      1992.0
    6  Alphabet City        1    396      1987.4
    7  Alphabet City        1    396      1990.0
    8  Alphabet City        1    396      1984.0
    9  Alphabet City        1    396      1987.4
    

    The same task can be done in place with a column assignment:

    df['year built'] = df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask)
    

    Or

    df.update(df.groupby(gidx, group_keys=False)['year built'].apply(fill_with_mask))
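
The difference between the copying and in-place variants can be seen in a tiny sketch. The three-row dataframe and the replacement values in filled are hypothetical stand-ins for the groupby result.

```python
import pandas as pd

# Hypothetical three-row frame with one zero placeholder
df = pd.DataFrame({"block": [396, 396, 396], "year built": [1985, 0, 1989]})

# Hypothetical replacement values, standing in for the groupby result
filled = pd.Series([1985.0, 1987.0, 1989.0], name="year built")

# assign returns a new dataframe and leaves df untouched
copied = df.assign(**{"year built": filled})
print(df["year built"].tolist())      # [1985, 0, 1989]
print(copied["year built"].tolist())  # [1985.0, 1987.0, 1989.0]

# Column assignment modifies df itself
df["year built"] = filled
print(df["year built"].tolist())      # [1985.0, 1987.0, 1989.0]
```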
    
