【Posted】: 2018-06-17 02:37:11
【Question】:
Problem
I currently have a pandas DataFrame containing property information from this Kaggle dataset. Here is a sample of that DataFrame:
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Annadale | 5 | 5425 | 2015 | ... |
| Woodside | 4 | 2327 | 1966 | ... |
| Alphabet City | 1 | 396 | 1985 | ... |
| Alphabet City | 1 | 405 | 1996 | ... |
| Alphabet City | 1 | 396 | 1986 | ... |
| Alphabet City | 1 | 396 | 1992 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
| Alphabet City | 1 | 396 | 1990 | ... |
| Alphabet City | 1 | 396 | 1984 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
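What I want to do is take every row where the "year built" value is zero and replace that value with the mean of the "year built" values from the rows in the same neighborhood, borough, and block. In some cases, several rows in a {neighborhood, borough, block} set have zero in the "year built" column, as the sample DataFrame above shows.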
To illustrate the problem, consider these two rows from the sample DataFrame:
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 0 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
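To fix this, I want to fill the "year built" value in the rows that contain zero with the mean "year built" value of all other rows that share the same neighborhood, borough, and block. For the example rows, the neighborhood is Alphabet City, the borough is 1, and the block is 396, so I would compute the mean from the following matching rows of the sample DataFrame: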
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 1985 | ... |
| Alphabet City | 1 | 396 | 1986 | ... |
| Alphabet City | 1 | 396 | 1992 | ... |
| Alphabet City | 1 | 396 | 1990 | ... |
| Alphabet City | 1 | 396 | 1984 | ... |
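I would take the mean of the "year built" column over these rows (that is, 1987.4) and replace the zeros with it. The rows that originally contained zeros would end up looking like this: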
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 1987.4 | ... |
| Alphabet City | 1 | 396 | 1987.4 | ... |
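For reference, the desired result can be reproduced on the sample rows with a rough sketch like the one below. This is only one possible approach, not necessarily the best one: it assumes the column names match the uppercase ones used in my code further down, it rebuilds the sample table by hand, and it treats zero as "missing" before averaging. The names keys and filled are just placeholders for this illustration.

```python
import pandas as pd

# Sample rows copied from the table at the top of the post (other columns omitted).
raw_data = pd.DataFrame({
    "NEIGHBORHOOD": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "BOROUGH": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "BLOCK": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "YEAR BUILT": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

keys = ["NEIGHBORHOOD", "BOROUGH", "BLOCK"]  # placeholder name for the grouping columns

filled = raw_data.copy()

# Turn zeros into NaN so they are ignored when the group mean is computed.
filled["YEAR BUILT"] = filled["YEAR BUILT"].where(filled["YEAR BUILT"] > 0)

# transform("mean") returns one value per original row (the mean of its group),
# so it lines up with the column and can be used directly to fill the gaps.
filled["YEAR BUILT"] = filled["YEAR BUILT"].fillna(
    filled.groupby(keys)["YEAR BUILT"].transform("mean")
)

print(filled)
```

With this sketch, a group whose rows are all zero has no valid years to average, so those rows stay NaN instead of receiving a value.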
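My current code
So far, all I have done is drop the rows with zero in the "year built" column and compute the mean year for each {neighborhood, borough, block} set. The original DataFrame is stored in raw_data and looks like the sample DataFrame at the top of this post. The code is as follows: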
# create a copy of the data
temp_data = raw_data.copy()
# remove all rows with zero in the "year built" column
mean_year_by_location = temp_data[temp_data["YEAR BUILT"] > 0]
# group the rows into {neighborhood, borough, block} sets and take the mean of the "year built" column in those sets
mean_year_by_location = mean_year_by_location.groupby(["NEIGHBORHOOD","BOROUGH","BLOCK"], as_index = False)["YEAR BUILT"].mean()
The output looks like this:
| neighborhood | borough | block | year built |
------------------------------------------------
| .... | ... | ... | ... |
| Alphabet City | 1 | 390 | 1985.342 |
| Alphabet City | 1 | 391 | 1986.76 |
| Alphabet City | 1 | 392 | 1992.8473 |
| Alphabet City | 1 | 393 | 1990.096 |
| Alphabet City | 1 | 394 | 1984.45 |
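So how can I take these mean "year built" values from the mean_year_by_location DataFrame and use them to replace the zeros in the original raw_data DataFrame?
I apologize for the long post. I just wanted to be clear.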
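To make the question concrete, here is a rough, untuned sketch of the kind of thing I have in mind: join mean_year_by_location back onto raw_data and pick the group mean wherever the original value is zero. The small raw_data below is rebuilt by hand from the sample table, and the helper column name "MEAN YEAR BUILT" and the variables keys, merged, and result are made up for this illustration.

```python
import pandas as pd

# Hand-built stand-in for raw_data, using the sample rows from the post.
raw_data = pd.DataFrame({
    "NEIGHBORHOOD": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "BOROUGH": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "BLOCK": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "YEAR BUILT": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

keys = ["NEIGHBORHOOD", "BOROUGH", "BLOCK"]

# Same computation as in the post, with the result column renamed
# (hypothetical name) so it does not clash with "YEAR BUILT" after the merge.
mean_year_by_location = (
    raw_data[raw_data["YEAR BUILT"] > 0]
    .groupby(keys, as_index=False)["YEAR BUILT"]
    .mean()
    .rename(columns={"YEAR BUILT": "MEAN YEAR BUILT"})
)

# A left merge keeps every row of raw_data and attaches its group's mean.
merged = raw_data.merge(mean_year_by_location, on=keys, how="left")

# Keep the original year where it is nonzero, otherwise use the group mean.
merged["YEAR BUILT"] = merged["YEAR BUILT"].where(
    merged["YEAR BUILT"] > 0, merged["MEAN YEAR BUILT"]
)

result = merged.drop(columns="MEAN YEAR BUILT")
print(result)
```

Note that the merge returns a fresh 0..n-1 index, so if raw_data has a meaningful index it would need to be restored afterwards. Groups that contain only zeros have no entry in mean_year_by_location and end up as NaN.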
【Comments】: