【Question title】: Assignment with both fillna() and loc() apparently not working
【Posted】: 2020-05-15 03:47:17
【Question】:

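I've looked everywhere for an answer but couldn't find one.

My goal: I am trying to fill some missing values in a DataFrame, using supervised learning to decide how to fill them.

My code looks like this. Note: this first part is not important, it only provides context.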

import pandas as pd
from sklearn import neighbors

train_df = df[df['my_column'].notna()]     # train the model only on rows without missing data
train_x = train_df[['lat','long']]         # lat and long are the inputs
train_y = train_df[['my_column']]          # my_column is the output
clf = neighbors.KNeighborsClassifier(2)
clf.fit(train_x,train_y)                   # clf is the classifier; here we train it
df_x = df[['lat','long']]                  # I need this part to do the prediction
prediction = clf.predict(df_x)             # clf.predict() returns an array
series_pred = pd.Series(prediction)        # now the array is a Series
print(series_pred.shape)                   # prints (2381,)
print(series_pred.isna().sum())            # prints 0

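So far, so good. I have my 2381 predictions (I only need a few of them) and none of them is NaN. (Why would there be NaN values in the predictions? I just wanted to be sure, because I don't understand my mistake.)

Here I try to assign the predictions to my DataFrame: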

#test_1
df.loc[df['my_column'].isna(), 'my_column'] = series_pred  # I assign the predictions using .loc
#test_2
df['my_column'] = df['my_column'].fillna(series_pred)      # double check: I assign the predictions using .fillna()
print(df['my_column'].shape)                       # prints (2381,)
print(df['my_column'].isna().sum())                # prints 6

As you can see, it didn't work: there are still 6 missing values. I then tried, more or less at random, a slightly different approach:

#test_3
df[['my_column']] = df[['my_column']].fillna(series_pred)     # will it work?
print(df[['my_column']].shape)                        # prints (2381, 1)
print(df[['my_column']].isna().sum())                 # prints 6

That didn't work either. I decided to try one last thing: checking the result of fillna before even assigning it back to the original df:

In[42]:
print(df['my_column'].fillna(series_pred).isna().sum())  # extreme test
Out[42]:
6

So... where is my very, very silly mistake? Thank you very much.

Edit 1

To show a bit of the data:

In[1]:
df.head()
Out[1]:
      my_column      lat    long
 id                                                     
9df   Wil            51     5
4f3   Fabio          47     9
x32   Fabio          47     8   
z6f   Fabio          47     9  
a6f   Giovanni       47     7

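I have also added information at the beginning of the question.

【Comments】:

Hi Federico, could you post a sample of the data you are working with? Perhaps the output of the table as well.
Does the index (row index) of series_pred match that of df?
Also, shouldn't it be df.loc[df['my_column'].isna(), 'my_column'] = series_pred[df['my_column'].isna()]? And what is the difference between df and df_x?
I would reset the index so that they match: series_pred.index = df.index. I suspect that fillna and the like match on index rather than on position.
@Dan is right: when a Series is passed to fillna, it is index-aligned. If you are sure about the size of the data, then df.loc[df['my_column'].isna(), 'my_column'] = prediction should do it, with no need to create a Series.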

Tags: python pandas numpy supervised-learning fillna

【Answer 1】:

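@Ben.T or @Dan should post their own answers, and those should be accepted as the correct ones.

Following their hints, I would say there are two solutions:

Solution 1 (best): use loc()

The problem

The problem with the current approach is that df.loc[df['my_column'].isna(), 'my_column'] expects to receive X values, where X is the number of missing values. My variable prediction actually holds predictions for both the missing and the non-missing rows.

The solution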

pred_df = df[df['my_column'].isna()]        # predict only on the rows with missing values
df_x = pred_df[['lat','long']]
prediction = clf.predict(df_x)              # one prediction per missing row
df.loc[df['my_column'].isna(), 'my_column'] = prediction
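
For reference, here is a minimal, self-contained sketch of Solution 1 on toy data. The rows, the extra id b7c, and the choice of n_neighbors=1 are assumptions made purely for illustration; they are not taken from the original data set.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy data (assumed for illustration): two rows have no label.
df = pd.DataFrame(
    {
        "my_column": ["Wil", "Fabio", np.nan, "Fabio", "Giovanni", np.nan],
        "lat": [51, 47, 46, 47, 47, 50],
        "long": [5, 9, 9, 9, 7, 5],
    },
    index=["9df", "4f3", "x32", "z6f", "a6f", "b7c"],
)

missing = df["my_column"].isna()

# Train only on the rows that already have a label.
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(df.loc[~missing, ["lat", "long"]], df.loc[~missing, "my_column"])

# Predict only for the rows that need a label, so the array has
# exactly as many elements as there are missing values.
prediction = clf.predict(df.loc[missing, ["lat", "long"]])

# A NumPy array on the right-hand side is assigned by position.
df.loc[missing, "my_column"] = prediction
print(df["my_column"].isna().sum())
```

Because prediction is a plain array with one element per missing row, there is no index to align and the values land in row order.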

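Solution 2: use fillna()

The problem

The problem with the current approach is that df['my_column'].fillna(series_pred) requires the index of df to be the same as that of series_pred, which is not possible in this case unless df has a simple index such as [0, 1, 2, 3, 4...].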
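
The index alignment described above can be reproduced on a tiny frame. This is only a sketch: the four rows and the prediction values are made up for illustration.

```python
import numpy as np
import pandas as pd

# df is indexed by string ids, as in the question.
df = pd.DataFrame(
    {"my_column": ["Wil", np.nan, "Fabio", np.nan]},
    index=["9df", "4f3", "x32", "z6f"],
)
prediction = np.array(["Wil", "Fabio", "Fabio", "Giovanni"])  # made-up values

# pd.Series(array) gets a default integer index 0, 1, 2, 3.
series_pred = pd.Series(prediction)

# fillna looks up each missing row by label; "4f3" and "z6f" do not
# exist in series_pred, so nothing is filled.
filled = df["my_column"].fillna(series_pred)
print(filled.isna().sum())

# Assigning a Series through .loc aligns on labels in the same way.
tmp = df.copy()
tmp.loc[tmp["my_column"].isna(), "my_column"] = series_pred
print(tmp["my_column"].isna().sum())
```

Both counts stay at the original number of missing values, which matches the behaviour reported in the question.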

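The solution

Reset the index of df at the very beginning of the code.

Why this is not the best

The cleanest approach is to predict only where it is needed. That is easy to do with loc(); I don't know how to do it with fillna(), because the index would have to be preserved through the classification.

Edit: series_pred.index = df['my_column'].isna().index Thanks @Dan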

【Comments】:

It is not impossible: you just need to do series_pred.index = df['my_column'].isna().index and then fillna will work. Personally, though, I would also stick with the slicing solution.
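
A short sketch of the two fixes discussed in this thread, again on made-up data. It assumes prediction holds one value per row of df, in the same row order, as in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"my_column": ["Wil", np.nan, "Fabio", np.nan]},
    index=["9df", "4f3", "x32", "z6f"],
)
prediction = np.array(["Wil", "Fabio", "Fabio", "Giovanni"])  # one per row of df

# Option 1: give the predictions the same index as df, so that
# fillna can align them label by label.
series_pred = pd.Series(prediction, index=df.index)
filled = df["my_column"].fillna(series_pred)

# Option 2: skip the Series and pick the needed predictions by
# position with a boolean NumPy mask.
missing = df["my_column"].isna()
df.loc[missing, "my_column"] = prediction[missing.to_numpy()]

print(filled.equals(df["my_column"]))
```

Passing index=df.index at construction has the same effect as setting series_pred.index afterwards; either way, it is only valid when the predictions are in the same row order as df.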