I have a ValueError while using Jupyter Notebook and need help to find out why I get this error and how to fix it

【Question title】: I have a ValueError while using Jupyter Notebook and need help to find out why I get this error and how to fix it
【Posted】: 2018-04-04 16:50:32
【Question】:

When I run this code:

from sklearn.tree import DecisionTreeRegressor 
melbourne_model = DecisionTreeRegressor() 
melbourne_model.fit(X, y)

I get this output:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

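The error points to the line melbourne_model.fit(X, y).
I want the code to fit the model to X and y so that I can make future predictions for houses in Melbourne based on variables I enter (for example year built, land size, rooms/bedrooms, location, and so on). Right now I can't do that because of this error.

I think this is because X and y are not NumPy arrays, and when I use np.asarray() and pass in what I want to convert to a NumPy array, it doesn't work. I know this because when I write type(X) or type(y), I get pandas.core.series.Series.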

The entire code of my file:

import pandas as pd
import numpy as np
melbourne_file_path = 'melb_data.csv\\melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
np.asarray(melbourne_data.Price)
y = melbourne_data.Price
melbourne_predictors = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                    'YearBuilt', 'Lattitude', 'Longtitude']
np.asarray(melbourne_data[melbourne_predictors])
X = melbourne_data[melbourne_predictors]
from sklearn.tree import DecisionTreeRegressor
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(X, y)

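I am using Jupyter Notebook as part of Anaconda.

The CSV file I am using can be downloaded here. After downloading the folder you need to extract it, and the CSV is inside that folder. You can set your own melbourne_file_path depending on where the file is located.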

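【Comments】:

The error is clear. Your dataset contains float('inf') or np.nan. It is most likely np.nan, though. Check melbourne_data.isnull().values.any()
The output I get is True, so how do I deal with this missing data?
Also, if I do have float('inf'), what does that mean and how do I fix it?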
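
The check from the comments can be sketched on a small made-up DataFrame instead of melb_data.csv. The frame `toy` and its values are hypothetical and only serve as an illustration. The sketch counts missing values per column and looks for infinite values separately, since isnull() does not flag infinity. Turning infinities into NaN is one possible way to handle them, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for melbourne_data; the values are made up.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, np.inf],
    "YearBuilt": [1900.0, np.nan, 2014.0, 1998.0],
})

# True if at least one cell is NaN/None.
has_missing = toy.isnull().values.any()

# Number of missing cells in each column.
missing_per_column = toy.isnull().sum()

# isnull() does not treat infinity as missing, so count it separately.
inf_per_column = toy.isin([np.inf, -np.inf]).sum()

# One option: turn infinities into NaN so they can be handled
# together with the other missing values.
cleaned = toy.replace([np.inf, -np.inf], np.nan)

print(missing_per_column)
print(inf_per_column)
```

An infinite value usually comes from an earlier computation such as a division by zero, so it is worth checking where it was produced before replacing it.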

Tags: python python-3.x pandas machine-learning data-science

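【Solution 1】:

The error you are getting is clear: Input contains NaN, infinity or a value too large. The problem is not that your input is a pandas Series, but that your data has missing values! For example, a quick look at your CSV on Kaggle shows that rows 15 and 16 are missing a lot of fields.

How you handle these missing values is up to you. One approach is to drop every row that is missing one or more values: df.dropna(inplace=True). This should let the DecisionTreeRegressor fit without error, but it may bias your results if too many rows are dropped. A possibly better approach is to fill the missing values with the column means: df.fillna(df.mean()).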
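
The two options can be compared side by side in a minimal sketch. The DataFrame `toy` and its values are made up for illustration and are not taken from melb_data.csv. Passing `numeric_only=True` is an added precaution for frames that also contain text columns; the answer itself calls df.mean() without it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the predictors; the values are made up.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [200.0, np.nan, 400.0, 600.0],
    "YearBuilt": [1900.0, 2000.0, np.nan, 2010.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill each missing value with the mean of its own column.
filled = toy.fillna(toy.mean(numeric_only=True))

print(len(toy), len(dropped))        # rows before and after dropping
print(filled.isnull().values.any())  # whether any NaN is left
```

In the question, X and y are both selected from melbourne_data, so cleaning melbourne_data before selecting them keeps the rows of X and y aligned.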

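【Comments】:

I used the second command you told me to use, but I get an error saying TypeError: stat_func() missing 1 required positional argument: 'self'.
Not sure if this is related, but your lines np.asarray(melbourne_data.Price) and np.asarray(melbourne_data[melbourne_predictors]) do nothing, because you never assign their output to a variable. The error you are getting suggests that you are calling the method on a class name without the instantiating parentheses (ClassName.mean() instead of ClassName().mean()). Make sure you call .mean() not on df, but on whatever your DataFrame is actually named.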
I have written it as df.fillna(melbourne_data.mean()) and get TypeError: super(type, obj): obj must be an instance or subtype of type as the error.
@HD What is df in your code? You have to run melbourne_data.fillna(melbourne_data.mean()).
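
The point about unassigned results can be shown with a small sketch. The Series `prices` and its values are hypothetical. Both np.asarray and fillna return a new object and leave the original untouched unless the result is assigned.

```python
import numpy as np
import pandas as pd

prices = pd.Series([1.0, np.nan, 3.0])  # hypothetical values

# The return value is discarded here, so `prices` is still a Series.
np.asarray(prices)
still_series = type(prices)

# Assigning the result is what actually yields a NumPy array.
prices_array = np.asarray(prices)

# fillna also returns a new object; without assignment nothing changes.
prices.fillna(prices.mean())
unchanged_missing = prices.isnull().sum()

# Assigning the result back replaces the NaN with the mean.
prices = prices.fillna(prices.mean())

print(still_series, type(prices_array))
print(unchanged_missing, prices.isnull().sum())
```

Note that converting to a NumPy array does not remove missing values, so the conversion alone would not make the ValueError go away.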