How to resolve: ValueError: Input contains NaN, infinity or a value too large for dtype('float32')? (Answers)

【Question title】: How to resolve: ValueError: Input contains NaN, infinity or a value too large for dtype('float32')?
【Posted】: 2022-01-21 00:28:55
【Question description】:

from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.metrics import fbeta_score, make_scorer
import keras.backend as K
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, ClassifierMixin
import pandas as pd

class CustomThreshold(BaseEstimator, ClassifierMixin):
    """ Custom threshold wrapper for binary classification"""
    def __init__(self, base, threshold=0.5):
        self.base = base
        self.threshold = threshold
    def fit(self, *args, **kwargs):
        self.base.fit(*args, **kwargs)
        return self
    def predict(self, X):
        return (self.base.predict_proba(X)[:, 1] > self.threshold).astype(int)

dataset_clinical = np.genfromtxt("/content/drive/MyDrive/Colab Notebooks/BreastCancer-master/Data/stacked_metadata.csv",delimiter=",")
X = dataset_clinical[:,0:450]
Y = dataset_clinical[:,450]
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=1)
rf = RandomForestClassifier(n_estimators=10).fit(X,Y) 
clf = [CustomThreshold(rf, threshold) for threshold in [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]]

for model in clf:
    print(confusion_matrix(y_test, model.predict(X_test)))
for model in clf:
    print(confusion_matrix(Y, model.predict(X)))

The traceback is shown below:

Traceback (most recent call last):
  File "RF.py", line 33, in <module>
    rf = RandomForestClassifier(n_estimators=10).fit(X,Y)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/ensemble/_forest.py", line 328, in fit
    X, y, multi_output=True, accept_sparse="csc", dtype=DTYPE
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 576, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 968, in check_X_y
    estimator=estimator,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 792, in check_array
    _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 116, in _assert_all_finite
    type_err, msg_dtype if msg_dtype is not None else X.dtype
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

【Question discussion】:

Tags: python numpy tensorflow random-forest ensemble-learning

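【Solution 1】:

At first glance, I would say check your dataset for missing values, outliers, and so on.

A large part of building any ML model is data exploration and preprocessing. I found a beginner's guide that uses pandas: https://towardsdatascience.com/data-visualization-exploration-using-pandas-only-beginner-a0a52eb723d5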
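As a rough illustration of that kind of check, here is a minimal sketch. It assumes the CSV has a header row and possibly some empty cells, which may or may not be true of the file in the question; `csv_text` is a hypothetical in-memory stand-in for the real file. It shows that `np.genfromtxt` without `skip_header` turns a text header row into NaN, and that pandas can report missing cells per column.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV: a header row plus one empty cell.
csv_text = "f0,f1,label\n1.0,2.0,0\n3.0,,1\n5.0,6.0,0\n"

# Without skip_header, np.genfromtxt cannot parse the header text as floats,
# so the whole first row becomes NaN.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
header_row_is_nan = bool(np.isnan(raw[0]).all())

# pandas consumes the header row and counts missing cells per column.
df = pd.read_csv(io.StringIO(csv_text))
missing_per_column = df.isna().sum()

print("first row all NaN:", header_row_is_nan)
print(missing_per_column)
```

If the first row does turn out to be all NaN, passing `skip_header=1` to `np.genfromtxt` or loading the file with `pd.read_csv` may be enough; remaining missing cells still need to be dropped or imputed.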

【Discussion】:

OK... thanks.

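【Solution 2】:

This can happen inside scikit-learn, depending on what you are doing. I recommend reading the documentation for the functions you are using. You may be using one that requires, for example, your matrix to be positive definite, and your data does not meet that criterion.

Try checking for unexpected values with the following. For clean data the first call should return False and the second True: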

np.any(np.isnan(your_matrix))
np.all(np.isfinite(your_matrix))
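Those two calls only report whether a problem exists. A minimal sketch of locating and dropping the offending rows might look like the following. The arrays `X` and `y` here are small hypothetical stand-ins for the question's data, and dropping rows is only one option (imputing the missing values is another).

```python
import numpy as np

# Hypothetical feature matrix and labels standing in for X and Y.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.inf], [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

# The error message refers to float32, so check after the same cast:
# values too large for float32 would show up as inf here.
X32 = X.astype(np.float32)

# True for rows where every feature is finite (neither NaN nor +/-inf).
finite_rows = np.isfinite(X32).all(axis=1)
bad_rows = np.flatnonzero(~finite_rows)
print("rows with NaN or inf:", bad_rows)

# Keep only the clean rows, applying the same mask to the labels.
X_clean, y_clean = X[finite_rows], y[finite_rows]
```

After this, `X_clean` and `y_clean` can be passed to `train_test_split` and `fit` in place of the original arrays.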

【Discussion】: