【Question Title】: sklearn's yeo-johnson PowerTransformer throws "ValueError: Input contains infinity" when data has no large/inf/nan values
【Posted】: 2021-08-28 18:06:54
【Question】:

The Yeo-Johnson method of PowerTransformer in sklearn (0.21.3; Python 3.6) raises the error

ValueError: Input contains infinity or a value too large for dtype('float64').

even though the data is perfectly valid. Am I overlooking something, or is this a bug?

Code to reproduce:

import sklearn
from sklearn.preprocessing import PowerTransformer
import numpy as np
import pandas as pd

print(f"sklearn version = {sklearn.__version__}")

data = np.array([1000]*100 + [980]).reshape(-1, 1)
print(f"Data stats:\n{pd.DataFrame(data).describe()}")

## Powertransform. It will give an error: "Input contains infinity or a value too large for dtype('float64')"
pt = PowerTransformer(method="yeo-johnson")
pt.fit(data)

The output I get:

sklearn version = 0.21.3
Data stats:
                 0
count   101.000000
mean    999.801980
std       1.990074
min     980.000000
25%    1000.000000
50%    1000.000000
75%    1000.000000
max    1000.000000
/home/jupyter/.local/lib/python3.6/site-packages/sklearn/preprocessing/data.py:2828: RuntimeWarning:

overflow encountered in power

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-51-e81214808bec> in <module>()
      8 ## Powertransform. It will give ""
      9 pt = PowerTransformer(method="yeo-johnson")
---> 10 pt.fit(data)

~/.local/lib/python3.6/site-packages/sklearn/preprocessing/data.py in fit(self, X, y)
   2672         self : object
   2673         """
-> 2674         self._fit(X, y=y, force_transform=False)
   2675         return self
   2676 

~/.local/lib/python3.6/site-packages/sklearn/preprocessing/data.py in _fit(self, X, y, force_transform)
   2703                 X = self._scaler.fit_transform(X)
   2704             else:
-> 2705                 self._scaler.fit(X)
   2706 
   2707         return X

~/.local/lib/python3.6/site-packages/sklearn/preprocessing/data.py in fit(self, X, y)
    637         # Reset internal state before fitting
    638         self._reset()
--> 639         return self.partial_fit(X, y)
    640 
    641     def partial_fit(self, X, y=None):

~/.local/lib/python3.6/site-packages/sklearn/preprocessing/data.py in partial_fit(self, X, y)
    661         X = check_array(X, accept_sparse=('csr', 'csc'), copy=self.copy,
    662                         estimator=self, dtype=FLOAT_DTYPES,
--> 663                         force_all_finite='allow-nan')
    664 
    665         # Even in the case of `with_mean=False`, we update the mean anyway

~/.local/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    540         if force_all_finite:
    541             _assert_all_finite(array,
--> 542                                allow_nan=force_all_finite == 'allow-nan')
    543 
    544     if ensure_min_samples > 0:

~/.local/lib/python3.6/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan)
     54                 not allow_nan and not np.isfinite(X).all()):
     55             type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 56             raise ValueError(msg_err.format(type_err, X.dtype))
     57     # for object dtype data, we only check for NaNs (GH-13254)
     58     elif X.dtype == np.dtype('object') and not allow_nan:

ValueError: Input contains infinity or a value too large for dtype('float64').

I have looked at other posts (here and here) that deal with large or inf values. In this case, no value is greater than 1000.

【Comments】:

    Tags: python scikit-learn


    【Solution 1】:

    This is not a bug; it follows from how PowerTransformer works internally. Look at these lines of the error stack trace:

    ~/.local/lib/python3.6/site-packages/sklearn/preprocessing/data.py in _fit(self, X, y, force_transform)
       2703                 X = self._scaler.fit_transform(X)
       2704             else:
    -> 2705                 self._scaler.fit(X)
       2706 
       2707         return X
    

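    The standardize parameter of PowerTransformer defaults to True. In that case, the data is first power-transformed during the call to fit, and the transformed data is then scaled by a StandardScaler (see the source code here).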
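The two stages described above can be sketched on well-behaved data. This is only an illustration: the log-normal sample below is hypothetical (it is not the data from the question), and the comparison assumes that the built-in standardization is equivalent to applying a separate StandardScaler to the raw power-transformed values.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Hypothetical, well-behaved sample data (not the data from the question).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 1))

# One step: power transform followed by the built-in standardization.
one_step = PowerTransformer(method="yeo-johnson", standardize=True).fit_transform(X)

# Two steps: raw power transform, then an explicit StandardScaler.
raw = PowerTransformer(method="yeo-johnson", standardize=False).fit_transform(X)
two_step = StandardScaler().fit_transform(raw)

# Both routes should agree up to floating-point noise.
print(np.allclose(one_step, two_step))
```

If the raw power-transformed values contain inf, it is the second stage (the scaler) that rejects them, which matches the stack trace above.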

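    The problem is that your transformed data becomes an array of inf values. You can confirm this by using scipy's corresponding yeojohnson function to obtain the lambda of the Yeo-Johnson transformation for your data and then inspecting the transformed values: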

    from scipy.stats import yeojohnson
    import numpy as np
    
    
    data = np.array([1000]*100 + [980])
    
    _, lmbda = yeojohnson(data)
    print(lmbda)  # 291.47777013
    
    data_t = (np.power(data + 1, lmbda) - 1) / lmbda 
    

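    data_t is the result of the Yeo-Johnson transformation, and it contains only inf values. This array is passed to StandardScaler, which complains that its "input" contains inf values. So the error is not about your original data but about the transformed data.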
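As a rough check of why every entry overflows, the sketch below reuses the lambda value printed above (291.47777013) and compares the order of magnitude of (x + 1)^lambda with the largest finite float64. It uses only NumPy; the variable names are illustrative.

```python
import numpy as np

data = np.array([1000] * 100 + [980], dtype=np.float64)
lmbda = 291.47777013  # lambda reported by scipy's yeojohnson above

# Yeo-Johnson for non-negative inputs and lambda != 0:
# ((x + 1) ** lambda - 1) / lambda
with np.errstate(over="ignore"):  # silence the expected overflow warning
    data_t = (np.power(data + 1, lmbda) - 1) / lmbda

print(np.isinf(data_t).all())  # every transformed value is inf

# Compare decimal orders of magnitude: needed vs. representable.
needed_digits = lmbda * np.log10(data.min() + 1)
max_digits = np.log10(np.finfo(np.float64).max)
print(needed_digits, max_digits)
```

Even the smallest value, 980, needs an exponent of roughly 870 decimal digits, while float64 tops out near 308, so the whole array overflows to inf.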

    You can avoid the error by setting standardize=False, in which case fitting runs without raising:

    from sklearn.preprocessing import PowerTransformer
    import numpy as np
    
    
    data = np.array([1000]*100 + [980]).reshape(-1, 1)
    
    pt = PowerTransformer(method="yeo-johnson", standardize=False)
    data_t = pt.fit_transform(data)
    

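    However, along with a RuntimeWarning, you will still get an array of inf values, which is probably of no use. That is not due to a bug; it is the actual result of the transformation.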

    【Comments】:

    • @VinayKolar Does this answer your question?
    • Thanks, @afsharov. That explains why. The lambda seems to be too high for this data. Setting standardize=False is not useful, as you mentioned.