Dictionary hashing memory error and feature hashing float error: answers

【Question title】: Dictionary hashing memory error and feature hashing float error
【Posted】: 2018-01-18 03:24:08
【Question】:

This is my data (as a pandas DataFrame):

print(X_train[numeric_predictors + categorical_predictors].head()):

        bathrooms  bedrooms   price                       building_id  \
10            1.5       3.0  3000.0  53a5b119ba8f7b61d4e010512e0dfc85   
10000         1.0       2.0  5465.0  c5c8a357cba207596b04d1afd1e4f130   
100004        1.0       1.0  2850.0  c3ba40552e2120b0acfc3cb5730bb2aa   
100007        1.0       1.0  3275.0  28d9ad350afeaab8027513a3e52ac8d5   
100013        1.0       4.0  3350.0                                 0  

99993         1.0       0.0   3350.0  ad67f6181a49bde19218929b401b31b7   
99994         1.0       2.0   2200.0  5173052db6efc0caaa4d817112a70f32   


                              manager_id  
10      5ba989232d0489da1b5f2c45f6688adc  
10000   7533621a882f71e25173b27e3139d83d  
100004  d9039c43983f6e564b1482b273bd7b01  
100007  1067e078446a7897d2da493d2f741316  
100013  98e13ad4b495b9613cef886d79a6291f  
...
99993   9fd3af5b2d23951e028059e8940a55d7  
99994   d7f57128272bfd82e33a61999b5f4c42

The last two columns are the categorical predictors.

Similarly, printing the pandas Series X_train[target]:

10        medium
10000        low
100004      high
100007       low
100013       low
...
99993        low
99994        low

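I am trying to use a pipeline template, but I get errors when using the hashing vectorizers.

First, here is my dictionary vectorizer, which gives me a MemoryError: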

import pandas as pd  # needed for pd.DataFrame below
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
dv.fit(feature_dict)
out = pd.DataFrame(
    dv.transform(feature_dict),
    columns = dv.feature_names_
)

So in the next cell I use the following code as my feature-hashing encoder:

from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(n_features=2)
feature_dict = X_train[categorical_predictors].to_dict(orient='records')
fh.fit(feature_dict)
out = pd.DataFrame(fh.transform(feature_dict).toarray())
# print(out.head())

The commented-out print line gives me a DataFrame in which each row has 2 cells, each holding a float of -1.0, 0.0, or 1.0.

Here is my vectorizer, which combines the dictionary vectorizer and the feature hasher:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import FeatureHasher, DictVectorizer

class MyVectorizer(BaseEstimator, TransformerMixin):
    """
    Vectorize a set of categorical variables
    """

    def __init__(self, cols, hashing=None):
        """
        args:
            cols: a list of column names of the categorical variables
            hashing: 
                If None, then vectorization is a simple one-hot-encoding.
                If an integer, then hashing is the number of features in the output.
        """
        self.cols = cols
        self.hashing = hashing

    def fit(self, X, y=None):

        data = X[self.cols]

        # Choose a vectorizer
        if self.hashing is None:
            self.myvec = DictVectorizer(sparse=False)
        else:
            self.myvec = FeatureHasher(n_features = self.hashing)

        self.myvec.fit(X[self.cols].to_dict(orient='records'))
        return self

    def transform(self, X):

        # Vectorize Input
        if self.hashing is None:
            return pd.DataFrame(
                self.myvec.transform(X[self.cols].to_dict(orient='records')),
                columns = self.myvec.feature_names_
            )
        else:
            return pd.DataFrame(
                self.myvec.transform(X[self.cols].to_dict(orient='records')).toarray()
            )

I put them together in my pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion

pipeline = Pipeline([
    ('preprocess', FeatureUnion([
        ('numeric', Pipeline([
            ('scale', StandardScaler())
        ])
        ),
        ('categorical', Pipeline([
            ('vectorize', MyVectorizer(cols=['categorical_predictors'], hashing=None))
        ])
        )
    ])),
    ('predict', MultinomialNB(alphas))
])

And the alpha parameters:

alphas = {
    'predict__alpha': [.01, .1, 1, 2, 10]
}

Then, using GridSearchCV, I get an error when I fit it on the third line here:

print(X_train.head(), train_data[target])
grid_search = GridSearchCV(pipeline, param_grid=alphas,scoring='accuracy')
grid_search.fit(X_train[numeric_predictors + categorical_predictors], X_train[target])
grid_search.best_params_

ValueError: could not convert string to float: d7f57128272bfd82e33a61999b5f4c42

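【Comments】:

Could you add some sample data for which this error occurs? Also, please edit the code so that it is complete and in the order you run it, so that we can easily copy, paste, and debug it.
Hi, I did as you suggested. Please take a look and let me know. Thanks!
Please help, I am still getting this error.
I asked you to add the full error stack trace. Instead, you posted a new question with no additional information. In any case, this error is caused by StandardScaler: you are sending all of the data to it.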

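Tags: python scikit-learn pipeline grid-search

【Solution 1】:

The error is caused by StandardScaler. You are sending all of the data into it, which is wrong. In the FeatureUnion part of your pipeline you select the categorical columns for MyVectorizer, but you make no selection for StandardScaler, so every column, including the string ones, goes into it, and that causes the error. Also, since each inner pipeline consists of only a single step, the inner pipelines are not needed.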
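
A minimal reproduction of this, as a sketch: the frame below uses made-up values shaped like the question's data, and try_scale is a hypothetical helper introduced only for this illustration. StandardScaler accepts the numeric columns but raises a ValueError as soon as a string column is included.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame mirroring the question's layout (values are made up for illustration).
X = pd.DataFrame({
    "bathrooms": [1.5, 1.0, 1.0],
    "price": [3000.0, 5465.0, 2850.0],
    "manager_id": ["5ba98923", "7533621a", "d9039c43"],
})


def try_scale(frame):
    """Hypothetical helper: return the error text if StandardScaler rejects the frame, else None."""
    try:
        StandardScaler().fit(frame)
    except ValueError as exc:
        return str(exc)
    return None


# All columns, including the string column: StandardScaler cannot convert the IDs to floats.
all_columns_error = try_scale(X)
# Numeric columns only: fitting succeeds.
numeric_only_error = try_scale(X[["bathrooms", "price"]])

print(all_columns_error)
print(numeric_only_error)
```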

As a first step, change the pipeline to:

pipeline = Pipeline([
    ('preprocess', FeatureUnion([
        ('scale', StandardScaler()),
        ('vectorize', MyVectorizer(cols=['categorical_predictors'], hashing=None))
    ])),
    ('predict', MultinomialNB())
])

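This will still throw the same error, but it looks less complicated now.

Now all we need is something that selects the columns to pass to StandardScaler (the numeric ones), so that the error is not raised.

We can do this in many ways, but I will follow your coding style and create a new class, MyScaler, that does the selection.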

class MyScaler(BaseEstimator, TransformerMixin):

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):

        self.scaler = StandardScaler()
        self.scaler.fit(X[self.cols])
        return self

    def transform(self, X):
        return self.scaler.transform(X[self.cols])
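
To check that the class above ignores the string columns, a small frame can be passed through it. This is only a sketch: the data values are made up, and the class is repeated so that the snippet runs on its own.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class MyScaler(BaseEstimator, TransformerMixin):
    # Same class as above, repeated so that this snippet is self-contained.

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        self.scaler = StandardScaler()
        self.scaler.fit(X[self.cols])
        return self

    def transform(self, X):
        return self.scaler.transform(X[self.cols])


# Made-up rows shaped like the question's data.
X = pd.DataFrame({
    'bathrooms': [1.5, 1.0, 1.0, 1.0],
    'bedrooms': [3.0, 2.0, 1.0, 4.0],
    'price': [3000.0, 5465.0, 2850.0, 3350.0],
    'manager_id': ['m1', 'm2', 'm1', 'm3'],
})

# Only the three numeric columns are scaled; manager_id never reaches StandardScaler.
scaled = MyScaler(cols=['bathrooms', 'bedrooms', 'price']).fit_transform(X)
print(scaled.shape)
print(scaled.mean(axis=0))  # each column is centered close to zero
```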

Then change the pipeline to:

numeric_predictors=['bathrooms','bedrooms','price']
categorical_predictors = ['building_id','manager_id']

pipeline = Pipeline([
    ('preprocess', FeatureUnion([
        ('scale', MyScaler(cols=numeric_predictors)),
        ('vectorize', MyVectorizer(cols=['categorical_predictors'], hashing=None))
    ])),
    ('predict', MultinomialNB())
])

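This will still raise an error, because you passed ['categorical_predictors'], a list containing the literal string 'categorical_predictors', to MyVectorizer instead of the categorical_predictors list itself. Change it the same way I did for MyScaler. Change

MyVectorizer(cols=['categorical_predictors'], hashing=None)

to: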

MyVectorizer(cols=categorical_predictors, hashing=None)
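
A side note on hashing=None: in that mode MyVectorizer builds DictVectorizer(sparse=False), the same call that produced the MemoryError in the question. A plausible cause, which is an assumption here and not something confirmed in the thread, is that nearly every building_id and manager_id becomes its own dense column. The sketch below uses made-up IDs to compare how many cells a dense array would allocate with how many a sparse matrix actually stores.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records in which every row has its own ID, similar to hashed IDs.
records = [{"manager_id": "id_%d" % i} for i in range(1000)]

# sparse=True is the DictVectorizer default; the result is a SciPy sparse matrix.
sparse_vec = DictVectorizer(sparse=True)
encoded = sparse_vec.fit_transform(records)

n_rows, n_cols = encoded.shape
dense_cells = n_rows * n_cols  # cells a dense array of this shape would allocate
stored_cells = encoded.nnz     # non-zero cells the sparse matrix actually stores

print(encoded.shape, dense_cells, stored_cells)
```

To use this inside MyVectorizer, transform would have to return the sparse matrix directly instead of wrapping it in pd.DataFrame; FeatureUnion can stack sparse and dense outputs together.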

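Now your code is syntactically ready to run. However, you are using MultinomialNB() as your predictor, and it requires non-negative feature values. Because StandardScaler centers the data to zero mean, it turns some values negative, so your code will fail again. You need to decide what to do about that; one option is to switch to MinMaxScaler.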
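
The difference between the two scalers can be sketched end to end. This is not the code from the answer: it swaps in ColumnTransformer and OneHotEncoder (available in scikit-learn 0.20 and later, which is newer than this question) in place of MyScaler and MyVectorizer, the data values are made up, and build_pipeline and fits are hypothetical helpers.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Made-up rows shaped like the question's data.
X_train = pd.DataFrame({
    'bathrooms': [1.5, 1.0, 1.0, 1.0],
    'bedrooms': [3.0, 2.0, 1.0, 4.0],
    'price': [3000.0, 5465.0, 2850.0, 3350.0],
    'building_id': ['b1', 'b2', 'b3', '0'],
    'manager_id': ['m1', 'm2', 'm1', 'm3'],
})
y_train = ['medium', 'low', 'high', 'low']

numeric_predictors = ['bathrooms', 'bedrooms', 'price']
categorical_predictors = ['building_id', 'manager_id']


def build_pipeline(scaler):
    """Hypothetical helper: the same pipeline shape with a pluggable numeric scaler."""
    return Pipeline([
        ('preprocess', ColumnTransformer([
            ('scale', scaler, numeric_predictors),
            # One-hot columns are 0/1; categories unseen during fit become all zeros.
            ('vectorize', OneHotEncoder(handle_unknown='ignore'), categorical_predictors),
        ])),
        ('predict', MultinomialNB()),
    ])


def fits(pipeline):
    """Hypothetical helper: True if fitting succeeds, False if it raises ValueError."""
    try:
        pipeline.fit(X_train, y_train)
    except ValueError:
        return False
    return True


# StandardScaler produces negative values, which MultinomialNB rejects during fit.
standard_ok = fits(build_pipeline(StandardScaler()))
# MinMaxScaler maps each training column to [0, 1], so fitting goes through.
minmax_ok = fits(build_pipeline(MinMaxScaler()))

print(standard_ok, minmax_ok)
```

The step names are kept as in the original pipeline, so a GridSearchCV grid keyed on 'predict__alpha' should still apply. Note also that the FeatureHasher output shown in the question contains -1.0 values; if hashing is used with MultinomialNB, its alternate_sign=False option is one way to keep the hashed features non-negative.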

【Discussion】:

Hi, I cleaned things up a bit, but I am still running into a problem similar to the earlier one: stackoverflow.com/questions/45723699/…