【Question title】: AttributeError: 'numpy.ndarray' object has no attribute 'toarray'
【Posted】: 2013-12-07 17:53:27
【Question description】:

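I am extracting features from a text corpus using a tf-idf vectorizer and truncated singular value decomposition from scikit-learn. Because the algorithm I want to try requires dense matrices and the vectorizer returns sparse matrices, I need to convert those matrices to dense arrays. However, whenever I try to convert them, I get an error telling me that my numpy array object has no attribute 'toarray'. What am I doing wrong?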

The function:

def feature_extraction(train,train_test,test_set):
    vectorizer = TfidfVectorizer(min_df = 3,strip_accents = "unicode",analyzer = "word",token_pattern = r'\w{1,}',ngram_range = (1,2))        

    print("fitting Vectorizer")
    vectorizer.fit(train)

    print("transforming text")
    train = vectorizer.transform(train)
    train_test = vectorizer.transform(train_test)
    test_set = vectorizer.transform(test_set)

    print("Dimensionality reduction")
    svd = TruncatedSVD(n_components = 100)
    svd.fit(train)
    train = svd.transform(train)
    train_test = svd.transform(train_test)
    test_set = svd.transform(test_set)

    print("convert to dense array")
    train = train.toarray()
    test_set = test_set.toarray()
    train_test = train_test.toarray()

    print(train.shape)
    return train,train_test,test_set

Traceback:

Traceback (most recent call last):
  File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 24, in <module>
    x_train,x_test,test_set = feature_extraction(x_train,x_test,test_set)
  File "C:\Users\Anonymous\workspace\final_submission\src\Preprocessing.py", line 57, in feature_extraction
    train = train.toarray()
AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

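Update: As a commenter pointed out, my assumption that the matrices were sparse may have been wrong. So I tried feeding the dimensionality-reduced data to my algorithm, and it works without any conversion. But when I leave out the dimensionality reduction, which gives me about 53k features, I get the following error: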

Traceback (most recent call last):
  File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 28, in <module>
    result = bayesian_ridge(x_train,x_test,y_train,y_test,test_set)
  File "C:\Users\Anonymous\workspace\final_submission\src\Algorithms.py", line 84, in bayesian_ridge
    algo = algo.fit(x_train,y_train[:,i])
  File "C:\Python27\lib\site-packages\sklearn\linear_model\bayes.py", line 136, in fit
    dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 220, in check_arrays
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

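Can anyone explain this?

Update 2

As requested, I will provide all the code involved. Since it is spread across different files, I will post it in steps. For clarity, I will omit all module imports.

This is how I preprocess the data: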

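def regexp(data):
    # Replace runs of non-word characters and underscores with a single space.
    for row in range(len(data)):
        data[row] = re.sub(r'[\W_]+'," ",data[row])
    # Return after the loop so that every row is processed, not only the first.
    return data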

def clean_the_text(data):
    alist = []
    data = nltk.word_tokenize(data)
    for j in data:
        j = j.lower()
        alist.append(j.rstrip('\n'))
    alist = " ".join(alist)
    return alist
def loop_data(data):
    for i in range(len(data)):
        data[i] = clean_the_text(data[i])
    return data  


if __name__ == "__main__":
    print("loading train")
    train_text = porter_stemmer(loop_data(regexp(list(np.array(p.read_csv(os.path.join(dir,"train.csv")))[:,1]))))
    print("loading test_set")
    test_set = porter_stemmer(loop_data(regexp(list(np.array(p.read_csv(os.path.join(dir,"test.csv")))[:,1]))))

After splitting my train_set into x_train and x_test for cross-validation, I transform my data with the feature_extraction function above.

x_train,x_test,test_set = feature_extraction(x_train,x_test,test_set)

Finally, I feed them into my algorithm:

def bayesian_ridge(x_train,x_test,y_train,y_test,test_set):
    algo = linear_model.BayesianRidge()
    algo = algo.fit(x_train,y_train)
    pred = algo.predict(x_test)
    error = pred - y_test
    result.append(algo.predict(test_set))
    print("Bayes_error: ",cross_val(error))
    return result

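【Comments】:

  • If train is already an ndarray, then your assumption that a sparse matrix is being returned is incorrect.
  • You may be right, let me check.
  • Checked. I will edit my question now.
  • You should include all of the code, not just the error message. An ndarray is dense by definition, and sparse matrices are represented by different objects, so there is an error somewhere in your code (which you have not attached).
  • OK, I will add all the code involved.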
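
The distinction raised in these comments can be checked directly. The following is a minimal sketch, assuming a tiny made-up corpus (`docs` is not from the question) and a default `TfidfVectorizer`; it only illustrates which object carries a `toarray` method.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus, for illustration only (not the asker's data).
docs = ["the cat sat", "the dog sat", "the cat ran"]

# The vectorizer returns a SciPy sparse matrix.
X_sparse = TfidfVectorizer().fit_transform(docs)

# toarray() exists on the sparse matrix and yields a plain numpy.ndarray.
X_dense = X_sparse.toarray()

print(sparse.issparse(X_sparse))        # is it a sparse container?
print(isinstance(X_dense, np.ndarray))  # is it already dense?
print(hasattr(X_dense, "toarray"))      # does the ndarray offer toarray()?
```

Calling `toarray()` a second time on `X_dense` would raise the same AttributeError as in the question, because the method belongs to the sparse matrix type and not to `numpy.ndarray`.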

Tags: python numpy machine-learning scikit-learn


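【Solution 1】:

TruncatedSVD.transform returns a dense array, not a sparse matrix, so there is nothing left to convert after the SVD step. In fact, in the current version of scikit-learn (at the time of this answer), only the vectorizers return sparse matrices.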
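
One way to handle both cases from the question (with and without dimensionality reduction) is to convert only when the data is actually sparse. The sketch below is an assumption-laden illustration, not the asker's final code: `to_dense`, `extract_features`, the `n_components=None` convention for skipping the SVD, and the sample documents are all hypothetical, and the vectorizer uses default settings rather than the ones in the question.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def to_dense(X):
    """Return a dense array, converting only if X is a SciPy sparse matrix."""
    return X.toarray() if sparse.issparse(X) else X


def extract_features(train, test, n_components=None):
    """Hypothetical variant of feature_extraction.

    With n_components=None the SVD step is skipped. The returned
    matrices are dense in both cases.
    """
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train)  # sparse matrix
    X_test = vectorizer.transform(test)        # sparse matrix

    if n_components is not None:
        svd = TruncatedSVD(n_components=n_components)
        X_train = svd.fit_transform(X_train)   # already a dense ndarray
        X_test = svd.transform(X_test)

    return to_dense(X_train), to_dense(X_test)


# Made-up documents, for illustration only.
train_docs = ["the cat sat", "the dog sat", "the cat ran", "a dog ran"]
test_docs = ["the cat", "a dog"]

svd_train, svd_test = extract_features(train_docs, test_docs, n_components=2)
raw_train, raw_test = extract_features(train_docs, test_docs)
print(svd_train.shape, raw_train.shape)
```

Note that without the SVD step the dense matrix has one column per vocabulary term, so with roughly 53k features the memory cost of `toarray()` grows with the number of documents; whether that is acceptable depends on the size of the corpus.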
