【Question title】: ValueError: setting an array element with a sequence while training KD-Tree on TF-IDF
【Posted】: 2016-10-31 06:44:00
【Question】:

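I am trying to train a KD-Tree on the TF-IDF representation of a document corpus, but it raises

ValueError: setting an array element with a sequence.

The code and the full error are below. Can someone help me fix this?

Code: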

t0 = time.time()
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

t1 = time.time()
total = t1-t0
print "TF-IDF built:", total

#######################------------------------############################

t0 = time.time()
#nbrs = NearestNeighbors(n_neighbors=20, algorithm='kd_tree', metric='euclidean')
#nbrs.fit(X_train_tfidf)#,Y)
nbrs = KDTree(np.array(X_train_tfidf), leaf_size=100) 


t1 = time.time()
total = t1-t0
print "KNN Trained:", total

#######################------------------------############################

Here is the error:

TF-IDF built: 0.108999967575
Traceback (most recent call last):
  File ".\tfidf_knn.py", line 48, in <module>
    nbrs = KDTree(np.array(X_train_tfidf), leaf_size=100)
  File "sklearn/neighbors/binary_tree.pxi", line 1055, in sklearn.neighbors.kd_tree.BinaryTree.__init__ (sklearn\neighbors\kd_tree.c:8298)
  File "C:\Anaconda2\lib\site-packages\numpy\core\numeric.py", line 474, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

【Comments】:

    Tags: python numpy scikit-learn knn tf-idf


    【Answer 1】:

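    X_train_tfidf is a sparse matrix (scipy.sparse). np.array() does not densify it; it only wraps the sparse object in a 0-d object array, which KDTree cannot convert to floats, hence the error. To get a real NumPy array you need to call .toarray(). This example works for me: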

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    import time
    from sklearn.neighbors import KDTree
    from scipy.sparse import csr_matrix # sparse format compatible with sklearn models
    from sklearn.neighbors import NearestNeighbors
    
    
    import numpy as np
    X=[ 'I Love dogs' ,
    'you love cats',
    ' He loves Birds',
    ' she loves lizards',
    ' None loves me'
    ]
    t0 = time.time()
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(X)
    
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
    
    t1 = time.time()
    total = t1-t0
    print "TF-IDF built:", total
    
    #######################------------------------############################
    
    t0 = time.time()
    nbrs = KDTree(X_train_tfidf.toarray(), leaf_size=100) 
    
    ################## KDTree does not accept sparse input, but brute-force search does #################
    #nbrs = NearestNeighbors(n_neighbors=20, algorithm='brute')
    #nbrs.fit(csr_matrix(X_train_tfidf))#,Y)
    
    
    t1 = time.time()
    total = t1-t0
    print "KNN Trained:", total
    

    Output:

    TF-IDF built: 0.00499987602234
    KNN Trained: 0.029000043869
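
To make the cause of the error more concrete, here is a minimal sketch (not part of the original answer) comparing what np.array() and .toarray() return for a sparse TF-IDF matrix. The variable names `docs`, `wrapped`, `dense` and `tree` are made up for illustration, and the toy corpus is the one from the answer with whitespace tidied.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import KDTree

# Toy corpus (illustrative only)
docs = ["I love dogs", "you love cats", "he loves birds",
        "she loves lizards", "none loves me"]

counts = CountVectorizer().fit_transform(docs)
X_sparse = TfidfTransformer().fit_transform(counts)  # scipy.sparse matrix

# np.array() does not densify: it wraps the whole sparse object
# in a 0-dimensional array of dtype object.
wrapped = np.array(X_sparse)
print(wrapped.dtype, wrapped.shape)

# .toarray() produces an ordinary 2-D float array that KDTree accepts.
dense = X_sparse.toarray()
print(dense.dtype, dense.shape)

tree = KDTree(dense, leaf_size=100)
dist, ind = tree.query(dense[:1], k=2)  # 2 nearest neighbours of the first document
print(ind)
```

Note that the dense array stores every zero explicitly, so this approach only fits corpora whose full documents-by-vocabulary matrix fits in memory.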
    

    【讨论】:

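    • Thanks for your help! It works on small data, but when I give it a huge array I get a memory error, because after calling "toarray()" the matrix is no longer sparse. Is there a way to give KDTree a sparse matrix?
    • Hey, see my edit. You cannot use kd_tree with sparse input, but you can switch the algorithm to brute. The results should not differ much. You also need to convert the sparse matrix to another form that is more compatible with sklearn models (csr_matrix).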
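
The brute-force suggestion from the comment above could look roughly like the sketch below. It is an illustration under assumptions rather than the answerer's exact code: the names `docs`, `X_tfidf` and `nbrs` are hypothetical, and n_neighbors=2 is chosen only because the toy corpus has five documents. The csr_matrix() call follows the comment's advice; the TF-IDF output is typically already in CSR format, so the conversion is likely a no-op.

```python
from scipy.sparse import csr_matrix, issparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import NearestNeighbors

# Toy corpus (illustrative only)
docs = ["I love dogs", "you love cats", "he loves birds",
        "she loves lizards", "none loves me"]

counts = CountVectorizer().fit_transform(docs)
# Keep the matrix sparse; no .toarray() call, so memory stays proportional
# to the number of non-zero entries.
X_tfidf = csr_matrix(TfidfTransformer().fit_transform(counts))

# algorithm='brute' supports sparse input, unlike the KD-tree.
nbrs = NearestNeighbors(n_neighbors=2, algorithm='brute', metric='euclidean')
nbrs.fit(X_tfidf)

# Query the neighbours of the first document, still as a sparse row.
distances, indices = nbrs.kneighbors(X_tfidf[:1])
print(indices)
```

The trade-off is query time: brute-force search compares each query against every indexed document instead of pruning with a tree.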