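[Posted]: 2018-09-13 08:59:23
[Problem description]:
I have 3 word lists that belong, in order, to the classes Athlete, Comedian, and Singer. I vectorized the 3 lists with scikit-learn using TF-IDF weighting to obtain the x_tfidf matrix below (the training data):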
y = ['Athlete', 'Comedian', 'Singer']
x_tfidf = [[0. 0. 0. 0. 0. 0.01707793
0.17077928 0.01707793 0.01707793 0.01707793 0.0129882 0.01707793
0. 0.02597641 0. 0. 0.01707793 0.
0. 0.06831171 0. 0. 0.0129882 0.03415586
0.01707793 0.01707793 0.03415586 0. 0.01707793 0.
0.0129882 0. 0. 0. 0. 0.
0.01707793 0.01707793 0. 0.01707793 0. 0.01707793
0. 0. 0.01707793 0. 0. 0.
0. 0. 0.01707793 0. 0.0302595 0.
0.01707793 0. 0.02597641 0. 0. 0.
0. 0.03415586 0.01707793 0.55475746 0.01707793 0.
0. 0. 0. 0. 0.01707793 0.
0. 0.01707793 0. 0. 0.01707793 0.
0. 0.03415586 0.06831171 0.01707793 0. 0.03415586
0. 0.01707793 0.0129882 0. 0. 0.01707793
0.05195282 0.02597641 0.020173 0.0129882 0.060519 0.02597641
0. 0.01707793 0. 0.55475746 0.55475746 0.01707793
0. 0.0302595 0.01707793 0. 0. 0.
0. 0.01707793 0. 0.03415586 0. 0.
0. 0.02597641 0.03415586 0.01707793 0. 0.05195282
0. 0. 0. 0. 0. 0.
0.03415586 0. 0.02597641 0.01707793 0. 0.
0. 0. 0.0129882 0. 0.03415586 0.
0.05123378]
[0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0.00791998 0.00791998 0. 0.
0. 0. 0. 0.03167991 0. 0.01583996
0.00602335 0. 0.00791998 0. 0. 0.
0. 0. 0. 0. 0.00791998 0.
0. 0. 0. 0.00602335 0.00791998 0.00602335
0.00602335 0.00791998 0. 0. 0.014033 0.
0. 0.01583996 0. 0. 0. 0.
0.00791998 0. 0. 0.57535302 0. 0.
0. 0. 0. 0.01807004 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.00791998 0.
0. 0. 0. 0.00791998 0. 0.
0. 0. 0.00467767 0. 0.00467767 0.
0.00791998 0. 0. 0.57535302 0.57535302 0.
0. 0.028066 0. 0. 0.01807004 0.01807004
0.03167991 0. 0.03167991 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0.00791998 0. 0.00602335
0. 0.00791998 0. 0. 0.01807004 0.00791998
0. 0. 0. 0.00791998 0. 0.
0. ]
[0.00527285 0.00527285 0.00175762 0.01230331 0.01230331 0.
0. 0. 0. 0. 0.00133671 0.
0.05800134 0.31546417 0.00175762 0.00351523 0. 0.00175762
0.00175762 0. 0. 0. 0.00133671 0.
0. 0. 0. 0. 0. 0.
0. 0.00175762 0. 0.00527285 0.00175762 0.00175762
0. 0. 0.00175762 0. 0. 0.
0.00175762 0.00527285 0. 0.00133671 0. 0.00133671
0.00133671 0. 0. 0.00175762 0.00103808 0.00175762
0. 0. 0.27268937 0.00351523 0.00351523 0.00175762
0. 0. 0. 0.11937881 0. 0.0105457
0.00527285 0.00175762 0.00175762 0.00133671 0. 0.00175762
0.00175762 0. 0.02460663 0.00527285 0. 0.00175762
0.00175762 0. 0. 0. 0. 0.
0.00175762 0. 0.00401014 0. 0.00175762 0.
0.01737726 0.29675019 0.21591993 0.00133671 0.22214839 0.31412746
0. 0. 0.00175762 0.09654112 0.11937881 0.
0.00351523 0.00207615 0. 0.00527285 0.00133671 0.00133671
0. 0. 0. 0. 0.00351523 0.00175762
0.00175762 0.00133671 0. 0. 0.00527285 0.63360177
0.00175762 0.00703047 0.0105457 0. 0.00351523 0.00935699
0. 0. 0.31412746 0. 0.00133671 0.
0.00175762 0.00175762 0.00133671 0. 0. 0.0105457
0. ]]
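My goal is to try several classifiers and compare the output of different machine learning algorithms in scikit-learn, that is, to predict whether a user is an athlete, a comedian, or a singer from a word list that will serve as test data. I tried KNN with the following code: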
from sklearn import neighbors

def classify(x_tfidf, y):
    knn = neighbors.KNeighborsClassifier()
    knn.fit(x_tfidf, y)
However, I get the following error:
Traceback (most recent call last):
  File "bow.py", line 115, in <module>
    checkExists()
  File "bow.py", line 28, in checkExists
    get_tags(table)
  File "bow.py", line 34, in get_tags
    format_tags(data)
  File "bow.py", line 56, in format_tags
    vectorize(acc_list)
  File "bow.py", line 86, in vectorize
    classify(x_tag_tfidf, y)
  File "bow.py", line 95, in classify
    knn.fit(x_tag_tfidf, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 765, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 583, in check_X_y
    check_consistent_length(X, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 204, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 3]
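I tried converting `y` to a NumPy array and to a NumPy matrix, without success. I would be very grateful if someone could point me in the right direction.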
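For reference, the two numbers in the error message are the sample counts scikit-learn sees for `X` and `y`, in that order, so `[1, 3]` suggests the object passed as `X` was read as a single sample while `y` has three labels. The sketch below reproduces that situation on made-up data. The small matrix, the `x_good`/`x_bad` names, and the `n_samples` helper are all hypothetical stand-ins, not the actual `x_tag_tfidf` from `bow.py`, and flattening everything into one row is only one possible way to end up with a single sample.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y = ['Athlete', 'Comedian', 'Singer']

# Hypothetical stand-in for the real TF-IDF matrix: 3 samples, 4 features.
x_good = np.array([[0.1, 0.0, 0.3, 0.0],
                   [0.0, 0.5, 0.0, 0.2],
                   [0.4, 0.0, 0.0, 0.6]])

# Assumed failure mode for illustration: all values end up in a single row.
x_bad = x_good.reshape(1, -1)


def n_samples(x):
    """Number of rows (samples) in an array-like input."""
    return np.asarray(x).shape[0]


print(n_samples(x_bad), len(y))   # sample counts disagree
print(n_samples(x_good), len(y))  # sample counts agree

# n_neighbors=1 because there are only 3 training samples; the default of 5
# neighbors would exceed the number of samples at prediction time.
knn = KNeighborsClassifier(n_neighbors=1)

try:
    knn.fit(x_bad, y)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# With one row per label, fitting and predicting go through.
knn.fit(x_good, y)
prediction = knn.predict(x_good[:1])[0]
print(prediction)
```

Printing `x_tag_tfidf.shape` (or `type(x_tag_tfidf)`) right before the call to `knn.fit` would show whether the first dimension really is 3 at that point in the script.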
[Discussion]:
Tags: python machine-learning scikit-learn classification text-classification