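【Posted】: 2020-05-24 23:17:41
【Question】:
I cannot encode my data with the label encoder in scikit-learn.
dataset.csv has two columns: text and label.
I am trying to read the text from the dataset into one list and the labels into another list, then add both lists to a DataFrame, but it does not seem to work.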
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble
import pandas, xgboost, numpy, string
data = open('dataset.csv').read()
labels = []
texts = []
for i, line in enumerate(data.split("\n")):
    content = line.split("\",")
    texts.append(content[0])
    labels.append(content[1:])
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'],trainDF['label'],test_size = 0.2,random_state = 0)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['texts'])
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(trainDF['texts'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
accuracy = train_model(svm.SVC(), xtrain_tfidf, train_y, xvalid_tfidf)
print(accuracy)
Error:
Traceback (most recent call last):
  File "/home/crackthumb/environments/my_env/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 105, in _encode
    res = _encode_python(values, uniques, encode)
  File "/home/crackthumb/environments/my_env/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 59, in _encode_python
    uniques = sorted(set(values))
TypeError: unhashable type: 'list'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Classifier.py", line 21, in <module>
    train_y = encoder.fit_transform(train_y)
  File "/home/crackthumb/environments/my_env/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 236, in fit_transform
    self.classes_, y = _encode(y, encode=True)
  File "/home/crackthumb/environments/my_env/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 107, in _encode
    raise TypeError("argument must be a string or number")
TypeError: argument must be a string or number
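The first traceback ends at sorted(set(values)), a step that only works when every label is hashable. The sketch below reproduces that failure without scikit-learn. The label values are made up for illustration and are not taken from dataset.csv; they are only shaped like what labels.append(content[1:]) produces.

```python
# Hypothetical labels shaped like the ones built by labels.append(content[1:]):
# each entry is a list, not a string.
labels_as_lists = [["spam"], ["ham"], ["spam"]]

try:
    sorted(set(labels_as_lists))
    error_message = None
except TypeError as exc:
    # Lists are mutable and therefore unhashable, so set() rejects them.
    error_message = str(exc)

print(error_message)  # the message mentions: unhashable type: 'list'

# With plain strings the same step succeeds.
labels_as_strings = [item[0] for item in labels_as_lists]
unique_labels = sorted(set(labels_as_strings))
print(unique_labels)  # ['ham', 'spam']
```

The second TypeError ("argument must be a string or number") is raised while the first one is being handled, so both point at the same cause.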
【Comments】:
- It looks like labels is a list of lists, not a list of strings, and that is the problem.
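One possible way to act on this comment: content[1:] is a slice and always returns a list, whereas content[1] returns a single string. The sketch below swaps the manual split for Python's csv module. It assumes every row of dataset.csv has exactly two columns (text, then label) and no header row; the sample rows are invented for illustration. The last few lines are a plain-Python stand-in for the encoding step, not scikit-learn's implementation.

```python
import csv
import io

# Invented sample standing in for dataset.csv (assumed: two columns, no header).
sample_csv = (
    '"I love this, really",positive\n'
    '"Not good",negative\n'
    '"Great value",positive\n'
)

texts = []
labels = []
# csv.reader handles quoted fields, including commas inside the text column.
for row in csv.reader(io.StringIO(sample_csv)):
    if len(row) != 2:
        continue  # skip blank or malformed lines
    texts.append(row[0])
    labels.append(row[1])  # a single string; row[1:] would be a list

print(labels)  # ['positive', 'negative', 'positive']

# Stand-in for the step that failed in the traceback: it succeeds now
# because every label is a hashable string.
classes = sorted(set(labels))
encoded = [classes.index(label) for label in labels]
print(classes)  # ['negative', 'positive']
print(encoded)  # [1, 0, 1]
```

With the real file, open('dataset.csv', newline='') can take the place of io.StringIO(sample_csv). Once labels holds strings, encoder.fit_transform(train_y) should no longer raise this TypeError. Two other details in the question's code are worth checking: calling encoder.transform(valid_y) instead of a second fit_transform keeps the label-to-integer mapping consistent across both splits, and the vectorizers are fitted on trainDF['texts'] although the column was created as trainDF['text'].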
Tags: python pandas machine-learning scikit-learn nlp