【Posted】: 2018-09-03 13:31:38
【Question】:
I am using this code:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
import re
# Vectorize using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tv=TfidfVectorizer(max_df=0.5,min_df=2,stop_words='english')
# Load the data files
df=pd.read_json('train.json',orient='columns')
test_df=pd.read_json('test.json',orient='columns')
df['seperated_ingredients'] = df['ingredients'].apply(','.join)
test_df['seperated_ingredients'] = test_df['ingredients'].apply(','.join)
df['seperated_ingredients']=df['seperated_ingredients'].str.lower()
test_df['seperated_ingredients']=test_df['seperated_ingredients'].str.lower()
cuisines={'thai':0,'vietnamese':1,'spanish':2,'southern_us':3,'russian':4,'moroccan':5,'mexican':6,'korean':7,'japanese':8,'jamaican':9,'italian':10,'irish':11,'indian':12,'greek':13,'french':14,'filipino':15,'chinese':16,'cajun_creole':17,'british':18,'brazilian':19 }
df.cuisine= [cuisines[item] for item in df.cuisine]
# Preprocessing
ho=df['seperated_ingredients']
ho=ho.replace(r'#([^\s]+)', r'\1', regex=True)
ho=ho.replace('\'"',regex=True)
ho=tv.fit_transform(ho)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(ho,df['cuisine'],random_state=0)
from sklearn.linear_model import LogisticRegression
clf= LogisticRegression(penalty='l1')
clf.fit(X_train, y_train)
clf.score(X_test,y_test)
from sklearn.linear_model import LogisticRegression
clf1= LogisticRegression(penalty='l1')
clf1.fit(ho,df['cuisine'])
hs=test_df['seperated_ingredients']
hs=hs.replace(r'#([^\s]+)', r'\1', regex=True)
hs=hs.replace('\'"',regex=True)
hs=tv.fit_transform(hs)
ss=clf1.predict(hs) # this line raises the error
I get the above error when predicting. Does anyone know what I am doing wrong?
【Discussion】:
-
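The error says that something is expected to have X units, but you are passing it something with Y units. Could you post the code where you load the data and so on?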
-
df=pd.read_json('train.json',orient='columns') test_df=pd.read_json('test.json',orient='columns')
-
I do get an accuracy score.
-
Where are the lines that create X_train, y_train, ho and the rest? Please add the complete program and say which line causes the problem.
-
Updated, please check.
Tags: python-3.x tf-idf
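The error text itself is not included in the post, so the following is only a sketch. It assumes the error is a feature-count mismatch at `clf1.predict(hs)`, as the first comment suggests. In the posted code, `tv.fit_transform(hs)` refits the vectorizer on the test set, which builds a new vocabulary whose columns no longer line up with the ones `clf1` was trained on. Calling `tv.transform(hs)` instead reuses the vocabulary learned from the training data. The toy ingredient strings and variable names below are hypothetical stand-ins for `train.json` and `test.json`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the train/test JSON files.
train_docs = [
    "soy sauce,ginger,rice",
    "rice,soy sauce,sesame oil",
    "tomato,basil,olive oil",
    "olive oil,tomato,garlic",
]
train_labels = [0, 0, 1, 1]
test_docs = ["ginger,rice,scallion", "basil,garlic,pasta,tomato"]

# Fit the vectorizer once, on the training data only.
tv = TfidfVectorizer()
X_train = tv.fit_transform(train_docs)
clf = LogisticRegression().fit(X_train, train_labels)

# Refitting on the test set learns a different vocabulary, so the
# resulting matrix has a different number (and meaning) of columns.
X_refit = TfidfVectorizer().fit_transform(test_docs)

# transform() reuses the training vocabulary; unseen words are dropped.
X_test = tv.transform(test_docs)

n_train_features = X_train.shape[1]
n_refit_features = X_refit.shape[1]
n_test_features = X_test.shape[1]

predictions = clf.predict(X_test)
print(n_train_features, n_refit_features, n_test_features)
```

Applied to the posted code, this would mean replacing `hs=tv.fit_transform(hs)` with `hs=tv.transform(hs)`, provided the mismatch is indeed the cause.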