【Posted】: 2019-11-04 17:34:56
【Question】:
So far I have the following code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df_train = pd.read_csv('uc_data_train.csv')
del df_train['Unnamed: 0']
temp = df_train['size_womenswear']
del df_train['size_womenswear']
df_train['size_womenswear'] = temp
df_train['count'] = 1
print(df_train.head())
print(df_train.dtypes)
# Determine the number of unique categories and the number of cases in each category
print(df_train[['size_womenswear', 'count']].groupby('size_womenswear').count())
del df_train['count']
df_test = pd.read_csv('uc_data_test.csv')
del df_test['Unnamed: 0']
print(df_test.head())
print(df_test.dtypes)
df_train.drop(['customer_id','socioeconomic_status','brand','socioeconomic_desc','order_method',
'first_order_channel','days_since_first_order','total_number_of_orders', 'return_rate'], axis=1, inplace=True)
LE = preprocessing.LabelEncoder() # Create label encoder
df_train['size_womenswear'] = LE.fit_transform(df_train['size_womenswear'])
print(df_train.head())
print(df_train.dtypes)
x = df_train.iloc[:, :-1].values # Assign independent values (every column except the last)
y = df_train.iloc[:,-1].values # and dependent values
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.25, random_state = 0) # Train on 75% of the data, test on the remaining 25%
model = GaussianNB()
model.fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
#print(yPredicted)
print('Accuracy: ', accuracy_score(yTest, yPredicted))
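I'm not sure how to include the data I'm using, but I'm trying to predict 'size_womenswear'. There are 8 different sizes to predict, which I have label-encoded, and I have moved this column to the end of the DataFrame. So y is the dependent variable and x holds the independent variables (all other columns).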
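For reference, the split into x and y could also be done by column name rather than by position. This is only a minimal sketch: the toy DataFrame and the 'age' and 'spend' columns are hypothetical stand-ins, and only 'size_womenswear' comes from my data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; only 'size_womenswear' is a real column name from the post.
df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "size_womenswear": ["12", "16", "12", "20"],
    "spend": [120.0, 80.5, 60.0, 200.0],
})

# Select the target by name, so its position in the frame does not matter.
X = df.drop(columns="size_womenswear").values

# Encode the size labels as integers (classes are sorted before encoding).
encoder = LabelEncoder()
y = encoder.fit_transform(df["size_womenswear"])

print(X.shape)
print(y)
print(encoder.classes_)
```

Selecting by name would make the step of moving the column to the end unnecessary, and `encoder.inverse_transform` can map predictions back to the original size labels.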
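I'm using a Gaussian naive Bayes classifier to classify the 8 different sizes, training on 75% of the data and then testing on the remaining 25%. The results are not very good.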
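To judge whether a given accuracy is actually poor, it may help to compare it with a majority-class baseline on the same split. The sketch below assumes synthetic data from `make_classification` with 8 imbalanced classes as a stand-in for my real data; the class weights and feature counts are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real data: 8 imbalanced classes (an assumption).
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=6, n_redundant=0,
    n_classes=8, n_clusters_per_class=1,
    weights=[0.4, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05],
    random_state=0,
)

# Same 75/25 split as in the post, stratified so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = GaussianNB().fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Majority-class baseline: {baseline_acc:.3f}")
print(f"GaussianNB:              {model_acc:.3f}")
```

If one size dominates the data, the baseline alone can reach a high accuracy, so the gap between the two numbers is more informative than the raw score.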
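I don't understand why my accuracy is only 61% when working with 80,000 rows. I'm very new to machine learning and would appreciate any help. Is there a better approach than Gaussian naive Bayes in this case?
【Discussion】:
Tags: python scikit-learn classification sklearn-pandas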
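One way to explore whether another model does better might be to cross-validate a few candidates side by side. This is a rough sketch on synthetic data, not a claim about which model wins on my real data; the choice of candidates (logistic regression and a random forest) and all parameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 8-class stand-in for the real data (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=6, n_redundant=0,
    n_classes=8, n_clusters_per_class=1, random_state=0,
)

# Candidate models; the scaler is part of the pipeline so it is refit per fold.
candidates = {
    "GaussianNB": GaussianNB(),
    "LogisticRegression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {}
for name, estimator in candidates.items():
    cv_scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    scores[name] = cv_scores.mean()
    print(f"{name}: {scores[name]:.3f}")
```

Cross-validation averages over several splits, so the comparison depends less on one particular 75/25 split than a single `train_test_split` would.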