【Posted】: 2019-03-27 15:42:01
【Question】:
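I am practicing linear regression by analyzing the Google Play Store apps data file (the CSV is on Kaggle) to predict app ratings.
After cleaning the data and running a KNeighborsRegressor model, the accuracy and R-squared come out too low, and I don't know why.
However, the differences between the predictions and y_test are not large, and the MSE is very low.
I think there is a mistake somewhere, and I hope you can help me fix it. I am expecting an accuracy of around 90%.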
import re
import sys
import time
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
df = pd.read_csv('googleplaystore.csv')
df['Rating'] = df['Rating'].fillna(df['Rating'].median())
replaces = [u'\u00AE', u'\u2013', u'\u00C3', u'\u00E3', u'\u00B3', '[', ']', "'"]
for i in replaces:
    df['Current Ver'] = df['Current Ver'].astype(str).apply(lambda x : x.replace(i, ''))
regex = [r'[-+|/:/;(_)@]', r'\s+', r'[A-Za-z]+']
for j in regex:
    df['Current Ver'] = df['Current Ver'].astype(str).apply(lambda x : re.sub(j, '0', x))
df['Current Ver'] = df['Current Ver'].astype(str).apply(lambda x : x.replace('.', ',',1).replace('.', '').replace(',', '.',1)).astype(float)
df['Current Ver'] = df['Current Ver'].fillna(df['Current Ver'].median())
df.drop([10472], axis = 0, inplace = True)
le = preprocessing.LabelEncoder()
df['App'] = le.fit_transform(df['App'])
category_list = df['Category'].unique().tolist()
category_list = ['cat_' + word for word in category_list]
df = pd.concat([df, pd.get_dummies(df['Category'], prefix='cat')], axis=1)
df['Genres'] = df['Genres'].str.split(';').str[0]
df['Genres'].replace('Music & Audio', 'Music', inplace =True)
le = preprocessing.LabelEncoder()
df['Genres'] = le.fit_transform(df['Genres'])
le = preprocessing.LabelEncoder()
df['Content Rating'] = le.fit_transform(df['Content Rating'])
df['Price'] = df['Price'].apply(lambda x : x.strip('$'))
df['Installs'] = df['Installs'].apply(lambda x : x.strip('+').replace(',', ''))
df['Type'] = pd.get_dummies(df['Type'])
def change_size(size):
    if 'M' in size:
        x = size[:-1]
        x = float(x)*1000000
        return(x)
    elif 'k' == size[-1:]:
        x = size[:-1]
        x = float(x)*1000
        return(x)
    else:
        return None
df['Size'] = df['Size'].apply(change_size)
df['Size'] = df['Size'].fillna(value=df['Size'].median(), axis = 0)
df['new'] = pd.to_datetime(df['Last Updated'])
df['lastupdate'] = (df['new'] - df['new'].max()).dt.days
features = ['App', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'lastupdate','Content Rating', 'Genres', 'Current Ver']
features.extend(category_list)
X = df[features]
y = df['Rating']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 101)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
model = KNeighborsRegressor(n_neighbors=28)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = model.score(X_test,y_test)
'Accuracy: ' + str(np.round(accuracy*100, 2)) + '%'
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
result = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
result
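For reference, here is a minimal sketch of how the metrics above relate to each other. It uses synthetic data rather than the Kaggle file, so the feature matrix, the target centred near 4.2, and the noise level are all assumptions made up for illustration. For a regressor such as KNeighborsRegressor, `model.score` returns R-squared rather than a classification accuracy, and R-squared equals 1 - MSE / Var(y_test). When the target has a narrow spread, the MSE can be small while R-squared stays close to zero.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the app data (assumption, not the real dataset):
# a target clustered tightly around 4.2 and only weakly tied to the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = 4.2 + 0.05 * X[:, 0] + rng.normal(scale=0.4, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

# Scaling and the regressor are chained so the scaler is fitted on training data only.
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=28))
model.fit(X_train, y_train)          # fit first ...
predictions = model.predict(X_test)  # ... then predict

mse = mean_squared_error(y_test, predictions)
r2 = model.score(X_test, y_test)          # R-squared for regressors
r2_from_mse = 1.0 - mse / np.var(y_test)  # same quantity, derived from the MSE

print("MSE:", mse)
print("R^2 from score():", r2)
print("R^2 from 1 - MSE/Var(y):", r2_from_mse)
```

The two R-squared values printed at the end agree, which shows that a low MSE by itself does not imply a high score: the MSE has to be low relative to the variance of the target.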
【Comments】:
Tags: python regression linear-regression neighbours