【Posted】: 2020-08-15 16:02:03
【Question】:
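This is my dataset; Median_Price is my target variable.
The RMSE values before and after parameter tuning with GridSearchCV are included in the code. How can I reduce the RMSE for my dataset?
The dataset can be downloaded from Google Drive here, and I have also added a picture of the dataset for reference.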
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
from io import StringIO
from sklearn import metrics
%matplotlib inline
dataset = pd.read_csv('E:/MMU/FYP/Property Recommendation System/Final Dataset/median/Top5_median.csv')
dataset['Median_Price'] = dataset['Median_Price'].str.replace(',', '').astype(int)
dataset['population'] = dataset['population'].apply(np.int64)
dataset['Median_Price'] = dataset['Median_Price'].apply(np.int64)
dataset['Type1'] = pd.to_numeric(dataset['Type1'], errors='coerce')
dataset['Type2'] = pd.to_numeric(dataset['Type2'], errors='coerce')
dataset = dataset.replace(np.nan, 0, regex=True)
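X = dataset[['Type1','Type2','Filed Transactions', 'population', 'Jr Secure Technology']]
y = dataset['Median_Price']
# NOTE: the train/test split is missing from the post; test_size and random_state here are assumed values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)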
from sklearn.model_selection import cross_val_score
# function to get cross validation scores
def get_cv_scores(model):
    scores = cross_val_score(model,
                             X_train,
                             y_train,
                             cv=5,
                             scoring='neg_mean_squared_error')
    print('CV Mean: ', np.mean(scores))
    print('STD: ', np.std(scores))
    print('\n')
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# get cross val scores
get_cv_scores(regressor)
from sklearn.linear_model import Ridge
# Train model with default alpha=1
ridge = Ridge(alpha=1).fit(X_train, y_train)
# get cross val scores
get_cv_scores(ridge)
# find optimal alpha with grid search
alpha = [9, 10, 11, 12, 13, 14, 15, 100, 1000]
param_grid = dict(alpha=alpha)
grid = GridSearchCV(estimator=ridge, param_grid=param_grid, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)
print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)
### Before GridSearch RMSE: 487656.3828
### After GridSearch RMSE: 453873.438
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df
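# NOTE: y_pred is not defined in the post; predicting with the fitted LinearRegression is an assumption
y_pred = regressor.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))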
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
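One thing that may help when comparing the numbers: the `Best Score` printed by the grid search is a negated mean squared error (because of `scoring='neg_mean_squared_error'`), so it is not on the same scale as the RMSE values quoted in the comments. A minimal sketch of the conversion follows, using only the standard library. The helper name `rmse_from_neg_mse` and the sample scores are hypothetical and are not taken from the question's data.

```python
import math

def rmse_from_neg_mse(neg_mse_scores):
    """Convert scikit-learn style negated-MSE scores into per-fold RMSE values."""
    # scoring='neg_mean_squared_error' yields -MSE, so flip the sign before taking the root.
    return [math.sqrt(-score) for score in neg_mse_scores]

# Hypothetical per-fold scores, for illustration only.
example_scores = [-4.0, -9.0, -16.0]
fold_rmse = rmse_from_neg_mse(example_scores)
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(fold_rmse, mean_rmse)  # [2.0, 3.0, 4.0] 3.0
```

Applied to the code above, something like `math.sqrt(-grid_result.best_score_)` would give the root of the mean cross-validated MSE, which is close to, but not the same as, the mean of the per-fold RMSE values. Recent scikit-learn versions also accept `scoring='neg_root_mean_squared_error'`, which may be simpler if it is available in your version.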
【Discussion】:
Tags: python data-science linear-regression data-modeling