In linear regression modeling, why is my RMSE value so large? Answers

【Question title】: In Linear Regression Modeling why my RMSE Value is so large?
【Posted】: 2020-08-15 16:02:03
【Question】:

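This is my dataset; Median_Price is my target variable. The RMSE values before and after parameter tuning with GridSearchCV are included in the code. How can I reduce the RMSE for my dataset?

The dataset can be downloaded from Google Drive here, and I have also added a picture of the dataset for reference.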

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction import DictVectorizer
from io import StringIO
from sklearn import metrics
%matplotlib inline

dataset = pd.read_csv('E:/MMU/FYP/Property Recommendation System/Final Dataset/median/Top5_median.csv')

dataset['Median_Price'] = dataset['Median_Price'].str.replace(',', '').astype(int)

dataset['population'] = dataset['population'].apply(np.int64)
dataset['Median_Price'] = dataset['Median_Price'].apply(np.int64)

dataset['Type1'] = pd.to_numeric(dataset['Type1'], errors='coerce')
dataset['Type2'] = pd.to_numeric(dataset['Type2'], errors='coerce')
dataset = dataset.replace(np.nan, 0, regex=True)

X = dataset[['Type1','Type2','Filed Transactions', 'population', 'Jr Secure Technology']]

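y = dataset['Median_Price']

# Hold out a test set. The original post never shows this step, so the
# test_size and random_state values here are assumptions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)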

from sklearn.model_selection import cross_val_score

# function to get cross validation scores
def get_cv_scores(model):
    scores = cross_val_score(model,
                             X_train,
                             y_train,
                             cv=5,
                             scoring='neg_mean_squared_error')

    print('CV Mean: ', np.mean(scores))
    print('STD: ', np.std(scores))
    print('\n')

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# get cross val scores
get_cv_scores(regressor)

from sklearn.linear_model import Ridge

# Train model with default alpha=1
ridge = Ridge(alpha=1).fit(X_train, y_train)

# get cross val scores
get_cv_scores(ridge)

# find optimal alpha with grid search
alpha = [9, 10, 11, 12, 13, 14, 15, 100, 1000]
param_grid = dict(alpha=alpha)
grid = GridSearchCV(estimator=ridge, param_grid=param_grid, scoring='neg_mean_squared_error', verbose=1, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)
print('Best Score: ', grid_result.best_score_)
print('Best Params: ', grid_result.best_params_)
### Before GridSearch RMSE: 487656.3828
### After GridSearch RMSE: 453873.438


coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_df

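# Predict on the held-out test set. y_pred is never defined in the original post;
# this assumes the plain LinearRegression model is the one being evaluated.
y_pred = regressor.predict(X_test)

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))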

Dataset CSV download link

【Comments】:

Tags: python data-science linear-regression data-modeling

【Solution 1】:

Well, the RMSE does seem to drop somewhat after using GridSearchCV.
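One thing worth checking when comparing these numbers: with scoring='neg_mean_squared_error', best_score_ is a negated MSE, not an RMSE, so it has to be converted before it can be compared with an RMSE. The sketch below shows two ways to get an RMSE out of a grid search. It runs on synthetic data from make_regression (an assumption, since the original CSV is not reproduced here), and the names X_demo, y_demo, grid_mse and grid_rmse are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): the question's CSV is not available here.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Same alpha grid as in the question.
param_grid = {"alpha": [9, 10, 11, 12, 13, 14, 15, 100, 1000]}

# Option 1: score with negated MSE, then take sqrt(-score) to get an RMSE.
grid_mse = GridSearchCV(Ridge(), param_grid, scoring="neg_mean_squared_error", cv=5)
grid_mse.fit(X_demo, y_demo)
rmse_from_mse = np.sqrt(-grid_mse.best_score_)

# Option 2: score with negated RMSE directly, then just flip the sign.
grid_rmse = GridSearchCV(Ridge(), param_grid, scoring="neg_root_mean_squared_error", cv=5)
grid_rmse.fit(X_demo, y_demo)
rmse_direct = -grid_rmse.best_score_

print("RMSE via sqrt(-MSE score):", rmse_from_mse)
print("RMSE via RMSE scorer:     ", rmse_direct)
```

The two values are usually close but not identical, because the first is the square root of the mean fold MSE and the second is the mean of the per-fold RMSEs.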

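You could try feature selection, feature engineering, scaling the data, transforming variables, or some other algorithms; any of these may help reduce the RMSE to some extent.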
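As a rough illustration of two of these suggestions (scaling the features and transforming the target), here is a minimal sketch that wraps a scaled Ridge model in TransformedTargetRegressor so the model is fit on log prices while the RMSE is still reported in the original units. The data is synthetic and deliberately right-skewed (an assumption, standing in for Median_Price), and whether a log transform helps on the real dataset is something only cross-validation on that data can tell.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): positive, right-skewed "prices".
X_demo, y_raw = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
y_demo = np.exp(y_raw / y_raw.std()) * 100_000

# Scale the features, fit Ridge on log1p(y), and map predictions back with expm1.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    func=np.log1p,
    inverse_func=np.expm1,
)

# The scorer returns negated RMSE, so flip the sign.
scores = cross_val_score(model, X_demo, y_demo, cv=5,
                         scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores
print("CV RMSE per fold (original price units):", rmse_per_fold)
print("Mean CV RMSE:", rmse_per_fold.mean())
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking test-fold statistics into the model.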

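Also, the RMSE value depends entirely on the context of the data, since it is expressed in the same units as the target. It seems your data points are spread far apart from each other, which gives you a very high RMSE value. The techniques I mentioned above can only help reduce the RMSE to a limited extent.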
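One way to put an RMSE in context is to compare it with the scale of the target and with a trivial baseline that always predicts the training mean. The sketch below does this on synthetic data rescaled to price-like magnitudes (an assumption; the helper rmse and the variable names are made up for illustration).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), shifted and scaled to look like prices.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
y_demo = y_demo * 5_000 + 1_000_000

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Baseline that ignores the features and always predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr, y_tr)

rmse_baseline = rmse(y_te, baseline.predict(X_te))
rmse_model = rmse(y_te, model.predict(X_te))

print("Baseline RMSE:", rmse_baseline)
print("Model RMSE:   ", rmse_model)
print("Model RMSE / mean(target):", rmse_model / y_te.mean())
print("Model RMSE / std(target): ", rmse_model / y_te.std())
```

An RMSE in the hundreds of thousands can be acceptable if prices are in the millions, and poor if it is barely below the baseline; the ratios make that easier to judge than the raw number.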

【Comments】: