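【Question title】:How to get early stopping for lasso regression
【Posted】:2022-01-17 20:52:13
【Question description】:

I have a question. Is there an option for early stopping? On a plot I can see that the model starts to overfit after a while, so I would like to get the best-performing model.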

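import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

dfListingsFeature_regression = pd.read_csv(r"https://raw.githubusercontent.com/Coderanker3/dataset4/main/listings_cleaned.csv")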
d = {True: 1, False: 0, np.nan : np.nan} 
dfListingsFeature_regression['host_is_superhost'] = dfListingsFeature_regression[
                                                             'host_is_superhost'].map(d).astype('int')

X = dfListingsFeature_regression.drop(columns=['host_id', 'id', 'price']) # Features
y = dfListingsFeature_regression['price'] # Target variable
print(dfListingsFeature_regression.shape)


steps = [('feature_selection', SelectFromModel(estimator=LogisticRegression(max_iter=1000))),
         ('lasso', Lasso(alpha=0.1))]

pipeline = Pipeline(steps) 

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=30)


parameteres = { }

grid = GridSearchCV(pipeline, param_grid=parameteres, cv=5)                
grid.fit(X_train, y_train)
                    
print("score = %3.2f" %(grid.score(X_test,y_test)))
print('Training set score: ' + str(grid.score(X_train,y_train)))
print('Test set score: ' + str(grid.score(X_test,y_test)))

# Prediction
y_pred = grid.predict(X_test)

print("RMSE Val:", metrics.mean_squared_error(y_test, y_pred, squared=False))

y_train_predict = grid.predict(X_train)
print("Train:" , metrics.mean_squared_error(y_train, y_train_predict , squared=False))

r2 = metrics.r2_score(y_test, y_pred)
print(r2)

【Comments】:

    Tags: python scikit-learn regression lasso-regression early-stopping


    【Solution 1】:

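    I think what you mean is applying regularization. In that case, L1 regularization (Lasso regression) reduces the chance of overfitting.

    When you have many features, this regularization strategy also acts as a form of "feature selection", because it shrinks the coefficients of uninformative features to zero.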
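
To make the shrink-to-zero behaviour concrete, here is a minimal sketch on synthetic data. The dataset shape (100 features, 10 of them informative) and the three alpha values are assumptions picked for illustration, not something taken from the question's data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (assumed for illustration): 100 features, only 10 informative
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=4, random_state=0)

# Count how many coefficients survive at each regularization strength
nonzero_counts = {}
for alpha in (0.01, 1.0, 10.0):
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha}: {nonzero_counts[alpha]} non-zero coefficients")
```

On data like this, the number of non-zero coefficients should drop as alpha grows, which is the feature-selection effect described above.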

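    In this case, you want to find the alpha value that gives the best score on the test dataset. You can also plot the gap between the train and test scores to guide your decision.

    The larger the alpha value, the stronger the regularization. See the code example below.

    Full example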

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split, cross_validate
    from sklearn.linear_model import Lasso
    
    import numpy as np
    import matplotlib.pyplot as plt
    
    X, y = make_regression(noise=4, random_state=0)
    
    # Alphas to search over
    alphas = list(np.linspace(2e-2, 1, 20))
    
    results = {}
    
    for alpha in alphas:
        
        print(f'Fitting Lasso(alpha={alpha})')
        
        estimator = Lasso(alpha=alpha, random_state=0)
    
        cv_results = cross_validate(
            estimator, X, y, cv=5, return_train_score=True, scoring='neg_root_mean_squared_error'
        )
        
        # Compute the average metric value (negated to get a positive RMSE)
        avg_train_score = np.mean(cv_results['train_score']) * -1
        
        avg_test_score = np.mean(cv_results['test_score']) * -1
        
        results[alpha] = (avg_train_score, avg_test_score)
    
    train_scores = [v[0] for v in results.values()]
    test_scores = [v[1] for v in results.values()]
    gap_scores = [v[1] - v[0] for v in results.values()]
    
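    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
    
    # Left panel: average train and test RMSE for each alpha
    ax1.set_title('Alpha values vs Avg score')
    ax1.plot(alphas, train_scores, label='Train Score')
    ax1.plot(alphas, test_scores, label='Test Score')
    ax1.set_xlabel('alpha')
    ax1.set_ylabel('RMSE')
    ax1.legend()
    
    # Right panel: test RMSE minus train RMSE
    ax2.set_title('Train/Test Score Gap')
    ax2.plot(alphas, gap_scores)
    ax2.set_xlabel('alpha')
    
    plt.show()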
    

    Note that the model overfits when alpha is close to zero and underfits as alpha grows. Around alpha=0.4, however, we find a balance between underfitting and overfitting the data.
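
If you would rather not loop by hand, the same search can probably be delegated to `GridSearchCV`, which is what the empty parameter grid in the question was set up for. This is a sketch under assumptions: the `StandardScaler` step and the step names `scaler` and `lasso` are my own additions, and the alpha range simply mirrors the loop above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(noise=4, random_state=0)

# Step names are illustrative; they define the "<step>__<param>" grid keys
pipeline = Pipeline([
    ("scaler", StandardScaler()),      # assumed preprocessing step
    ("lasso", Lasso(max_iter=10000)),
])

# Same alpha range as the manual loop
param_grid = {"lasso__alpha": np.linspace(2e-2, 1, 20)}

grid = GridSearchCV(pipeline, param_grid=param_grid, cv=5,
                    scoring="neg_root_mean_squared_error")
grid.fit(X, y)

best_alpha = grid.best_params_["lasso__alpha"]
best_rmse = -grid.best_score_  # the scorer is negated, so flip the sign
print(f"best alpha: {best_alpha:.3f}, CV RMSE: {best_rmse:.3f}")
```

With a pipeline, the grid key has to be prefixed with the step name (`lasso__alpha` here), otherwise `GridSearchCV` will not find the parameter. After fitting, `grid.best_estimator_` is the pipeline refit with the selected alpha.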

    【Discussion】:

    • @Test Did this answer help with your question? I think what you mean in general is regularization.