【Posted】: 2021-09-12 17:41:10
【Question】:
I have a dataframe that looks like this (the real one is obviously much larger):
id    points  isAvailable  frequency  Score
abc1  325     0            93         0.01
def2  467     1            80         0.59
ghi3  122     1            90         1
jkl4  546     1            84         0
mno5  355     0            93         0.99
I want to see how strongly points, isAvailable, and frequency influence Score. I would like to use a random forest, as in this example:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
#from sklearn.inspection import permutation_importance
#import shap
from matplotlib import pyplot as plt
plt.rcParams.update({'figure.figsize': (12.0, 8.0)})
plt.rcParams.update({'font.size': 14})
list_of_columns = ['points','isAvailable', 'frequency']
X = df[list_of_columns]
target_column = 'Score'
y = df[target_column]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=12)
rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)
rf.feature_importances_ #the array below is the output
>>> array([0.44326132, 0.01666047, 0. , 0.5400782 ])
plt.barh(df.columns, rf.feature_importances_)
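On the last line I get the following error: ValueError: shape mismatch: objects cannot be broadcast to a single shape. Should I have created these columns at the very beginning? Is there something wrong with the (larger) data?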
【Discussion】:
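One likely cause, sketched here under the assumption that df has exactly the five columns shown in the sample: plt.barh receives df.columns (every column, including id and Score) as bar labels, while rf.feature_importances_ holds one value per column of X only, so the two lengths cannot match. The toy DataFrame, the random_state on the forest, and the Agg backend are assumptions added so the sketch runs on its own; the model is fit on all five rows without the train/test split. Labeling the bars with X.columns keeps both sequences the same length:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Toy data copied from the sample rows in the question.
df = pd.DataFrame({
    "id": ["abc1", "def2", "ghi3", "jkl4", "mno5"],
    "points": [325, 467, 122, 546, 355],
    "isAvailable": [0, 1, 1, 1, 0],
    "frequency": [93, 80, 90, 84, 93],
    "Score": [0.01, 0.59, 1.0, 0.0, 0.99],
})

list_of_columns = ["points", "isAvailable", "frequency"]
X = df[list_of_columns]
y = df["Score"]

# random_state is fixed here only to make the sketch reproducible.
rf = RandomForestRegressor(n_estimators=100, random_state=12)
rf.fit(X, y)

# df.columns has one label per DataFrame column (5 here), but the model
# produces one importance per fitted feature (3 here).
n_labels_all = len(df.columns)
n_importances = len(rf.feature_importances_)

# Pair each importance with the feature it belongs to, then plot.
importances = pd.Series(rf.feature_importances_, index=X.columns)
fig, ax = plt.subplots()
bars = ax.barh(list(importances.index), importances.values)
ax.set_xlabel("Feature importance")
```

If the importances array in the real run has a different length than list_of_columns, it may be worth checking X.shape right before rf.fit, since the array always has one entry per column the model was actually trained on.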
Tags: python pandas dataframe matplotlib scikit-learn