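【Posted】: 2019-06-20 15:38:15
【Question】:
I am using the sklearn-pandas DataFrameMapper inside a sklearn pipeline. To evaluate how much each feature contributes in a feature-union pipeline, I would like to inspect the coefficients of the estimator (logistic regression). In the following code example, the three text columns a, b and c are vectorized and selected for X_train: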
import pandas as pd
import numpy as np
import pickle
from sklearn_pandas import DataFrameMapper
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
np.random.seed(1)
data = pd.read_csv('https://pastebin.com/raw/WZHwqLWr')
#data.columns
X = data.copy()
y = data.result
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
mapper = DataFrameMapper([
    ('a', CountVectorizer()),
    ('b', CountVectorizer()),
    ('c', CountVectorizer())
])
pipeline = Pipeline([
    ('featurize', mapper),
    ('clf', LogisticRegression(random_state=1))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(abs(pipeline.named_steps['clf'].coef_))
#array([[0.3567311 , 0.3567311 , 0.46215153, 0.10542043, 0.3567311 ,
# 0.46215153, 0.46215153, 0.3567311 , 0.3567311 , 0.3567311 ,
# 0.3567311 , 0.46215153, 0.46215153, 0.3567311 , 0.46215153,
# 0.3567311 , 0.3567311 , 0.3567311 , 0.3567311 , 0.46215153,
# 0.46215153, 0.46215153, 0.3567311 , 0.3567311 ]])
print(len(pipeline.named_steps['clf'].coef_[0]))
#24
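In a typical analysis with several features, the coefficient array has the same length as the number of features. With the DataFrameMapper, however, the coefficient matrix is larger than the three input columns.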
a) How should the total of 24 coefficients in the case above be interpreted? b) What is the best way to access the coef_ values for each feature ("a", "b", "c")?
Desired output:
a: coef_score (float)
b: coef_score (float)
c: coef_score (float)
Thanks!
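One possible reading of the 24 values, sketched under stated assumptions: each CountVectorizer emits one output column per distinct token it sees, and the mapper concatenates the three blocks in the order a, b, c, so coef_ holds one weight per token rather than one per DataFrame column. The pure-Python sketch below mimics CountVectorizer's default tokenization (lowercasing, tokens of two or more word characters, columns sorted alphabetically) and then sums the absolute coefficients inside each block. The toy texts, the toy coefficients and the helper names `vocabulary` and `coefficients_per_column` are hypothetical, and summing absolute values is only one heuristic for a per-column score. On the real pipeline the block sizes would come from each fitted vectorizer's `vocabulary_` attribute; reaching those vectorizers through something like `mapper.built_features` is an assumption about sklearn-pandas that should be checked against the installed version.

```python
import re

# CountVectorizer's default token pattern: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def vocabulary(texts):
    """Return the sorted distinct tokens of a text column.

    Mimics CountVectorizer defaults (lowercase, sorted feature order);
    hypothetical helper, not part of scikit-learn.
    """
    tokens = set()
    for text in texts:
        tokens.update(TOKEN_PATTERN.findall(text.lower()))
    return sorted(tokens)


def coefficients_per_column(coefs, vocabularies):
    """Split a flat coefficient list into per-column blocks and sum |coef|.

    vocabularies maps column name -> token list, in the same order the
    mapper lists its columns (assumed to be the concatenation order).
    """
    expected = sum(len(vocab) for vocab in vocabularies.values())
    if len(coefs) != expected:
        raise ValueError(f"expected {expected} coefficients, got {len(coefs)}")
    scores = {}
    start = 0
    for column, vocab in vocabularies.items():
        block = coefs[start:start + len(vocab)]
        scores[column] = sum(abs(c) for c in block)
        start += len(vocab)
    return scores


# Hypothetical toy data, NOT the pastebin dataset from the question.
toy_columns = {
    "a": ["red apple", "green apple"],
    "b": ["fast car"],
    "c": ["blue sky", "grey sky", "blue sea"],
}
vocabularies = {name: vocabulary(texts) for name, texts in toy_columns.items()}

# One made-up coefficient per token: 3 (a) + 2 (b) + 4 (c) = 9 values.
toy_coefs = [0.5, -0.25, 0.25, 1.0, -1.0, 0.125, 0.125, -0.5, 0.25]

scores = coefficients_per_column(toy_coefs, vocabularies)
for column, score in scores.items():
    print(f"{column}: {score}")
```

If this reading holds for the pipeline above, the three vocabulary sizes should add up to 24, and the same grouping would give one score per column in the desired a/b/c format.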
【Comments】:
Tags: scikit-learn logistic-regression sklearn-pandas coefficients