【Posted】: 2021-06-03 13:58:07
【Question】:
Suppose we have a very simple (and silly) LogisticRegression model...
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.feature_extraction.text import Splitter
import pandas as pd
import numpy as np
data_col1 = fetch_20newsgroups(
    subset='train',
    categories=['alt.atheism'],
    remove=('headers', 'footers', 'quotes')
)
data_col2 = fetch_20newsgroups(
    subset='train',
    categories=['sci.space'],
    remove=('headers', 'footers', 'quotes')
)
data = pd.DataFrame({
    "col1": data_col1.data[:100],
    "col2": data_col2.data[:100]
})
labels = np.random.randint(2, size=100)
train_data, test_data, train_labels, test_labels = train_test_split(
    data,
    labels,
    test_size=0.1,
    random_state=0,
    shuffle=False
)
def title_features_pipeline():
    return Pipeline([
        ('features', TfidfVectorizer(
            analyzer='word',
            stop_words='english',
            use_idf=True,
            # max_df=0.1,
            min_df=0.01,
            norm=None,
            tokenizer=Splitter()
        )),
    ], verbose=True)
pipeline = Pipeline([
    ('features', ColumnTransformer(
        transformers=[
            ('col1-features', title_features_pipeline(), "col1"),
            ('col2-features', "drop", "col2")
        ],
        remainder="drop",
    )),
    ('regression', LogisticRegression(
        multi_class='ovr',
        max_iter=1000
    ))
], verbose=True)
pipeline.fit(train_data, train_labels)
pred = pipeline.predict(test_data)
print('ROC AUC = {:.3f}'.format(roc_auc_score(test_labels, pred)))
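I've spent a lot of time going through Stack Overflow examples and GitHub code snippets, but I can't get anything to work for my particular case, and it's driving me crazy. I'm sure I'm just doing something wrong!
My goal is to plot the decision boundary of this LogisticRegression classifier, to see which class each document falls into and the boundary that separates the two classes on the plot.
Along the way, I'd like to understand what exactly LogisticRegression does with the vectors coming from TfidfVectorizer. All the examples I've looked at so far plot the decision boundary on the assumption that only simple scalars go into the classifier, but in this case we have long (tf-idf) vectors... I don't understand how a vector gets turned into a single value that can be shown on a graph (is it the sum of all the scores in the vector, or something else?).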
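One possible way to get such a plot, sketched under some loud assumptions: the tf-idf vectors are squashed to two dimensions with TruncatedSVD, and a second LogisticRegression is fitted on those two components so that its boundary is a straight line. The toy corpus, the labels, and the names `boundary_y` and `plot_boundary` are hypothetical stand-ins, not part of the code above. The line drawn belongs to the 2-D model, not to the classifier trained on the full tf-idf space, and the formula assumes the second weight is non-zero.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for the newsgroup posts.
docs = [
    "the rocket reached orbit around the moon",
    "nasa launched a new satellite into orbit",
    "the shuttle crew docked with the space station",
    "astronomers observed a distant planet and its moon",
    "the debate about religion and belief continued",
    "he argued that faith and belief need no proof",
    "atheism rejects belief in any god",
    "the church and religion shaped his moral argument",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Documents -> sparse tf-idf vectors -> two dense SVD components.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
X2 = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# A separate classifier fitted on the 2-D projection only.
clf2d = LogisticRegression(max_iter=1000).fit(X2, labels)
(w0, w1), b = clf2d.coef_[0], clf2d.intercept_[0]


def boundary_y(x):
    """y-coordinate of the line w0*x + w1*y + b = 0 (assumes w1 != 0)."""
    return -(w0 * x + b) / w1


def plot_boundary(path="boundary.png"):
    """Scatter the projected documents and draw the 2-D boundary line."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    xs = np.linspace(X2[:, 0].min(), X2[:, 0].max(), 50)
    fig, ax = plt.subplots()
    ax.scatter(X2[:, 0], X2[:, 1], c=labels, cmap="coolwarm")
    ax.plot(xs, boundary_y(xs), "k--", label="decision boundary")
    ax.set_xlabel("SVD component 1")
    ax.set_ylabel("SVD component 2")
    ax.legend()
    fig.savefig(path)
```

Calling `plot_boundary()` writes the figure to `boundary.png`. Points on the dashed line are exactly those where the 2-D model assigns probability 0.5.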
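For what it's worth, in the binary case the single value is not a plain sum of the tf-idf scores but a weighted sum: each feature is multiplied by its learned coefficient, the intercept is added, and the result goes through a sigmoid. The sketch below checks this against scikit-learn's own `decision_function` and `predict_proba`; the toy corpus and labels are hypothetical and only there to keep the example self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for the newsgroup posts.
docs = [
    "the rocket reached orbit around the moon",
    "nasa launched a new satellite into orbit",
    "the shuttle crew docked with the space station",
    "the debate about religion and belief continued",
    "he argued that faith and belief need no proof",
    "atheism rejects belief in any god",
]
labels = np.array([1, 1, 1, 0, 0, 0])

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

w = clf.coef_[0]        # one learned weight per vocabulary term
b = clf.intercept_[0]   # scalar bias

# One scalar per document: weighted sum of its tf-idf entries plus the bias.
scores = X.toarray() @ w + b

# The sigmoid turns that score into the probability of class 1.
probs = 1.0 / (1.0 + np.exp(-scores))

# The decision boundary is the set of vectors x with w . x + b == 0,
# i.e. probability exactly 0.5; it is a hyperplane in tf-idf space.
predicted = (scores > 0).astype(int)
```

So a document lands on one side or the other depending on the sign of `scores`, which is why the boundary cannot be drawn directly without first reducing the vectors to two dimensions.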
【Comments】:
Tags: python matplotlib scikit-learn logistic-regression tf-idf