sklearn: plot the decision boundary for a tfidf binary LogisticRegression classifier (answers)

[Question title]: sklearn plot decision boundary for tfidf binary LogisticRegression classifier
[Posted]: 2021-06-03 13:58:07
[Question]:

Suppose we have a very simple (and silly) LogisticRegression model...

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.feature_extraction.text import Splitter
import pandas as pd
import numpy as np

data_col1 = fetch_20newsgroups(
    subset='train',
    categories=['alt.atheism'],
    remove=('headers', 'footers', 'quotes')
)
data_col2 = fetch_20newsgroups(
    subset='train',
    categories=['sci.space'],
    remove=('headers', 'footers', 'quotes')
)

data = pd.DataFrame({
    "col1": data_col1.data[:100],
    "col2": data_col2.data[:100]
})
labels = np.random.randint(2, size=100)
train_data, test_data, train_labels, test_labels = train_test_split(
    data,
    labels,
    test_size=0.1,
    random_state=0,
    shuffle=False
)


def title_features_pipeline():
    return Pipeline([
        ('features', TfidfVectorizer(
            analyzer='word',
            stop_words='english',
            use_idf=True,
            # max_df=0.1,
            min_df=0.01,
            norm=None,
            tokenizer=Splitter()
        )),
    ], verbose=True)


pipeline = Pipeline([
        ('features', ColumnTransformer(
            transformers = [
                ('col1-features', title_features_pipeline(), "col1"),
                ('col2-features', "drop", "col2")
            ],
            remainder="drop",
        )),
        ('regression', LogisticRegression(
            multi_class='ovr',
            max_iter=1000
        ))
], verbose=True)

pipeline.fit(train_data, train_labels)
pred = pipeline.predict(test_data)
print('ROC AUC = {:.3f}'.format(roc_auc_score(test_labels, pred)))

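I've spent a lot of time going through Stack Overflow examples and GitHub code snippets, but I can't get anything to work for my specific case, and it's driving me mad. I'm sure I'm just doing something wrong!

My goal is to plot the decision boundary of this LogisticRegression classifier, so I can see which class each document falls into and the boundary that separates the two classes on a chart.

Along the way, I'd like to understand what LogisticRegression actually does with the vectors coming out of the TfidfVectorizer. All the examples I have looked at so far plot the decision boundary on the assumption that only simple scalars go into the classifier, but in this case we have long vectors (tfidf)... I don't understand how a vector gets turned into a single value that can be shown on a chart (is it the sum of all the scores in the vector, or something else?).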

[Comments]:

Tags: python matplotlib scikit-learn logistic-regression tf-idf

[Answer 1]:

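Logistic regression will learn a scalar weight for each term in the tfidf vectorizer. The vectors are converted to a score by multiplying the weight by the tfidf score and summing them all up. The model also adds a learned intercept, and the sign of the resulting score determines the predicted class.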
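As a rough illustration of that weighted sum, here is a minimal sketch that recomputes the score by hand and compares it with what scikit-learn reports. The tiny corpus and its labels are made up for the example (they stand in for the newsgroup posts), and the default `LogisticRegression` settings are an assumption rather than the asker's exact configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for the newsgroup posts.
docs = [
    "the rocket reached orbit",
    "nasa launched a new satellite",
    "the shuttle orbit was stable",
    "god and religion debate",
    "atheism and belief in god",
    "a debate about religion and belief",
]
labels = np.array([1, 1, 1, 0, 0, 0])

vectorizer = TfidfVectorizer(norm=None)
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

clf = LogisticRegression(max_iter=1000).fit(X, labels)

weights = clf.coef_[0]    # one learned weight per vocabulary term
bias = clf.intercept_[0]  # learned scalar intercept

# Score = sum over terms of (weight * tfidf value) + intercept.
# Converting to a dense array is only reasonable for a tiny corpus like this one.
manual_scores = X.toarray() @ weights + bias

# The sigmoid turns the score into the probability of class 1.
manual_probs = 1.0 / (1.0 + np.exp(-manual_scores))

print(np.allclose(manual_scores, clf.decision_function(X)))
print(np.allclose(manual_probs, clf.predict_proba(X)[:, 1]))
```

So the single value per document is not the plain sum of the tfidf scores but a weighted sum plus an intercept, and the decision boundary is the set of points where that score equals zero (probability 0.5).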

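Plotting decision boundaries is something that is commonly done in two or three dimensions. If you have a text classifier with potentially hundreds of dimensions, it's not so clear what plotting a decision boundary would mean for you.

[Comments]: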
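One possible way to get a two-dimensional picture anyway is to project the tfidf vectors down to two components and fit a separate classifier on those. The sketch below does this with `TruncatedSVD`, which is my choice for illustration and not something the answer prescribes. Note that the boundary drawn belongs to the classifier refit on the two components, not to the original high-dimensional model, and the corpus is again a made-up toy set.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus standing in for the newsgroup posts.
docs = [
    "the rocket reached orbit",
    "nasa launched a new satellite",
    "the shuttle orbit was stable",
    "god and religion debate",
    "atheism and belief in god",
    "a debate about religion and belief",
]
labels = np.array([1, 1, 1, 0, 0, 0])

# Reduce the high-dimensional tfidf vectors to two components.
X = TfidfVectorizer().fit_transform(docs)
X2 = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Fit a *new* classifier on the 2-D projection (not the original model).
clf2d = LogisticRegression(max_iter=1000).fit(X2, labels)

# Evaluate the class-1 probability on a grid covering the projected points.
pad = 0.1
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - pad, X2[:, 0].max() + pad, 200),
    np.linspace(X2[:, 1].min() - pad, X2[:, 1].max() + pad, 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]
zz = clf2d.predict_proba(grid)[:, 1].reshape(xx.shape)

fig, ax = plt.subplots()
ax.contourf(xx, yy, zz, levels=20, cmap="RdBu", alpha=0.6)
ax.contour(xx, yy, zz, levels=[0.5], colors="black")  # boundary at p = 0.5
ax.scatter(X2[:, 0], X2[:, 1], c=labels, cmap="RdBu", edgecolors="black")
ax.set_xlabel("SVD component 1")
ax.set_ylabel("SVD component 2")
fig.savefig("decision_boundary_2d.png")
```

How faithful such a picture is depends on how much of the original structure survives the projection, so it is best read as a visual aid, not as a description of the full model.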

Thanks for the reply, that makes sense. So what I need to extract from the LogisticRegression is this: "The vectors are converted to a score by multiplying the weight by the tfidf score and summing them all up". Then I can do what you describe: "Plotting decision boundaries is something that is commonly done in two or three dimensions". If I can reproduce what LogisticRegression does to the vectors, I can easily plot them.
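If the goal is just to see each document relative to the boundary of the full model, one option is to plot the one-dimensional decision score per document, with the boundary at zero. This is a sketch under assumptions: the toy corpus is invented, and the two-step pipeline is a simplified stand-in for the `ColumnTransformer` setup in the question. `Pipeline.decision_function` delegates to the final estimator after applying the transformers.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the newsgroup posts.
docs = [
    "the rocket reached orbit",
    "nasa launched a new satellite",
    "the shuttle orbit was stable",
    "god and religion debate",
    "atheism and belief in god",
    "a debate about religion and belief",
]
labels = np.array([1, 1, 1, 0, 0, 0])

pipeline = Pipeline([
    ("features", TfidfVectorizer()),
    ("regression", LogisticRegression(max_iter=1000)),
])
pipeline.fit(docs, labels)

# One signed score per document: weighted sum of tfidf values plus intercept.
scores = pipeline.decision_function(docs)

fig, ax = plt.subplots()
ax.scatter(np.arange(len(docs)), scores, c=labels, cmap="RdBu", edgecolors="black")
ax.axhline(0.0, color="black", linestyle="--", label="decision boundary (score = 0)")
ax.set_xlabel("document index")
ax.set_ylabel("decision score")
ax.legend()
fig.savefig("decision_scores_1d.png")
```

Documents above the dashed line are predicted as class 1 and those below as class 0, and the distance from the line gives a rough sense of how confident the model is.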