Features in a classifier: answers

【Question title】: Features in a classifier
【Posted】: 2020-11-21 15:01:07
【Question description】:

I am testing different classifiers (SVM, logistic regression, random forest, naive Bayes, gradient boosting).

My dataset looks like this:

Text                 User         Date          Label
some text here     LucaDiMauro   2020/02/12        0
learning ML!!!     Mika          2018/12/03        1
Attention please!  user2         2012/02/04        1

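and so on.

1 marks legitimate content; 0 marks potential spam.

I have identified the features that best capture the credibility of a post: whether the user name contains digits, the number of words, the number of characters, special characters, the use of pronouns, and the use of a number at the start of a sentence. I would like to know how to check a classifier's performance with these selected features (one classifier is enough, not all of them).

Some of my features are as follows: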

df['Punctuation'] = df['Text'].str.findall("[?!<>']+")
Count = df['Text'].str.split().str.len()
df['comma_count'] = df.Text.str.count(',')
df.Text.astype(str).str.len()
df['User'] = pd.np.where(Text.str.contains("0"),"None",

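I just want to see how these features can be taken into account in a model that predicts spam/not spam for other posts. It is not clear to me how to include these features in my classifier. I have always treated Text as the variable to clean and preprocess, and I never considered other features: only Label as y and Text as X.

For example, I used this classifier: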

# Import train_test_split function
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

# Split dataset into training set and test set

y=df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, stratify=y)

# all parameters not specified are set to their defaults
logisticRegr = LogisticRegression()
logisticRegr.fit(X_train, y_train)

logisticRegr.predict(X_test[0].reshape(1,-1))

logisticRegr.predict(X_test[0:10])

predictions = logisticRegr.predict(X_test)

score = logisticRegr.score(X_test, y_test)
print(score)

cm = metrics.confusion_matrix(y_test, predictions)
print(cm)

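I would like to know whether this code takes the other features into account or only the text. An example of how to integrate some of these features into a classifier would be very helpful.

【Comments on the question】:

Have you tried any options? If so, what did you try? Please share it with us so we can see how to help with your problem.
Oh, this looks like a text classification case. OK, here we go. 1) Read up on feature engineering for text classification. You mention using word counts and so on, but raw counts may not be indicative of spam/not spam; use TF-IDF instead. 2) The vocabulary of your email corpus can explode quickly, and words outside your known vocabulary may show up. Use the hashing trick, since it handles new words appearing in the test set.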
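
To make the two suggestions in the comment above concrete, here is a minimal sketch using scikit-learn's `TfidfVectorizer` and `HashingVectorizer`. The tiny corpus and the choice of `n_features=2**10` are assumptions made up for illustration; they are not from the thread.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

# Toy corpus, invented for illustration only.
train_texts = ["some text here", "learning ML!!!", "Attention please!"]
test_texts = ["brand new words here"]

# TF-IDF learns a vocabulary from the training texts; words that were not
# seen during fit are simply ignored when transforming new texts.
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(train_texts)
X_test_tfidf = tfidf.transform(test_texts)

# The hashing trick stores no vocabulary: each token is hashed into one of
# n_features columns, so the output width is fixed in advance and unseen
# words still land in some column.
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False)
X_train_hash = hasher.transform(train_texts)
X_test_hash = hasher.transform(test_texts)

print(X_train_tfidf.shape, X_test_tfidf.shape)
print(X_train_hash.shape, X_test_hash.shape)
```

Either matrix can be passed to a classifier's `fit` in place of a hand-built `X`. The trade-off is that hashed columns cannot be mapped back to words, which makes the model harder to inspect.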

Tags: python nltk text-classification

【Solution 1】:

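It depends on the preprocessing you have done.

If you are not sure whether the other features are included, print the head of X and check which columns it contains.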
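
The check could look roughly like this. The DataFrame below is a hypothetical stand-in built from the sample rows in the question, and the way `X` is constructed is an assumption, since the question does not show it.

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame, built from the sample rows.
df = pd.DataFrame({
    "Text": ["some text here", "learning ML!!!", "Attention please!"],
    "User": ["LucaDiMauro", "Mika", "user2"],
    "Date": ["2020/02/12", "2018/12/03", "2012/02/04"],
    "Label": [0, 1, 1],
})
df["comma_count"] = df["Text"].str.count(",")

# Assumption: X is everything except the label. Whatever columns end up in X
# are what the classifier sees, nothing more.
X = df.drop(columns=["Label"])

print(X.head())
print(X.dtypes)

# Columns that are still strings cannot be passed to LogisticRegression as-is;
# they need to be vectorized, encoded, or dropped first.
non_numeric = X.select_dtypes(exclude="number").columns.tolist()
print("non-numeric columns:", non_numeric)
```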

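Given the preprocessing you have already done, your code most likely does take the other features into account, unless you deliberately restricted X to the 'Text' column.

As a side note, if you have extracted the useful information from the Text attribute into separate attributes, you may no longer need 'Text'.

I can't add a comment because I'm new here. Could you include the head of X in the question?

Edit: you can try this.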

# Extract features from Text and User as per your observations
df['Punctuation'] = df['Text'].str.contains("[?!<>']+")  # True if any of ? ! < > ' occurs
df['word_count'] = df['Text'].str.split().str.len()  # stored in df so that it ends up in X
df['comma_count'] = df['Text'].str.count(',')
df['UserHasDigit'] = df['User'].str.contains(r'\d').astype(int)  # 1 if the user name contains a digit

# Add more if you find any useful features

# Split the dependent (Target/Label) column from independent (features) columns
y = df['Label']
X = df.drop(columns=['Text', 'User', 'Label', 'Date']) # Drop attributes from which you extracted features and the ones that add no value

# print the head of X and y to see if it is correct
print("X")
print(X.head())
print("------")
print("y")
print(y.head())

# Apply any encoding if needed (Label-encoding / one-hot encoding)
# Now, apply test train split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, stratify=y)

# carry on with the modeling as usual..

# ----- copied from your code ------
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
# ... (rest of the modeling code, as in the question)

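This code takes all of the engineered features into account (Punctuation, comma_count, UserHasDigit, plus any others you add as columns of df) but not 'Text', because that column has been dropped.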
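
If you would rather keep the text alongside the engineered features instead of dropping it, one possible approach (not part of the answer above) is scikit-learn's `ColumnTransformer`, which vectorizes the 'Text' column and scales the numeric columns in one step. The sketch below is only an illustration: the rows, user names, and labels are invented toy data, and `word_count` and `numeric_cols` are names chosen for this example.

```python
import pandas as pd
from sklearn import metrics
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data invented for this sketch; replace with the real DataFrame.
df = pd.DataFrame({
    "Text": [
        "some text here", "learning ML!!!", "Attention please!",
        "win money now, click here!!!", "free prize, free prize, claim now",
        "meeting notes from today", "cheap pills, best price",
        "I enjoyed the lecture on SVMs", "click click click!!!",
        "what a great tutorial", "limited offer, act fast",
        "see you at the workshop",
    ],
    "User": ["LucaDiMauro", "Mika", "user2", "user123", "promo99", "Anna",
             "x7x7", "Paolo", "bot42", "Sara", "deal4u", "Tom"],
    "Label": [0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Hand-crafted features, following the ideas in the answer.
df["word_count"] = df["Text"].str.split().str.len()
df["comma_count"] = df["Text"].str.count(",")
df["UserHasDigit"] = df["User"].str.contains(r"\d").astype(int)
numeric_cols = ["word_count", "comma_count", "UserHasDigit"]

preprocess = ColumnTransformer([
    # A single column name (not a list) gives the vectorizer 1-D text input.
    ("text", TfidfVectorizer(), "Text"),
    # A list of names gives the scaler a 2-D numeric block.
    ("num", StandardScaler(), numeric_cols),
])
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

X = df[["Text"] + numeric_cols]
y = df["Label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
cm = metrics.confusion_matrix(y_test, predictions)
print(cm)
print(metrics.classification_report(y_test, predictions, zero_division=0))
```

Because the preprocessing lives inside the pipeline, the vectorizer and scaler are fitted on the training split only. Swapping `LogisticRegression` for another estimator lets you compare classifiers on the same feature set. With only a dozen toy rows the scores themselves mean nothing.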

【Comments】: