How to preserve #hashtag and @mention features with the CountVectorizer token_pattern

【Question title】: How to preserve #hashtag and @mention features from CountVectorizer token_pattern
【Posted】: 2019-07-12 16:39:33
【Question】:

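I am using the sklearn library to extract word counts from tweets, but I have run into a problem with special characters being removed. I want CountVectorizer to keep the '#' and '@' characters.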

The default token_pattern parameter is: token_pattern='(?u)\b\w\w+\b'

For example, on this corpus...

['@terör @terör #terör ak @terör ali ali ...']

...the output is:

['ak', 'ali', 'terör', ...]

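CountVectorizer's default regular expression only matches runs of two or more word characters, so special characters such as '#' and '@' are dropped. How can I keep them?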
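To see why the symbols disappear, the default pattern can be tried directly with Python's `re` module. This is only a minimal sketch: it uses the visible part of the corpus above (the trailing `...` is left out), and the variable names are made up for illustration.

```python
import re

# Default token_pattern quoted in the question. (?u) is the Unicode flag,
# which is already the default for str patterns in Python 3.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Visible part of the example corpus (the elided remainder is omitted).
text = "@terör @terör #terör ak @terör ali ali"

# \w never matches '@' or '#', so a token can only start after them.
default_tokens = re.findall(DEFAULT_PATTERN, text)
print(default_tokens)
# ['terör', 'terör', 'terör', 'ak', 'terör', 'ali', 'ali']

# The distinct tokens are what ends up in the vocabulary.
print(sorted(set(default_tokens)))
# ['ak', 'ali', 'terör']
```

Because the prefix characters are never part of a match, '@terör', '#terör' and a plain 'terör' would all be counted as the same feature.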

【Question comments】:

Tags: python scikit-learn tokenize hashtag countvectorizer

【Solution 1】:

I changed the parameter to:

CountVectorizer(token_pattern=r'\b\w\w+\b|(?<!\w)@\w+|(?<!\w)#\w+')

The output is now as desired:

['@terör', '#terör', ...]
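The pattern above can be checked end to end with a short script. This is a sketch under a few assumptions: scikit-learn is installed, the corpus is the visible part of the example from the question, and the names `PATTERN`, `features` and `counts` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Visible part of the example corpus (the elided remainder is omitted).
corpus = ["@terör @terör #terör ak @terör ali ali"]

# Plain words of 2+ characters, or a word prefixed by '@' or '#'.
# The lookbehind (?<!\w) keeps strings like 'user@host' from yielding '@host'.
PATTERN = r"\b\w\w+\b|(?<!\w)@\w+|(?<!\w)#\w+"

vectorizer = CountVectorizer(token_pattern=PATTERN)
counts = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index; order terms by column.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(features)
# ['#terör', '@terör', 'ak', 'ali']
print(counts.toarray()[0].tolist())
# [1, 3, 1, 2]
```

Note that the first alternative still matches unprefixed words, so a bare 'terör' elsewhere in the corpus would become its own feature, separate from '@terör' and '#terör'.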

【Comments】: