CountVectorizer token_pattern to not catch underscore

【Question title】: CountVectorizer token_pattern to not catch underscore
【Posted】: 2021-08-23 16:03:38
【Question description】:

The default CountVectorizer token pattern treats the underscore as a letter:

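from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The rain in spain_stays']
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+\b')
X = vectorizer.fit_transform(corpus)
# On scikit-learn >= 1.2, use get_feature_names_out() instead
print(vectorizer.get_feature_names())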

This gives:

['in', 'rain', 'spain_stays', 'the']

This makes sense, because AFAIK '\w' is equivalent to [a-zA-Z0-9_]. What I need is:

['in', 'rain', 'spain', 'stays', 'the']

So I tried replacing '\w' with [a-zA-Z0-9]:

vectorizer = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9][a-zA-Z0-9]+\b')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

But I get:

['in', 'rain', 'the']

How can I get what I need? Any ideas are welcome.

【Comments】:

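\w also matches _, so there is no word boundary between the two characters n_
So what can I use instead of '\w' to get the desired result?
Without word boundaries, you could use e.g. [^\W_]+ regex101.com/r/zN3Oax/1
Or use boundaries in the form of lookarounds: (?:(?<=[\s_])|(?<=^))[^\W_]+(?=[\s_]|$) regex101.com/r/QaREpI/1
Thanks, that works. Is there a difference between the two?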

Tags: python regex scikit-learn countvectorizer

【Solution 1】:

There is no word boundary between n and _, because \w also matches the underscore.
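To make the missing boundary concrete, here is a minimal sketch using only Python's `re` module on the lowercased sentence from the question. The variable names are illustrative, and it assumes the tokenizer applies the pattern with `findall`:

```python
import re

text = "the rain in spain_stays"

# \w includes "_", so "spain_stays" is one unbroken run of word characters.
default_tokens = re.findall(r"\b\w\w+\b", text)
print(default_tokens)  # ['the', 'rain', 'in', 'spain_stays']

# Narrowing the character class does not change what \b means:
# "n" and "_" are both word characters, so \b cannot match between them,
# and neither "spain" nor "stays" can end or start at the underscore.
ascii_tokens = re.findall(r"\b[a-zA-Z0-9][a-zA-Z0-9]+\b", text)
print(ascii_tokens)  # ['the', 'rain', 'in']
```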

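Match 2 or more word characters other than the underscore, and only allow whitespace or an underscore (or the start or end of the string) on the left and right: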

(?<![^\s_])[^\W_]{2,}(?![^\s_])

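The pattern matches:

(?<![^\s_]) Negative lookbehind: asserts that the character to the left, if any, is whitespace or an underscore
[^\W_]{2,} Matches 2 or more word characters, excluding the underscore
(?![^\s_]) Negative lookahead: asserts that the character to the right, if any, is whitespace or an underscore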
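As a quick check of the pattern above, the following sketch runs it with `re.findall` on the lowercased sentence from the question. `TOKEN_PATTERN` is just an illustrative name:

```python
import re

# Lookaround-based pattern from the answer above.
TOKEN_PATTERN = r"(?<![^\s_])[^\W_]{2,}(?![^\s_])"

text = "the rain in spain_stays"
tokens = re.findall(TOKEN_PATTERN, text)
print(tokens)  # ['the', 'rain', 'in', 'spain', 'stays']
```

The same string should be usable as `CountVectorizer(token_pattern=TOKEN_PATTERN)`. The vectorizer returns feature names sorted alphabetically, so the order will differ from the list printed here.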

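See the regex demo.

A much broader match would be [^\W_]{2,}, but note that this does not take boundaries into account. It only matches runs of word characters without the underscore.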
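The difference between the two patterns only shows up when a token touches a character that is neither whitespace nor an underscore, such as punctuation. A small sketch, with a sample string made up for illustration:

```python
import re

STRICT = r"(?<![^\s_])[^\W_]{2,}(?![^\s_])"  # with lookaround boundaries
BROAD = r"[^\W_]{2,}"                        # no boundaries at all

text = "rain, spain_stays a-bc"

# "rain" is followed by "," and "bc" is preceded by "-",
# so the strict pattern rejects both.
strict = re.findall(STRICT, text)
print(strict)  # ['spain', 'stays']

# The broad pattern takes every run of 2+ non-underscore word characters.
broad = re.findall(BROAD, text)
print(broad)  # ['rain', 'spain', 'stays', 'bc']
```

Which one fits depends on the corpus: for text with ordinary punctuation, the broad pattern is probably closer to what the default `token_pattern` does, minus the underscore.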

See the different number of matches in this regex demo.

【Comments】: