How can I do multiword tokenization in sklearn? Answers

[Question title]: How can I do multiword tokenization in sklearn?
[Posted]: 2021-05-03 18:08:07
[Question description]:

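I am looking at the tokenizers in sklearn, namely CountVectorizer and DictVectorizer. I would like to be able to debug my token counts before running TF-IDF. However, I am having trouble carrying nltk's multi-word tokenizer (MWETokenizer) over to scikit-learn.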

Currently, I have the following:

from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()
tokens = ["New York", "Albany", "Buffalo", "Hudson River"]
for t in tokens:
  # add_mwe expects a sequence of words, e.g. ("New", "York")
  tokenizer.add_mwe(tuple(t.split(" ")))


# Small corpus
corpus = [
  'This is a new document about New York and the Hudson River.',
  'This is a document about California instead.'
  ]
[tokenizer.tokenize(c.split()) for c in corpus]

I get:

[['This', 'is', 'a', 'new', 'document', 'about', 'New_York', 'and', 'the', 'Hudson', 'River.'],
 ['This', 'is', 'a', 'document', 'about', 'California', 'instead.']]

The punctuation still needs handling, but "New York" is recognized as a single token, which is great.
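One possible way to deal with the trailing punctuation, assuming a simple regular-expression split is acceptable for the corpus: extract runs of word characters first, then pass them to MWETokenizer. The names `mwe_tokenizer` and `tokenize_with_mwes` are made up for this sketch and are not part of nltk.

```python
import re

from nltk.tokenize import MWETokenizer

# Multi-word expressions are registered as tuples of words.
mwe_tokenizer = MWETokenizer([("New", "York"), ("Hudson", "River")])


def tokenize_with_mwes(text):
    """Hypothetical helper: drop punctuation, then merge known expressions."""
    # Keep only runs of word characters, so "River." becomes "River".
    words = re.findall(r"\w+", text)
    return mwe_tokenizer.tokenize(words)


result = tokenize_with_mwes(
    "This is a new document about New York and the Hudson River."
)
print(result)
```

With the period stripped before merging, "Hudson River" can match as well. Matching stays case-sensitive, so the lowercase "new" is left alone.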

Trying to apply something similar with CountVectorizer, I find...

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(vocabulary=tokens, lowercase=False)
# >>> CountVectorizer(vocabulary=['New York', 'Albany', 'Buffalo', 'Hudson River'])

vectorizer.fit_transform(corpus).toarray()
# array([[0, 0, 0, 0],
#        [0, 0, 0, 0]])

This is wrong. How can I get counts for my (multi-word) dictionary using CountVectorizer (and ultimately TfidfVectorizer) in sklearn?

[Question comments]:

Tags: python scikit-learn tokenize

[Solution 1]:

You probably need to specify the n-gram range manually. I'm not sure whether this is the right approach:

from sklearn.feature_extraction.text import CountVectorizer
ng_min = max(min(map(lambda x: len(x.split()), tokens)),1)
ng_max = max(map(lambda x: len(x.split()), tokens))
vectorizer = CountVectorizer(vocabulary=tokens, lowercase=False, ngram_range=(ng_min, ng_max))
vectorizer.fit_transform(corpus).toarray()

Output:

array([[1, 0, 0, 1],
       [0, 0, 0, 0]])

[Comments]:

I believe it is because the default is ngram_range=(1, 1), according to the documentation.
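To see the effect of the default for yourself, one option is to look at what the vectorizer's analyzer emits for a sentence, using `build_analyzer()`. This is a small sketch; the sample sentence and the variable names are chosen here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "New York and the Hudson River."

# Default ngram_range=(1, 1): only single words are produced.
unigram_analyzer = CountVectorizer(lowercase=False).build_analyzer()

# ngram_range=(1, 2): single words plus every pair of adjacent words.
bigram_analyzer = CountVectorizer(
    lowercase=False, ngram_range=(1, 2)
).build_analyzer()

unigrams = unigram_analyzer(text)
with_bigrams = bigram_analyzer(text)

print(unigrams)
print(with_bigrams)
```

With the default range, "New York" is never generated as a candidate feature, so a vocabulary entry with that spelling cannot be counted. Once bigrams are included, it appears in the analyzer output and can be matched.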