【发布时间】:2020-03-13 11:17:40
【问题描述】:
现在我正在尝试使用gensim Phrases,以便根据我自己的语料库学习短语/特殊含义。
假设我有与汽车品牌相关的语料库,通过去除标点符号和停用词,标记句子,例如: p>
sent1 = 'aston martin is a car brand'
sent2 = 'audi is a car brand'
sent3 = 'bmw is a car brand'
...
这样,我想用gensim Phrases来学习,输出如下:
from gensim.models import Phrases
sents = [sent1, sent2, sent3, ...]
sents_stream = [sent.split() for sent in sents]
bigram = Phrases(sents_stream)
for sent in sents:
print(bigram [sent])
# Ouput should be like:
['aston_martin', 'car', 'brand']
['audi', 'car', 'brand']
['bmw', 'car', 'brand']
...
但是,如果很多句子有很多标点符号:
sent1 = 'aston martin is a car brand'
sent2 = 'audi is a car brand'
sent3 = 'bmw is a car brand'
sent4 = 'jaguar, aston martin, mini cooper are british car brand'
sent5 = 'In all brand, I love jaguar, aston martin and mini cooper'
...
那么输出如下:
from gensim.models import Phrases
sents = [sent1, sent2, sent3, sent4, sent5, ...]
sents_stream = [sent.split() for sent in sents]
bigram = Phrases(sents_stream)
for sent in sents:
print(bigram [sent])
# Ouput should be like:
['aston', 'martin', 'car', 'brand']
['audi', 'car', 'brand']
['bmw', 'car', 'brand']
['jaguar', 'aston', 'martin_mini', 'cooper', 'british', 'car', 'brand']
['all', 'brand', 'love', 'jaguar', 'aston', 'martin_mini', 'cooper']
...
在这种情况下,我应该如何处理带有大量标点符号的句子以防止martin_mini大小写并使输出看起来像:
['aston', 'martin', 'car', 'brand']
['audi', 'car', 'brand']
['bmw', 'car', 'brand']
['jaguar', 'aston_martin', 'mini_cooper', 'british', 'car', 'brand'] # Change
['all', 'brand', 'love', 'jaguar', 'aston_martin', 'mini_cooper'] # Change
...
非常感谢您的帮助!
【问题讨论】: