My idea is to replace each word in a sentence with its corresponding POS tag and build a new feature from the result, as follows:
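import nltk

# The tokenizer and tagger models must be downloaded once with nltk.download();
# the exact resource names depend on the NLTK version.
sentence = ["A quick brown fox jumped over the cat",
            "An apple fell from a tree",
            "I like old western classics"]
# Split each sentence into word tokens.
tokenized_sents = [nltk.word_tokenize(i) for i in sentence]
print(tokenized_sents)
# Tag every token with its part of speech.
pos_tags = [nltk.pos_tag(token) for token in tokenized_sents]
print(pos_tags)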
[[('A', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('cat', 'NN')],
 [('An', 'DT'), ('apple', 'NN'), ('fell', 'VBD'), ('from', 'IN'), ('a', 'DT'), ('tree', 'NN')],
 [('I', 'PRP'), ('like', 'VBP'), ('old', 'JJ'), ('western', 'JJ'), ('classics', 'NNS')]]
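The tag-only sequences do not have to be typed by hand; they can be derived from the tagged output. A minimal sketch, assuming the tagged sentences shown above (the helper name `tags_only` is made up for illustration and is not part of NLTK):

```python
# Tagged sentences as printed above (normally produced by nltk.pos_tag).
pos_tags = [
    [('A', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
     ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('cat', 'NN')],
    [('An', 'DT'), ('apple', 'NN'), ('fell', 'VBD'), ('from', 'IN'),
     ('a', 'DT'), ('tree', 'NN')],
    [('I', 'PRP'), ('like', 'VBP'), ('old', 'JJ'), ('western', 'JJ'),
     ('classics', 'NNS')],
]


def tags_only(tagged_sentences):
    """Drop the words and keep only the tag of every (word, tag) pair."""
    return [[tag for _word, tag in sent] for sent in tagged_sentences]


pos_tag_list = tags_only(pos_tags)
print(pos_tag_list)
```

Each inner list keeps the order and length of its sentence, so it can be fed to a sequence model in place of the original tokens.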
Now replace the words in each sentence with their POS tags and train word vectors on the resulting tag sequences.
from gensim.models import Word2Vec

# Each sentence is now a sequence of POS tags instead of words.
pos_tag_list = [['DT', 'JJ', 'NN', 'NN', 'VBD', 'IN', 'DT', 'NN'],
                ['DT', 'NN', 'VBD', 'IN', 'DT', 'NN'],
                ['PRP', 'VBP', 'JJ', 'JJ', 'NNS']]
# Written for gensim >= 4.0; gensim 3.x calls vector_size "size".
# Note: sample=1e-5 downsamples frequent tokens very aggressively on a corpus this small.
w2v_model = Word2Vec(min_count=1,
                     window=2,
                     vector_size=30,
                     sample=1e-5,
                     alpha=0.01,
                     min_alpha=0.0007,
                     negative=5,  # negative=0 with the default hs=0 leaves nothing to train
                     workers=2)
w2v_model.build_vocab(pos_tag_list, progress_per=1)
w2v_model.train(pos_tag_list, total_examples=w2v_model.corpus_count, epochs=3, report_delay=1)
# Get the vectors for the POS tags from w2v_model
# (gensim 3.x: iterate over w2v_model.wv.vocab instead of key_to_index).
my_dict = {}
for key in w2v_model.wv.key_to_index:
    my_dict[key] = w2v_model.wv[key]
# Sample output vector for one POS tag; each vector is 30-dimensional because
# the model was created with 30 dimensions.
{'DT': array([-0.01487986, 0.00341667, 0.00576919, -0.01203213, 0.01111736,
0.01643543, 0.00583243, 0.00283635, -0.00892249, 0.01334178,
0.01324782, 0.00843606, 0.00965199, 0.00849338, -0.00584444,
-0.00482766, 0.01218408, -0.00959254, -0.00172328, 0.01302824,
-0.00374165, -0.01516393, -0.00604865, 0.00170989, 0.00843781,
-0.01403714, 0.00150807, 0.01511062, 0.00798908, 0.0088043 ],
dtype=float32)}
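Now, for each entry in pos_tag_list, replace the POS tags with their vectors to build a training dataset for a Naive Bayes model. You can also use the actual word vectors together with the POS tag vectors to build a combined dataset. I have not worked on this specifically, but based on the research I have come across, I think it could work. Give it a try.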
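One way to turn the tag sequences into fixed-length rows is to average the tag vectors of each sentence, and to concatenate that average with an averaged word vector for the combined dataset. This is only a sketch under my own assumptions: the answer does not say how to pool the vectors, averaging is just one simple choice, and the toy 3-dimensional values, the `word_vectors` dictionary, and the helper names `average_vectors` and `combined_features` are hypothetical stand-ins (in practice `tag_vectors` would be `my_dict`).

```python
# Toy 3-dimensional tag vectors standing in for my_dict (hypothetical values).
tag_vectors = {
    'DT': [0.1, 0.0, 0.2],
    'NN': [0.3, 0.4, 0.0],
    'VBD': [0.0, 0.2, 0.2],
}
# Toy 2-dimensional word vectors (hypothetical values).
word_vectors = {'fox': [1.0, 2.0]}


def average_vectors(keys, vectors, dim):
    """Average the vectors of the known keys; return zeros if none is known."""
    known = [vectors[k] for k in keys if k in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def combined_features(words, tags, word_vectors, tag_vectors, word_dim, tag_dim):
    """Concatenate the averaged word vector with the averaged tag vector."""
    return (average_vectors(words, word_vectors, word_dim)
            + average_vectors(tags, tag_vectors, tag_dim))


# One fixed-length feature row per sentence, built from POS tags only.
tag_sentences = [['DT', 'NN', 'VBD'], ['NN', 'NN']]
X = [average_vectors(tags, tag_vectors, 3) for tags in tag_sentences]

# One combined row: word features first, then tag features.
row = combined_features(['fox', 'unknown'], ['NN'], word_vectors, tag_vectors, 2, 3)
```

Given one class label per sentence (the labels are not part of the example above), rows like `X` can be passed to a classifier such as scikit-learn's `GaussianNB` via `fit(X, y)`. A Gaussian variant is the natural fit here because the embedding features are real-valued and can be negative, which `MultinomialNB` does not accept.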