【Posted】: 2016-05-09 03:13:32
【Question】:
I am using the pretrained Google News dataset to get word vectors with the Gensim library in Python.
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
After loading the model, I convert the words of my training review sentences into vectors:
#reading all sentences from training file
with open('restaurantSentences', 'r') as infile:
    x_train = infile.readlines()
#cleaning sentences
x_train = [review_to_wordlist(review,remove_stopwords=True) for review in x_train]
train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])
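While building the word vectors I get many errors for words in my corpus that are not in the model. How can I retrain an already pretrained model (e.g. GoogleNews-vectors-negative300.bin) so that I get word vectors for those missing words?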
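To make the failure mode concrete, here is a minimal sketch of an averaging step that skips words with no vector and reports them instead of raising an error. It is only an illustration: `average_vector` and `toy_vectors` are hypothetical names, and the plain dictionary merely stands in for the pretrained model's word lookup.

```python
def average_vector(words, vectors, n_dim):
    """Average the vectors of known words and collect the words without one."""
    total = [0.0] * n_dim
    found = 0
    missing = []
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            # Out-of-vocabulary word: remember it instead of failing.
            missing.append(word)
            continue
        total = [t + v for t, v in zip(total, vec)]
        found += 1
    if found:
        total = [t / found for t in total]
    return total, missing


# Hypothetical two-dimensional stand-in for the pretrained vectors.
toy_vectors = {"food": [1.0, 0.0], "rice": [0.0, 1.0]}
avg, missing = average_vector(["food", "rice", "biryani"], toy_vectors, 2)
```

With this toy data, `missing` holds the words that would otherwise trigger the lookup errors described above, which are exactly the words I need vectors for.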
Here is what I have tried. First, I trained a new model from my own training sentences:
# Set values for various parameters
num_features = 300 # Word vector dimensionality
min_word_count = 10 # Minimum word count
num_workers = 4 # Number of threads to run in parallel
context = 10 # Context window size
downsampling = 1e-3 # Downsample setting for frequent words
sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
# Initialize and train the model (this will take some time)
print("Training model...")
model = gensim.models.Word2Vec(sentences, workers=num_workers, size=num_features,
                               min_count=min_word_count, window=context,
                               sample=downsampling)
model.build_vocab(sentences)
model.train(sentences)
model.n_similarity(["food"], ["rice"])
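This worked! The problem is that my dataset is very small and I have few resources for training a large model.
The second approach I am looking at is to extend an already trained model, i.e. GoogleNews-vectors-negative300.bin: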
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
model.train(sentences)
Is this possible, and is it a good approach? Please help.
【Comments】:
Tags: python nlp gensim word2vec