Improving DOC2VEC Gensim efficiency: answers

【Question title】: Improving DOC2VEC Gensim efficiency
【Posted】: 2020-06-13 10:38:34
【Question description】:

I am trying to train a Gensim Doc2Vec model on tagged documents. I have about 4,000,000 documents. Here is my code:

import pandas as pd
import multiprocessing
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
import logging
from tqdm import tqdm
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
import os
import re



def text_process(text):
    logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt='%H:%M:%S', level=logging.INFO)
    stop_words_lst = ['mm', 'machine', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'first', 'second', 'third', 'plurality', 'one', 'more', 'least', 'at', 'example', 'memory', 'exemplary', 'fourth', 'fifth', 'sixth','a', 'A', 'an', 'the', 'system', 'method', 'apparatus', 'computer', 'program', 'product', 'instruction', 'code', 'configure', 'operable', 'couple', 'comprise', 'comprising', 'includes', 'cm', 'processor', 'hardware']
    stop_words = set(stopwords.words('english'))

    temp_corpus =[]
    text = re.sub(r'\d+', '', text)
    for w in stop_words_lst:
        stop_words.add(w)
    tokenizer = RegexpTokenizer(r'\w+')
    word_tokens = tokenizer.tokenize(text)
    lemmatizer= WordNetLemmatizer()
    for w in word_tokens:
        w = lemmatizer.lemmatize(w)
        if w not in stop_words:
            temp_corpus.append(str(w))
    return temp_corpus

chunk_patent = pd.DataFrame()
chunksize = 10 ** 5
cores = multiprocessing.cpu_count()
directory = os.getcwd()
for root,dirs,files in os.walk(directory):
    for file in files:
       if file.startswith("patent_cpc -"):
           print(file)
           #f=open(file, 'r')
           #f.close()
           for chunk_patent_temp in pd.read_csv(os.path.join(root, file), chunksize=chunksize):
                #chunk_patent.sort_values(by=['cpc'], inplace=True)
                #chunk_patent_temp = chunk_patent_temp[chunk_patent_temp['cpc'] == "G06K7"]
                if chunk_patent.empty:
                    chunk_patent = chunk_patent_temp
                else:
                    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
                    chunk_patent = pd.concat([chunk_patent, chunk_patent_temp])
train_tagged = chunk_patent.apply(lambda r: TaggedDocument(words=text_process(r['text']), tags=[r.cpc]), axis=1)
print(train_tagged.values)

if os.path.exists("cpcpredict_doc2vec.model"):
    doc2vec_model = Doc2Vec.load("cpcpredict_doc2vec.model")
    doc2vec_model.build_vocab((x for x in tqdm(train_tagged.values)), update=True)
    doc2vec_model.train(train_tagged, total_examples=doc2vec_model.corpus_count, epochs=50)
    doc2vec_model.save("cpcpredict_doc2vec.model")
else:
    doc2vec_model = Doc2Vec(dm=0, vector_size=300, min_count=100, workers=cores-1)
    doc2vec_model.build_vocab((x for x in tqdm(train_tagged.values)))
    doc2vec_model.train(train_tagged, total_examples=doc2vec_model.corpus_count, epochs=50)
    doc2vec_model.save("cpcpredict_doc2vec.model")

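I have tried modifying the Doc2Vec parameters, but without any luck.

On the same data I trained a Word2Vec model, and it is much more accurate than the Doc2Vec model. In addition, the most_similar results of the Word2Vec model are very different from those of the Doc2Vec model.

Here is the code I use to query the most similar results: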

from gensim.models import Word2Vec
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
import logging
from gensim.models import Doc2Vec
import re

def text_process(text):
    logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt='%H:%M:%S', level=logging.INFO)
    stop_words_lst = ['mm', 'machine', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'first', 'second', 'third', 'example', 'memory', 'exemplary', 'fourth', 'fifth', 'sixth','a', 'A', 'an', 'the', 'system', 'method', 'apparatus', 'computer', 'program', 'product', 'instruction', 'code', 'configure', 'operable', 'couple', 'comprise', 'comprising', 'includes', 'cm', 'processor', 'hardware']
    stop_words = set(stopwords.words('english'))
    #for index, row in df.iterrows():
    temp_corpus =[]
    text = re.sub(r'\d+', '', text)
    for w in stop_words_lst:
        stop_words.add(w)
    tokenizer = RegexpTokenizer(r'\w+')
    word_tokens = tokenizer.tokenize(text)
    lemmatizer= WordNetLemmatizer()
    for w in word_tokens:
        w = lemmatizer.lemmatize(w)
        if w not in stop_words:
            temp_corpus.append(str(w))
    return temp_corpus

model = Word2Vec.load("cpc.model")
print(model.most_similar(positive=['barcode'], topn=30))

model1 = Doc2Vec.load("cpcpredict_doc2vec.model")

pred_tags = model1.most_similar('barcode',topn=10)
print(pred_tags)

The output of the above is quoted below:

[('indicium', 0.36468246579170227), ('symbology', 0.31725651025772095), ('G06K17', 0.29797130823135376), ('dataform', 0.29535001516342163), ('rogue', 0.29372256994247437), ('certification', 0.29178398847579956), ('reading', 0.27675414085388184), ('indicia', 0.27346929907798767), ('Contra', 0.2700084149837494), ('redemption', 0.26682156324386597)]

[('searched', 0.4693435728549957), ('automated', 0.4469209909439087), ('production', 0.4364866018295288), ('hardcopy', 0.42193126678466797), ('UWB', 0.4197841286659241), ('technique', 0.4149003326892853), ('authorized', 0.4134449362754822), ('issued', 0.4129987359046936), ('installing', 0.4093806743621826), ('thin', 0.4016669690608978)]

【Question discussion】:

Tags: python nltk gensim word2vec doc2vec

【Solution 1】:

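The Doc2Vec mode you have chosen, dm=0 (also known as "PV-DBOW"), does not train word vectors at all. Because the different models share code paths, the word vectors are still randomly initialized, but they are never trained and are therefore meaningless.

Hence the results of most_similar() with a word as the query are essentially random. (Calling most_similar() on the model itself, rather than on its .wv word vectors or .docvecs doc vectors, should also produce a deprecation warning.)

If you need the Doc2Vec model to train word vectors in addition to doc vectors, use either the dm=1 mode ("PV-DM") or dm=0, dbow_words=1 (which adds optional interleaved skip-gram word training to plain DBOW training). In both cases the words are trained very much as in a Word2Vec model (in "CBOW" or "skip-gram" mode, respectively), so the word-based most_similar() results should then be very comparable.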

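Separately:

If you have enough data to train 300-dimensional vectors and to discard all words that occur fewer than 100 times, then 50 training epochs may be more than needed.
Those most_similar() results do not look like the result of any lemmatization, which your text_process() method seems intended to perform, but perhaps that is not a problem, or it is some other problem entirely. Note, though, that with enough data lemmatization may be a superfluous step: when there are plenty of varied examples of every word variant in real contexts, all variants of the same word tend to end up usefully close to each other.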

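【Discussion】:

One more question: the dataset I am using has about 7,000 unique tags. Does the efficiency of the Doc2Vec model suffer because of the large number of unique tags?
Only you can answer that, because it depends on your texts, your tags, and the project-specific meaning of "efficiency". But note that the typical/original description of Doc2Vec (as the "Paragraph Vector" algorithm) tends to give each document its own unique ID tag, so if you followed that precedent you would have 4,000,000 tags. To some extent, using only 7,000 unique tags is like using only 7,000 unique virtual documents (each the concatenation of all your documents that share a tag). That may not be enough variety to drive 300-dimensional doc vectors.
So why do 4,000,000 distinct documents have only 7,000 document tags? (How big is each document? If each of the 4,000,000 documents does not get its own unique doc vector, what is the ultimate goal of the doc vectors?)
I understand the problem now. I was trying to train doc2vec vectors against the tags and then run a classification algorithm on top of them. However, the accuracy of this technique was very low (~3.2%). After that, I tried training a classifier on the same dataset with logistic regression, but the accuracy was again very low. I then tried training the model on only three tags, and the accuracy jumped to 71%. I therefore concluded that the number of tags may be reducing the accuracy of the system.
Of course, classifying a text into the one correct class out of 7,000 possible answers is harder than picking the correct one out of 3. But a lot depends on your choices, including algorithm, parameters, and preprocessing. As mentioned, if you supply only 7K unique tags rather than 4M, you are effectively collapsing your training set into just 7K large documents, which may dilute much of the meaningful variability of the original documents. Higher accuracy is probably possible, but among the many classification steps not shown, it is impossible to guess which others may be suboptimal.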