【Posted】: 2020-12-29 04:14:50
【Question】:
I want to try this Doc2Vec model for my experiment:
http://tutorialspoint.com/gensim/gensim_doc2vec_model.htm
I want to convert my dataset into a corpus to use as training data and apply the Gensim model to it.
Here is the link to my dataset:
https://drive.google.com/file/d/1S80I_5zkjJfeTzby7OjIqrs1vMJI6jVo/view?usp=sharing
I have already consulted this StackOverflow question, but it did not solve my problem:
How to create corpus from pandas data frame to operate with NLTK
You can also view my code in this Google Colab notebook:
https://colab.research.google.com/drive/1BmBNrfsxQ0AIJH_1hfMaMAceQLh2Xk7Q?usp=sharing
import pandas as pd
dataset = pd.read_csv('ADL_Two_column_MoCo.csv',encoding = 'unicode_escape')
dataset = dataset.dropna()
import gensim
def tagged_document(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])
data = [dataset]
data
data_for_training = list(tagged_document(data))
model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
model.build_vocab(data_for_training)
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)
len(data_for_training)
1
data_for_training
[TaggedDocument(words= Smile Canonical Column \
0 C1=CC=C(C=C1)C2OC(C(O2)CO)CO CHIRALPAK AD
1 C1=CC=C(C=C1)C(C(C2=CC=CC=C2)O)O CHIRALPAK AD
2 CC(C1=CC=C(C=C1)C2=CC=CC=C2)O CHIRALPAK AD
5 CC(C1=CC=CC=C1)OC(=O)C2=CC(=CC(=C2)[N+](=O)[O-... CHIRALPAK AD
6 C1=CC=C2C(=C1)C=CC(=C2C3=C(C=CC4=CC=CC=C43)O)O CHIRALPAK AD
.. ... ...
839 C1CC(=O)NC(=O)C1N2C(=O)C3=CC=CC=C3C2=O CHROMEGACHIRAL CCJ
840 CC(C1=CC=C(S1)C(=O)C2=CC=CC=C2)C(=O)O CHROMEGACHIRAL CCJ
841 CCC(COC(=O)C1=CC(=C(C(=C1)OC)OC)OC)(C2=CC=CC=C... CHROMEGACHIRAL CCJ
842 CCC(COC(=O)C1=CC(=C(C(=C1)OC)OC)OC)(C2=CC=CC=C... CHROMEGACHIRAL CCJ
843 CCC(COC(=O)C1=CC(=C(C(=C1)OC)OC)OC)(C2=CC=CC=C... CHROMEGACHIRAL CCJ
Mobile phase
0 methanol
1 n-hexane / ethanol
2 water / acetonitrile
5 methanol
6 n-hexane / 2-propanol
.. ...
839 methanol
840 n-hexane / 2-propanol / trifluoroacetic acid
841 n-heptane / 2-propanol / diethylamine
842 n-hexane / 2-propanol
843 methanol / diethylamine
[828 rows x 3 columns], tags=[0])]
This is the value I get.
RuntimeError Traceback (most recent call last)
<ipython-input-45-72344a512bb5> in <module>
----> 1 model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)
C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\doc2vec.py in train(self, documents, corpus_file, total_examples, total_words, epochs, start_alpha, end_alpha, word_count, queue_factor, report_delay, callbacks)
555 sentences=documents, corpus_file=corpus_file, total_examples=total_examples, total_words=total_words,
556 epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
--> 557 queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks, **kwargs)
558
559 @classmethod
C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in train(self, sentences, corpus_file, total_examples, total_words, epochs, start_alpha, end_alpha, word_count, queue_factor, report_delay, compute_loss, callbacks, **kwargs)
1065 total_words=total_words, epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
1066 queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks,
-> 1067 **kwargs)
1068
1069 def _get_job_params(self, cur_epoch):
C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in train(self, data_iterable, corpus_file, epochs, total_examples, total_words, queue_factor, report_delay, callbacks, **kwargs)
533 epochs=epochs,
534 total_examples=total_examples,
--> 535 total_words=total_words, **kwargs)
536
537 for callback in self.callbacks:
C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in _check_training_sanity(self, epochs, total_examples, total_words, **kwargs)
1171
1172 if not self.wv.vocab: # should be set by `build_vocab`
-> 1173 raise RuntimeError("you must first build vocabulary before training the model")
1174 if not len(self.wv.vectors):
1175 raise RuntimeError("you must initialize vectors before training the model")
RuntimeError: you must first build vocabulary before training the model
I have already built the vocabulary, so I think the problem is in the DataFrame.
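A minimal sketch of what the DataFrame problem might be, assuming the goal is one document per row. `data = [dataset]` wraps the whole DataFrame in a one-element list, so the generator yields a single TaggedDocument whose `words` is the DataFrame itself, which matches the `len(data_for_training)` of 1 shown above. The sketch below builds one TaggedDocument per row instead. The helper names `row_to_tokens` and `tagged_rows`, the whitespace tokenization, and the use of plain dicts for rows are assumptions made for illustration and are not part of the original code.

```python
try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, only so this sketch runs without gensim.
    from collections import namedtuple
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


def row_to_tokens(row):
    """Flatten one row (a dict of column -> cell) into a list of string tokens.

    Whitespace splitting is a placeholder choice (assumption); each SMILES
    string stays a single token because it contains no spaces.
    """
    tokens = []
    for value in row.values():
        tokens.extend(str(value).split())
    return tokens


def tagged_rows(rows):
    """Yield one TaggedDocument per row, tagged with the row's position."""
    for i, row in enumerate(rows):
        yield TaggedDocument(row_to_tokens(row), [i])


# Two rows copied from the output shown above. With pandas, the same structure
# could come from dataset.to_dict("records").
rows = [
    {"Smile Canonical": "C1=CC=C(C=C1)C2OC(C(O2)CO)CO",
     "Column": "CHIRALPAK AD", "Mobile phase": "methanol"},
    {"Smile Canonical": "C1=CC=C(C=C1)C(C(C2=CC=CC=C2)O)O",
     "Column": "CHIRALPAK AD", "Mobile phase": "n-hexane / ethanol"},
]

data_for_training = list(tagged_rows(rows))
print(len(data_for_training))        # one document per row
print(data_for_training[1].words)
```

How to tokenize SMILES strings and solvent mixtures is a modeling decision. Whole-string SMILES tokens will mostly occur only once and would likely be dropped by `min_count=2`, so a finer-grained tokenizer may be needed in practice.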
【Comments】:
-
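That 'TutorialsPoint' page is a poor intro tutorial: it does not even use logical documents, just runs of words cobbled together from the large text8 dataset. The small demo included with the Gensim docs - radimrehurek.com/gensim/auto_examples/tutorials/… - is a better starting point. If you are having trouble adapting a tutorial to your data, your question should describe more clearly what you tried and which error or blocking step you hit. Simply saying "it doesn't work" does not show enough detail or effort for answerers to be able to help.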
@gojomo I have edited my question and added my experiment; please take a look if you can help...
-
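You should show the entire error message (with the traceback stack) for your error. Also, your code uses a variable data that is never defined, so I cannot even see how it would reach the Doc2Vec-related lines without an earlier error. Are you sure the code you have shown triggers the error you report?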
@gojomo I have corrected that error and added a link to my Google Colab code, where you can find my logic!
-
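Have you looked at the contents of the data_for_training variable to make sure it contains what you expect? If you enable logging at the INFO level and watch the logging output, do the earlier steps (before train()) work as expected? (This kind of error occurs if your corpus is empty, or is otherwise made to look empty.)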
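The two checks suggested in this comment can be sketched roughly as follows. The block turns on INFO-level logging through the standard `logging` module and counts which tokens would survive a `min_count` threshold. The helper name `surviving_vocab` and the `Doc` stand-in are hypothetical, and the sketch assumes that Gensim's vocabulary scan simply iterates each document's `words`; a plain dict stands in for the DataFrame because iterating either one yields only the column labels.

```python
import logging
from collections import Counter, namedtuple

# INFO-level logging lets Gensim report what build_vocab() and train() saw.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
)

# Hypothetical stand-in with the same fields as gensim's TaggedDocument.
Doc = namedtuple("Doc", "words tags")


def surviving_vocab(corpus, min_count):
    """Count tokens over a corpus and keep those seen at least min_count times.

    Assumption: this mirrors a vocabulary scan that iterates doc.words, so a
    table stored in `words` contributes its column labels, not its cells.
    """
    counts = Counter()
    for doc in corpus:
        for word in doc.words:
            counts[word] += 1
    return {word: n for word, n in counts.items() if n >= min_count}


# Broken shape from the question: one document whose "words" is a whole table.
# A dict is used as a pandas-free stand-in; iterating it yields the three keys.
table_like = {
    "Smile Canonical": ["C1=CC=C(C=C1)C2OC(C(O2)CO)CO"],
    "Column": ["CHIRALPAK AD"],
    "Mobile phase": ["methanol"],
}
broken = [Doc(table_like, [0])]

# Intended shape: one document per row, each holding a list of string tokens.
fixed = [
    Doc(["CHIRALPAK", "AD", "methanol"], [0]),
    Doc(["CHIRALPAK", "AD", "n-hexane", "/", "ethanol"], [1]),
]

print(surviving_vocab(broken, min_count=2))  # {} : every label occurs once
print(surviving_vocab(fixed, min_count=2))   # {'CHIRALPAK': 2, 'AD': 2}
```

If the first result is empty for the real corpus, that would be consistent with the "you must first build vocabulary" error, since nothing passes the `min_count=2` filter.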
Tags: python pandas dataframe gensim corpus