我可以将 Python pandas 数据框用于 NLP 语料库或文档吗？答案

【问题标题】：Can I use Python pandas dataframe to NLP corpus or documentation?我可以将 Python pandas 数据框用于 NLP 语料库或文档吗？
【发布时间】：2020-12-29 04:14:50
【问题描述】：

我想试试这个模型 doc_to_vec 作为我的实验

http://tutorialspoint.com/gensim/gensim_doc2vec_model.htm

我想将我的数据集转换为语料库作为训练数据集并应用 Gensim 模型。

这是我的数据集链接

https://drive.google.com/file/d/1S80I_5zkjJfeTzby7OjIqrs1vMJI6jVo/view?usp=sharing

我已经提到了这个 StackOverflow 问题，但无法解决

How to create corpus from pandas data frame to operate with NLTK

你也可以在这里查看我的代码 google colab

https://colab.research.google.com/drive/1BmBNrfsxQ0AIJH_1hfMaMAceQLh2Xk7Q?usp=sharing

import pandas as pd
dataset = pd.read_csv('ADL_Two_column_MoCo.csv',encoding = 'unicode_escape')
dataset = dataset.dropna()

import gensim
def tagged_document(list_of_list_of_words):
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])


data = [dataset]
data
data_for_training = list(tagged_document(data))

model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)

model.build_vocab(data_for_training)

model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)

len(data_for_training)
1

data_for_training

    [TaggedDocument(words=                                       Smile Canonical              Column  \
 0                         C1=CC=C(C=C1)C2OC(C(O2)CO)CO        CHIRALPAK AD   
 1                     C1=CC=C(C=C1)C(C(C2=CC=CC=C2)O)O        CHIRALPAK AD   
 2                        CC(C1=CC=C(C=C1)C2=CC=CC=C2)O        CHIRALPAK AD   
 5    CC(C1=CC=CC=C1)OC(=O)C2=CC(=CC(=C2)[N+](=O)[O-...        CHIRALPAK AD   
 6       C1=CC=C2C(=C1)C=CC(=C2C3=C(C=CC4=CC=CC=C43)O)O        CHIRALPAK AD   
 ..                                                 ...                 ...   
 839             C1CC(=O)NC(=O)C1N2C(=O)C3=CC=CC=C3C2=O  CHROMEGACHIRAL CCJ   
 840              CC(C1=CC=C(S1)C(=O)C2=CC=CC=C2)C(=O)O  CHROMEGACHIRAL CCJ   
 841  CCC(COC(=O)C1=CC(=C(C(=C1)OC)OC)OC)(C2=CC=CC=C...  CHROMEGACHIRAL CCJ   
 842  CCC(COC(=O)C1=CC(=C(C(=C1)OC)OC)OC)(C2=CC=CC=C...  CHROMEGACHIRAL CCJ   
 843  CCC(COC(=O)C1=CC(=C(C(=C1)OC)OC)OC)(C2=CC=CC=C...  CHROMEGACHIRAL CCJ  




                                      Mobile phase  
 0                                        methanol  
 1                              n-hexane / ethanol  
 2                            water / acetonitrile  
 5                                        methanol  
 6                           n-hexane / 2-propanol  
 ..                                            ...  
 839                                      methanol  
 840  n-hexane / 2-propanol / trifluoroacetic acid  
 841         n-heptane / 2-propanol / diethylamine  
 842                         n-hexane / 2-propanol  
 843                       methanol / diethylamine  
 
 [828 rows x 3 columns], tags=[0])]

这是我得到的值。

我收到此错误

 RuntimeError                              Traceback (most recent call last)
<ipython-input-45-72344a512bb5> in <module>
----> 1 model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\doc2vec.py in train(self, documents, corpus_file, total_examples, total_words, epochs, start_alpha, end_alpha, word_count, queue_factor, report_delay, callbacks)
    555             sentences=documents, corpus_file=corpus_file, total_examples=total_examples, total_words=total_words,
    556             epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
--> 557             queue_factor=queue_factor, report_delay=report_delay, callbacks=callbacks, **kwargs)
    558 
    559     @classmethod

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in train(self, sentences, corpus_file, total_examples, total_words, epochs, start_alpha, end_alpha, word_count, queue_factor, report_delay, compute_loss, callbacks, **kwargs)
   1065             total_words=total_words, epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
   1066             queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks,
-> 1067             **kwargs)
   1068 
   1069     def _get_job_params(self, cur_epoch):

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in train(self, data_iterable, corpus_file, epochs, total_examples, total_words, queue_factor, report_delay, callbacks, **kwargs)
    533             epochs=epochs,
    534             total_examples=total_examples,
--> 535             total_words=total_words, **kwargs)
    536 
    537         for callback in self.callbacks:

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in _check_training_sanity(self, epochs, total_examples, total_words, **kwargs)
   1171 
   1172         if not self.wv.vocab:  # should be set by `build_vocab`
-> 1173             raise RuntimeError("you must first build vocabulary before training the model")
   1174         if not len(self.wv.vectors):
   1175             raise RuntimeError("you must initialize vectors before training the model")

RuntimeError: you must first build vocabulary before training the model

虽然我已经做了词汇，但是数据框中的问题。

【问题讨论】：

'TutorialsPoint' 的那个页面是一个糟糕的入门教程——它甚至没有使用逻辑文档，只是从大量 text8 数据集中拼凑起来的单词。包含 Gensim 文档的小演示 - radimrehurek.com/gensim/auto_examples/tutorials/… - 是一个更好的起点。如果您无法根据数据调整一些教程，您的问题应该更清楚地描述您尝试过的内容以及您遇到的错误/阻止步骤。简单地说“行不通”并不能真正显示出足够的细节或努力让回答者能够提供帮助。
@gojomo 我已经编辑了我的问题并把我的实验看看如果你能帮忙的话......
您应该为您的错误显示整个错误消息（带有回溯堆栈）。此外，您的代码使用了一个从未定义过的变量data，所以我什至看不到它如何在没有早期错误的情况下到达Doc2Vec 相关行。您确定您显示的 cod 会触发您报告的错误吗？
@gojomo 我已经纠正了这个错误并添加了我的 google colab 代码链接，您可以在其中找到我的逻辑！
您是否查看过data_for_training 变量的内容以确保它包含您所期望的内容？如果您在 INFO 级别启用日志记录并观察日志记录输出，那么前面的步骤（train() 之前）是否按预期工作？（如果您的语料库是空的或以其他方式使其看起来是空的，就会出现这种错误。）

标签： python pandas dataframe gensim corpus

【解决方案1】：

这里有一个非常好的doc2vec资源：https://towardsdatascience.com/how-to-vectorize-text-in-dataframes-for-nlp-tasks-3-simple-techniques-82925a5600db

#1. the text must have spaces before and after the =.
#2. word tokenize the doc generating a list of tokenized docs
#3. doc2bow to vectorize the doc into a list of a list
#4. store the corpus in a dataframe column of type object.

 from gensim.corpora.dictionary import Dictionary
 from nltk.tokenize import word_tokenize

 df=pd.read_csv('smile.csv')

 df['corpus']=np.empty
 df['corpus']=df['corpus'].astype(object)
 for key,row in df.iterrows():
     doc=str(row['Smile Canonical'])
     doc=doc.replace('=',' = ')

     tokenized_docs=[word_tokenize(doc.lower())]
     #print(dictionary.token2id)
     dictionary=Dictionary(tokenized_docs)
     corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
     #print(corpus)
     df.loc[key,'corpus']=corpus

print("每个文档都被转换成一个词袋，表示每个token出现的频率") 打印(df.head())

【讨论】：