Error in Data Processing in Gensim LDA using Pandas Dataframe

【Question title】: Error in Data Processing in Gensim LDA using Pandas Dataframe
【Posted】: 2020-08-29 10:03:32
【Question】:

I am doing topic modeling with Gensim LDA and using a pandas DataFrame to process the data, but I get this error:

TypeError: decoding to str: need a bytes-like object, Series found

I need to process the data using Pandas only. The input data looks like this (one document per row):

 PMID           Text
12755608    The DNA complexation and condensation properties
12755609    Three proteins namely protective antigen PA edition
12755610    Lecithin retinol acyltransferase LRAT catalyze

My code is:

data = pd.read_csv("h1.csv", delimiter = "\t")
data = data.dropna(axis=0, subset=['Text'])
data['Index'] = data.index
data["Text"] = data['Text'].str.replace('[^\w\s]','')
data.head()

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token):
            result.append(lemmatize_stemming(token))
    return result


input_data = data.Text.str.strip().str.split('[\W_]+')
print('\n\n tokenized and lemmatized document: ')
print(preprocess(input_data))

【Comments on the question】:

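Judging by the error message, I'd guess your function expects a single string, e.g. "The DNA complexation and condensation properties", but you are passing it a pandas.Series instead. Without dummy data it is hard to say exactly where the error occurs and how to fix it.
@KenHBS I have updated the question with data; the remaining rows are similar. And yes, I need to pass a string. Any suggestions?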

Tags: python pandas dataframe gensim lda

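【Solution 1】:

Try this. `preprocess` expects a single string, so apply it to each row of the Text column with `map` instead of passing the whole Series: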

import gensim
from gensim import corpora

def preprocess(text):
    # Tokenize one string; drop stopwords and tokens of two characters or fewer
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2:
            result.append(token)
    return result

# `data` is the DataFrame loaded in the question; map() calls preprocess once per row
doc_processed = data['Text'].map(preprocess)

dictionary = corpora.Dictionary(doc_processed)
# Prepare a document-term matrix (one bag-of-words per document)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_processed]

# LDA model class
Lda = gensim.models.ldamodel.LdaModel
# num_topics is the number of topics required;
# passes is the number of training passes over the corpus
ldamodel = Lda(doc_term_matrix, num_topics=2, id2word=dictionary, passes=2)
result = ldamodel.print_topics(num_topics=5, num_words=15)

【Comments】:

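Could you add a bit more explanation? I get the same error, but I use my own preprocessing code; it works for training but not when testing on a single unseen document.
That depends on your data. Can you post your code and some sample data?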