How to put docx content in dataframe columns? Answers

【Question title】: How to put docx content in dataframe columns?
【Posted】: 2021-07-12 06:29:36
【Question description】:

Here is my code:

if t.endswith('.docx'):
        def get_files(extension, location):
            v_doc = []
            for root, dirs, files in os.walk(location):
                for t in files:
                    if t.endswith(extension):   
                        v_doc.append(t)
            return v_doc
        
        file_list = get_files('.docx', paths)
        #print(file_list)
        index = 0
        for file in file_list:
                index += 1
                doc = Document(file)
                #print(doc)
                column_label = f'column{index}'
                data_content = doc.paragraphs
                final = []
                for f in data_content:
                    final.append(f.text)
                    new = [x for x in final if x]
                    #j = {column_label: new}
                    #print(j)
                    df_last = pd.DataFrame(new, columns= 
                                              [column_label])
                    df_last.to_excel('output_dummy.xlsx')

But I run into the following problem:

column2:
#hello how are you guys?
#i hope you are all doing fine

Expected dataframe output:

column1:                                                 column2:
#This column is getting replaced by column 2             #hello how are you guys?

#some random dummy text                                  #i hope you are all doing fine

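docx1 contains: #This column is getting replaced by column 2 #some random dummy text

docx2 contains: #hello how are you guys? #i hope you are all doing fine

I know this is a silly question. Where am I making the mistake?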

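【Question comments】:

Please provide sample docx files and a complete MRE. Also, which library are you using to open the *.docx files?
Hi, thanks for your reply. I have already solved this problem; it is quite old by now. Could you take a look at my new question here instead? stackoverflow.com/questions/68413792/…

Tags: python pandas dataframe docx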

【Solution 1】:

I found the answer.

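The column label has to keep counting across file types. Where the .docx loop used f'column{index}', change it to f'column{index+index2}', where index2 is the number of Excel files already processed. That way the .docx columns continue after the Excel columns instead of reusing their labels.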

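import pandas as pd
from docx import Document

# Assumed setup, not shown in the original answer: file_list2 and file_list
# hold the names of the Excel and .docx files found under datas/.
g = []       # raw cell values read from the Excel files
final = []   # one {column_label: value} dict per value
index2 = 0   # counts the Excel files, just as index counts the .docx files

for file2 in file_list2:
    file2 = 'datas/' + file2
    index2 += 1
    column_label2 = f'seller{index2}'
    df = pd.read_excel(file2, header=None, index_col=False)
    for l in df.values:
        for s in l:
            g.append(s)

# Drop empty (NaN) cells. As written, every Excel value is stored under
# the label of the last Excel file.
t = [incom for incom in g if str(incom) != 'nan']
for s in t:
    final.append({column_label2: s})

index = 0
for file in file_list:
    file = 'datas/' + file
    index += 1
    doc = Document(file)
    # Continue the numbering after the Excel columns.
    column_label = f'seller{index+index2}'
    # Text inside tables, skipping unwanted cell values.
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                new_list = [p.text for p in cell.paragraphs
                            if p.text not in ['5', '3', '0.1%', '1%', '1', 'Bill', 'Number']]
                for s in new_list:
                    final.append({column_label: s})

    # Body paragraphs outside tables.
    y = [d.text for d in doc.paragraphs
         if d.text not in ['5', '3', '0.1%', '1%', '1', 'Number']]
    for k in y:
        final.append({column_label: k})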
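
The answer stops once `final` is filled and does not show how it becomes a dataframe. The following is only a sketch of one possible last step, not the author's own code. It assumes `final` is a list of single-key dicts as built above; the sample values and the names `sparse` and `compact` are made up for illustration.

```python
import pandas as pd

# Hypothetical contents of `final` after the loops above have run.
final = [
    {"seller1": "a"},
    {"seller1": "b"},
    {"seller2": "c"},
]

# One row per dict: each row holds a value in one column and NaN elsewhere.
sparse = pd.DataFrame(final)

# Drop the NaN gaps in each column and renumber the rows so that every
# column starts at row 0. Shorter columns are padded with NaN at the end.
compact = pd.DataFrame(
    {col: sparse[col].dropna().reset_index(drop=True) for col in sparse.columns}
)
print(compact)
```

Without the compacting step, each value sits on its own row, which is usually not what you want in the exported spreadsheet.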

【Discussion】:
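
One likely reading of the question's code: `pd.DataFrame(...)` and `to_excel(...)` are called inside the loops, so each file builds a fresh one-column dataframe and overwrites `output_dummy.xlsx`, leaving only the last file's column. A minimal sketch of the alternative is to collect one column per document and build the dataframe once. The helper name `texts_to_frame` and the `docs` sample are hypothetical. With python-docx, the text lists would presumably come from `[p.text for p in Document(path).paragraphs]`.

```python
import pandas as pd


def texts_to_frame(texts_by_label):
    """Build one DataFrame with a column per document.

    texts_by_label maps a column label to the list of paragraph texts
    of one document (hypothetical input format).
    """
    columns = {}
    for label, texts in texts_by_label.items():
        # Drop empty paragraphs, as the question's list comprehension does.
        columns[label] = pd.Series([t for t in texts if t], dtype="object")
    # Series of different lengths are aligned on the row index; shorter
    # columns are padded with NaN.
    return pd.DataFrame(columns)


docs = {
    "column1": ["This column is getting replaced by column 2", "", "some random dummy text"],
    "column2": ["hello how are you guys?", "i hope you are all doing fine"],
}
df_all = texts_to_frame(docs)
print(df_all)

# Write once, after every file has been processed:
# df_all.to_excel("output_dummy.xlsx", index=False)
```

Note also that `get_files` in the question returns bare file names, so files in subfolders would need `os.path.join(root, t)` before they can be opened.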