Multiple .html to single csv with beautifulsoup

【Question title】: Multiple .html to single csv with beautifulsoup
【Posted】: 2021-09-04 18:47:15
【Question description】:

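I have 13,000 html files in a folder, and I am trying to get their data into a single csv file.

I believe I have mostly got it working, but I seem to run into problems when writing to the csv, no matter what I try.

Here is my current code: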

import re
import csv

from bs4 import BeautifulSoup

path = r'C:/Users/Mx/Testing/Infod'

ext = '.htm'

for filename in os.listdir(path):
    
    if filename.endswith(ext):

        fullpath = os.path.join(path, filename)

        filename = os.path.splitext(os.path.basename(filename))[0]

        soup = BeautifulSoup(open(fullpath, encoding="utf-8"), 'html.parser')

        text = soup.get_text()

        ref = soup.find("td", text="Reference")
        pattern = re.compile(r'GBBTI\S{9}')
        IC = soup.find("b", text="Issuing country")
        cx = IC.findNext("td").contents
        SD = soup.find("b", text="Start date of validity")
        SDX = SD.findNext("td").contents 
        ED = soup.find("b", text="End date of validity")           
        EDX = ED.findNext("td").content
        NC = soup.find("b", text="Nomenclature code")
        NCX = NC.findNext("td").contents        
        CJ = soup.find("b", text="Classification justification")
        CJX = CJ.findNext("td").contents        
        L = soup.find("b", text="Language")
        LX = L.findNext("td").contents        
        POI = soup.find("b", text="Place of issue")
        POIX = POI.findNext("td").contents
        DOI = soup.find("b", text="Date of issue")
        DOIX = DOI.findNext("td").contents
        NAA = soup.find("b", text="Name and adress")
        NAAX = NAA.findNext("td").contents
        DOG = soup.find("b", text="Description of goods")
        DOGX = DOG.findNext("td").contents
        NK = soup.find("b", text="National keywords")
        NKX = NK.findNext("td").contents



        
        with open('names.csv', 'w') as csvfile:
            fieldnames = ['Ref', 'country', 'Start date of Validity', 'End date of validity', 'Nomenclature code', 'Classification justification', 'Language', 'Place of issue', 'Date of issue', 'Name and address', 'Description', 'keywords']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)



            writer.writerow((soup.find('td', text=pattern)),cx, SDX, EDX, NCX, CJX, LX, POIX, DOIX, NAAX, DOGX, NKX)

Any advice would be greatly appreciated.

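【Question discussion】:

What problem are you facing?
It doesn't save to the csv.
Could you post what the CSV looks like after running this script?
I think your CSV has only two rows after running this script. @MxMorrigan

Tags: python html csv beautifulsoup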

【Solution 1】:

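The first problem is that you open the file for writing on every iteration of the loop, that is, once per html file, with:

open('names.csv', 'w')

Mode 'w' truncates the file, so every iteration deletes the previous data and writes only the new row. To prevent this, and to speed up the whole process, I suggest opening the file once before the for loop (and don't forget to close it afterwards).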
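The overwriting described above can be reproduced with a small sketch. The file names and sample rows here are made up for illustration, and a temporary directory stands in for the real output location. Using `with` also closes the file automatically.

```python
import os
import tempfile

rows = ["a,1", "b,2", "c,3"]  # made-up sample rows

with tempfile.TemporaryDirectory() as tmp:
    # Reopening with mode 'w' inside the loop truncates the file each time,
    # so only the last row survives.
    reopened = os.path.join(tmp, "reopened.csv")
    for row in rows:
        with open(reopened, "w") as f:
            f.write(row + "\n")
    with open(reopened) as f:
        reopened_lines = f.read().splitlines()

    # Opening once before the loop keeps every row.
    opened_once = os.path.join(tmp, "opened_once.csv")
    with open(opened_once, "w") as f:
        for row in rows:
            f.write(row + "\n")
    with open(opened_once) as f:
        opened_once_lines = f.read().splitlines()

print(reopened_lines)     # ['c,3']
print(opened_once_lines)  # ['a,1', 'b,2', 'c,3']
```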

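Also, since csv is a really simple format, I'm not sure it is really useful to use a library to manipulate it. Here is an example of how this can be done: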

import os
import re
import csv

from bs4 import BeautifulSoup

path = r'C:/Users/Mx/Testing/Infod'

ext = '.htm'

# Open the output file once, before the loop, so earlier rows are not overwritten
out_file = open('names.csv', 'w')
out_file.write(",".join(['Ref', 'country', 'Start date of Validity', 'End date of validity', 'Nomenclature code', 'Classification justification', 'Language', 'Place of issue', 'Date of issue', 'Name and address', 'Description', 'keywords']) + "\n")  # write the header row

for filename in os.listdir(path):

    if filename.endswith(ext):

        # ...
        # get your data
        # ...

        values = []  # put your values here, as strings, in the same order as the header
        out_file.write(",".join(values) + "\n")  # write your values separated by commas

out_file.close()
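If any value can itself contain a comma or a quote (a goods description, for example), a plain ",".join would split it across columns. As a hedged alternative, here is a minimal sketch that keeps the open-once structure but lets `csv.DictWriter` handle the quoting. Note that `DictWriter.writerow` expects a single dict keyed by the field names, not a list of positional values. The shortened header, the `write_rows` helper, and the sample data are all hypothetical, and an in-memory buffer stands in for the real file.

```python
import csv
import io

FIELDNAMES = ["Ref", "country", "Description"]  # shortened, illustrative header


def write_rows(rows, csvfile):
    """Write the header once, then one line per row dict, to an open file."""
    writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)  # one dict per row, keyed by field name


# Made-up sample data standing in for values extracted from the html files.
sample_rows = [
    {"Ref": "GBBTI000000001", "country": "GB", "Description": "Shoes, leather"},
    {"Ref": "GBBTI000000002", "country": "GB", "Description": 'Toy "robot"'},
]

buffer = io.StringIO()  # stands in for open('names.csv', 'w', newline='')
write_rows(sample_rows, buffer)
output = buffer.getvalue()
print(output)
```

With a real file, the csv module documentation recommends passing `newline=''` to `open` so that line endings are not translated twice.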

【Discussion】: