How to parse multiple tables using BeautifulSoup and save them to a csv file: answers

【Question title】: How to parse multiple tables using BeautifulSoup and save them to a csv file
【Posted】: 2017-05-04 08:24:54
【Question description】:

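I have a program that is almost finished; it is only missing the last part, which I am struggling with. I need to scrape the tables inside the contentHolder div from many web pages and write them to a csv file, to get something like: Case Status, Available status, Case Number, PA/03732/16 Location of development: 40 .... That is, all the tables on one web page, repeated for many web pages. (To see an example, go to http://www.pa.org.mt/page.aspx?n=63C70E73&CaseType=PA, enter 03732 as the case number and 16 as the case year, then click the first Submit.) I wrote some code that tries to do this, but it does not work. When I run it, this is what ends up in the csv file: https://gyazo.com/6557ac08ad5613a24b5432bfd9e4f2e6 It does not even get through all the pages, because partway through it raises an error: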

Traceback (most recent call last):
  File "C:\PROJECT\pdfs\converterpluspa.py", line 93, in <module>
    csv.writer(f).writerow(answer)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 0: ordinal not in range(128)

Here is the full code of my program so far:

import shlex
import subprocess
import os
import platform
from bs4 import BeautifulSoup
import re
import csv
import pickle
import requests
from robobrowser import RoboBrowser
import codecs

def rename_files():
    file_list = os.listdir(r"C:\\PROJECT\\pdfs")
    print(file_list)
    saved_path = os.getcwd()
    print('Current working directory is '+saved_path)
    os.chdir(r'C:\\PROJECT\\pdfs')
    for file_name in file_list:
        os.rename(file_name, file_name.translate(None, " "))
    os.chdir(saved_path)
rename_files()

def run(command):
    if platform.system() != 'Windows':
        args = shlex.split(command)
    else:
        args = command
    s = subprocess.Popen(args,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    output, errors = s.communicate()
    return s.returncode == 0, output, errors

# Change this to your PDF file base directory
base_directory = 'C:\\PROJECT\\pdfs'
if not os.path.isdir(base_directory):
    print "%s is not a directory" % base_directory
    exit(1)
# Change this to your pdf2txt.py script location
bin_path = 'C:\\Python27\\pdfminer-20140328\\tools\\pdf2txt.py'
if not os.path.isfile(bin_path):
    print "Could not find %s" % bin_path
    exit(1)
for dir_path, dir_name_list, file_name_list in os.walk(base_directory):
    for file_name in file_name_list:
        # If this is not a PDF file
        if not file_name.endswith('.pdf'):
            # Skip it
            continue
        file_path = os.path.join(dir_path, file_name)
        # Convert your PDF to HTML here
        args = (bin_path, file_name, file_path)
        success, output, errors = run("python %s -o %s.html %s " %args)
        if not success:
            print "Could not convert %s to HTML" % file_path
            print "%s" % errors
htmls_path = 'C:\\PROJECT'
with open ('score.csv', 'w') as f:
    writer = csv.writer(f)
    for dir_path, dir_name_list, file_name_list in os.walk(htmls_path):
        for file_name in file_name_list:
            if not file_name.endswith('.html'):
                continue
            with open(file_name) as markup:
                soup = BeautifulSoup(markup.read())
                text = soup.get_text()
                match = re.findall("PA/(\S*)", text)#To remove the names that appear, just remove the last (\S*), to add them is just add the (\S*), before it there was a \s*
                print(match)
                writer.writerow(match)
                for item in match:
                    data = item.split('/')
                    case_number = data[0]
                    case_year = data[1]

                browser = RoboBrowser()
                browser.open('http://www.pa.org.mt/page.aspx?n=63C70E73&CaseType=PA')
                form = browser.get_forms()[0]  # Get the first form on the page
                form['ctl00$PageContent$ContentControl$ctl00$txtCaseNo'].value = case_number
                form['ctl00$PageContent$ContentControl$ctl00$txtCaseYear'].value = case_year

                browser.submit_form(form, submit=form['ctl00$PageContent$ContentControl$ctl00$btnSubmit'])

                # Use BeautifulSoup to parse this data
                answer = browser.response.text
                print(answer)
                soup = BeautifulSoup(answer)
                #print soup.prettify()
                status = soup.select('#Table1')
                print (status)
                with codecs.open('file_output.csv', 'a', encoding='utf-8') as f:
                    for tag in soup.select("#Table1"):
                        csv.writer(f).writerow(answer)

Edit: I tried changing the last line to csv.writer(f).writerow(answer.encode("utf-8")), but that did not work; it printed a different error message:

Traceback (most recent call last):
  File "C:\PROJECT\pdfs\converterpluspa.py", line 93, in <module>
    csv.writer(f).writerow(answer.encode("utf-8"))
  File "C:\Python27\lib\codecs.py", line 706, in write
    return self.writer.write(data)
  File "C:\Python27\lib\codecs.py", line 369, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25496: ordinal not in range(128)

The resulting csv file did not change at all.

【Question discussion】:

Tags: python html csv beautifulsoup

【Solution 1】:

You need to encode the output as UTF-8. Change the last line to:

csv.writer(f, encoding="utf-8").writerow(answer.encode("utf-8"))

Also change the import from import csv to import unicodecsv as csv.
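
Separately from the encoding error: answer holds the whole page as one string, and writerow treats whatever it receives as a sequence of fields, so a string ends up written one character per cell. A minimal sketch of pulling the cell text out of each table first is shown here. The function name table_rows, the default selector, and the sample HTML are assumptions made for illustration; the real page's markup may differ, and nested tables would produce duplicate rows with this approach.

```python
from bs4 import BeautifulSoup


def table_rows(html, selector="table"):
    """Return one list of cell strings per <tr> of every table matching selector."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.select(selector):
        for tr in table.find_all("tr"):
            # Collect header and data cells, collapsing surrounding whitespace.
            cells = [cell.get_text(" ", strip=True)
                     for cell in tr.find_all(["th", "td"])]
            if cells:  # skip rows that contain no cells
                rows.append(cells)
    return rows


# Hypothetical stand-in for the HTML returned after submitting the form.
sample_html = """
<div id="contentHolder">
  <table id="Table1">
    <tr><td>Case Status</td><td>Available</td></tr>
    <tr><td>Case Number</td><td>PA/03732/16</td></tr>
  </table>
  <table>
    <tr><th>Location of development</th><td>40</td></tr>
  </table>
</div>
"""

rows = table_rows(sample_html, "#contentHolder table")
```

Each element of rows is then a list that can be handed to writerow as a single csv line, instead of the raw page text.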

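【Discussion】:

I tried it, but it did not work. I have updated my question so you can get a better idea of what is going wrong.
I have updated my answer to reflect the error you posted. Please try it now.
Now it says unicodecsv is not a module and that it does not exist. Keep in mind I am using Python 2.7.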
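
For context, unicodecsv is a third-party package rather than part of the standard library, so the import fails until it is installed (for example with pip). If installing it is not an option, a rough standard-library-only sketch follows. It assumes every cell is a unicode text string; the helper name write_rows_utf8 and the sample rows are hypothetical. On Python 2.7 the csv module works with byte strings, so each cell is encoded to UTF-8 and the file is opened in binary mode rather than through codecs.open. On Python 3 the csv module accepts text directly and the encoding is set on the file.

```python
import csv
import io
import sys


def write_rows_utf8(path, rows):
    """Append rows (lists of unicode text) to a UTF-8 encoded csv file."""
    if sys.version_info[0] >= 3:
        # Python 3: csv handles text; the file object does the encoding.
        with io.open(path, "a", encoding="utf-8", newline="") as f:
            csv.writer(f).writerows(rows)
    else:
        # Python 2: csv expects byte strings and a binary-mode file.
        with open(path, "ab") as f:
            writer = csv.writer(f)
            for row in rows:
                writer.writerow([cell.encode("utf-8") for cell in row])


# Hypothetical rows, including the curly quote from the traceback (u'\u201c').
example_rows = [
    [u"Case Number", u"PA/03732/16"],
    [u"Description", u"\u201cquoted text\u201d"],
]
```

Calling write_rows_utf8('file_output.csv', example_rows) would append two lines to the file. The key point is that a list of cells is passed per row, and encoding happens exactly once.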