Scrape table from webpage when in <div> format - using Beautiful Soup

【Question title】: Scrape table from webpage when in <div> format - using Beautiful Soup
【Posted】: 2018-12-15 03:14:55
【Question description】:

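My aim is to scrape 2 tables (in different formats) from a website - https://info.fsc.org/details.php?id=a0240000005sQjGAAU&type=certificate - after using the search bar to iterate through a list of license codes. I haven't fully included the loop yet, but I added it at the top for completeness.

My problem is that the two tables I want, Product Data and Certificate Data, are in two different formats, so I have to scrape them separately. The Product Data is in normal "tr" format on the webpage, so that part was simple and I've managed to extract it to a CSV file. The harder part is extracting the Certificate Data, as it is in "div" form.

I've managed to print the Certificate Data as a list of text using a class_ lookup, but I need to save it in tabular form in a CSV file. As you can see, I've tried several unsuccessful ways of converting it to CSV; any suggestions would be much appreciated, thank you!! Any other general tips for improving my code would also be great, as I am new to web scraping.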

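import requests
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

#namelist = open('example.csv', newline='', delimiter = 'example')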
#for name in namelist:
    #include all of the below

driver = webdriver.Chrome(executable_path="/Users/jamesozden/Downloads/chromedriver")
url = "https://info.fsc.org/certificate.php"
driver.get(url)

search_bar = driver.find_element_by_xpath('//*[@id="code"]')
search_bar.send_keys("FSC-C001777")
search_bar.send_keys(Keys.RETURN)
new_url = driver.current_url

r = requests.get(new_url)
soup = BeautifulSoup(r.content,'lxml')
table = soup.find_all('table')[0] 
df, = pd.read_html(str(table))
certificate = soup.find(class_= 'certificatecl').text
##certificate1 = pd.read_html(str(certificate))

driver.quit()

df.to_csv("Product_Data.csv", index=False)
##certificate1.to_csv("Certificate_Data.csv", index=False)

#print(df[0].to_json(orient='records'))
print(certificate)

Output:

Status
Valid
First Issue Date
2009-04-01
Last Issue Date
2018-02-16
Expiry Date
2019-04-01
Standard
FSC-STD-40-004 V3-0

What I want, but for hundreds/thousands of license codes (I just created this example manually in Excel):

Desired output
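
A minimal sketch of one way to get from the printed text above to a single table row, assuming the text strictly alternates label line, value line, with no missing values. The helper name text_to_record is hypothetical, and an in-memory buffer stands in for the real CSV file.

```python
import csv
import io


def text_to_record(text):
    """Pair up alternating label/value lines into a dict (assumes strict alternation)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Even positions hold labels, odd positions hold the matching values.
    return dict(zip(lines[0::2], lines[1::2]))


# Sample text copied from the output shown in the question.
certificate = """
Status
Valid
First Issue Date
2009-04-01
Last Issue Date
2018-02-16
Expiry Date
2019-04-01
Standard
FSC-STD-40-004 V3-0
"""

record = text_to_record(certificate)

# Write a header line plus one data row; swap the buffer for open(..., newline='')
# to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
csv_text = buffer.getvalue()
print(csv_text)
```

For many license codes, the same dict could be built once per code and written with writerow inside the loop, with writeheader called only once.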

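Edit

While this now works for the certificate data, I would also like to scrape the product data and output it to another .csv file. However, at the moment it just writes 5 copies of the product data for the final license code, which is not what I want.

New code: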

df = pd.read_csv("MS_License_Codes.csv")
codes = df["License Code"]

def get_data_by_code(code):
    data = [
        ('code', code),
        ('submit', 'Search'),
    ]

    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    soup = BeautifulSoup(response.content, 'lxml')

    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text


    return [code, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'Certificate_Data.csv'
#codes = ['C001777', 'C001777', 'C001777', 'C001777']


df3=pd.DataFrame()


with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow((get_data_by_code(code)))
        table = soup.find_all('table')[0] 
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1) 

df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')

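【Question comments】:

Can you add your expected output?
@Selçuk Done! I've added it to my original question.
@Selçuk I wasn't sure whether you wanted my actual output, so I added an example of the format I want, which I just created manually.

Tags: html selenium web-scraping beautifulsoup scrapy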

【Solution 1】:

Here is what you need. No chromedriver. No pandas. Forget about them when it comes to scraping.

import requests
import csv
from bs4 import BeautifulSoup

# This is all you need for your task. Really.
# No chromedriver. Don't use it for scraping. EVER.
# No pandas. Don't use it for writing csv. It's not what pandas was made for.

#Function to parse single data page based on single input code.
def get_data_by_code(code):

    # Parameters to build POST-request. 
    # "type" and "submit" params are static. "code" is your desired code to scrape.
    data = [
        ('type', 'certificate'),
        ('code', code),
        ('submit', 'Search'),
    ]

    # POST-request to gain page data.
    response = requests.post('https://info.fsc.org/certificate.php', data=data)
    # "soup" object to parse html data.
    soup = BeautifulSoup(response.content, 'lxml')

    # "status" variable: text of the DIV sibling that follows the first LABEL tag whose text is "Status".
    status = soup.find_all("label", string="Status")[0].find_next_sibling('div').text
    # Same for issue dates... etc.
    first_issue_date = soup.find_all("label", string="First Issue Date")[0].find_next_sibling('div').text
    last_issue_date = soup.find_all("label", string="Last Issue Date")[0].find_next_sibling('div').text
    expiry_date = soup.find_all("label", string="Expiry Date")[0].find_next_sibling('div').text
    standard = soup.find_all("label", string="Standard")[0].find_next_sibling('div').text

    # Returning found data as list of values.
    return [response.url, status, first_issue_date, last_issue_date, expiry_date, standard]

# Just insert here output filename and codes to parse...
OUTPUT_FILE_NAME = 'output.csv'
codes = ['C001777', 'C001777', 'C001777', 'C001777']

with open(OUTPUT_FILE_NAME, 'w', newline='') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))

        #Writing list of values to file as single row.
        writer.writerow((get_data_by_code(code)))

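Everything here is quite simple. I suggest you spend some time in the "Network" tab of Chrome dev tools to get a better understanding of request forging, which is a must for scraping tasks.

In general, you don't need to run Chrome to click the "Search" button; you need to forge the request that this click generates. The same goes for any form or AJAX call.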
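
As an illustration of what forging the request amounts to, here is a minimal sketch that builds the form submission without sending it, so the method, URL and body can be compared with what the "Network" tab shows. It uses only the standard library rather than requests; the field names are the ones from the answer's code, and the helper name build_search_request is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(code):
    """Build (but do not send) the POST request the search form would submit."""
    # Field names and values are taken from the answer's code above.
    form_fields = [
        ('type', 'certificate'),
        ('code', code),
        ('submit', 'Search'),
    ]
    # A form post is just URL-encoded key=value pairs in the request body.
    body = urlencode(form_fields).encode('ascii')
    return Request(
        'https://info.fsc.org/certificate.php',
        data=body,
        headers={'Content-Type': 'application/x-www-form-urlencoded'},
        method='POST',
    )


request = build_search_request('C001777')
print(request.get_method(), request.full_url)
print(request.data.decode('ascii'))
```

Passing the same list of pairs as data= to requests.post, as the answer does, produces an equivalent URL-encoded body.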

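【Comments】:

Thank you very much, it runs much faster now! One question: would you mind explaining the data part of your function? I tried to understand what you did but couldn't quite work out what 'type', 'certificate', etc. are for. Are they just required parameters for the request function?
Updated my answer with some comments.
I have one more question: I'd also like to scrape the table on the page, which my code does using the table and df variables (see the edited question). While this works for a single page, it gives me an error that I can't seem to resolve.
I've asked another question, as it may be easier to see my problem there. I'd appreciate any help!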

【Solution 2】:

Well... you should improve your skills (:

df3=pd.DataFrame()

with open(OUTPUT_FILE_NAME, 'w') as f:
    writer = csv.writer(f)
    for code in codes:
        print('Getting code# {}'.format(code))
        writer.writerow((get_data_by_code(code)))
        ### HERE'S THE PROBLEM:
        # "soup" variable is declared inside of "get_data_by_code" function.
        # So you can't use it in outer context.
        table = soup.find_all('table')[0] # <--- you should move this line into the
        # definition of the "get_data_by_code" function and return its value somehow...
        df1, = pd.read_html(str(table))
        df3 = df3.append(df1) 

df3.to_csv('Product_Data.csv', index = False, encoding='utf-8')

For example, you can return a dictionary of values from the "get_data_by_code" function:

 def get_data_by_code(code):
 ...
     table = soup.find_all('table')[0]
     return dict(row=row, table=table)
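
A minimal sketch of that idea, run against a made-up HTML snippet instead of the live site so it works offline. The sample markup, its cell values, and the helper names parse_page and table_to_rows are all assumptions for illustration; the real page may be structured differently. It uses the built-in 'html.parser' rather than 'lxml' to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one response page; the real markup may differ.
SAMPLE_HTML = """
<div class="certificatecl">
  <label>Status</label><div>Valid</div>
  <label>Standard</label><div>FSC-STD-40-004 V3-0</div>
</div>
<table>
  <tr><th>Product</th><th>Note</th></tr>
  <tr><td>example product</td><td>example note</td></tr>
</table>
"""


def parse_page(html, code):
    """Return both the certificate row and the first table of one page."""
    soup = BeautifulSoup(html, 'html.parser')

    def value_of(label_text):
        # Text of the DIV that follows the LABEL with the given text.
        label = soup.find('label', string=label_text)
        return label.find_next_sibling('div').get_text(strip=True)

    row = [code, value_of('Status'), value_of('Standard')]
    table = soup.find_all('table')[0]
    return dict(row=row, table=table)


def table_to_rows(table):
    """Flatten a <table> tag into a list of cell-text lists."""
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        for tr in table.find_all('tr')
    ]


certificate_rows = []
product_rows = []
for code in ['C001777', 'C001777']:
    result = parse_page(SAMPLE_HTML, code)  # soup never leaves parse_page
    certificate_rows.append(result['row'])
    # Skip the header row and tag each product row with its license code.
    for cells in table_to_rows(result['table'])[1:]:
        product_rows.append([code] + cells)

print(certificate_rows)
print(product_rows)
```

Because the table is returned per call, each loop iteration works on the page for its own code rather than on whichever soup happened to be parsed last.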

【Comments】: