Scraping tables from Training.gov.au: answers

【Question title】: Scraping tables from Training.gov.au
【Posted】: 2019-06-18 07:13:25
【Question description】:

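I'm trying to automate some of my work. The website in question is training.gov.au, which has tables nested under specific pages, e.g. https://training.gov.au/Training/Details/BSBWHS402. What I really want is to specify which module I want to use (BSBWHS402 in this case), have the script go through specific tables nested on that page, and then have those tables reworked into a .csv or, ideally, into a pre-formatted document.

I've been able to get what I need out of the code by butchering other people's work, but I can't get it to look like the table on the website. I tried pasting it into a .csv and using delimiters, but that didn't work and obviously isn't really automation.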

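import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

# Download the unit page and parse it
website_url = requests.get('https://training.gov.au/Training/Details/BSBWHS402').text
soup = BeautifulSoup(website_url, 'lxml')
tables = soup.find_all('table')
# Note: find() with a bare string searches for a tag of that name, so this returns None
My_table = soup.find('Elements and Performance Criteria')
# Parse every table on the page into a list of DataFrames
df = pd.read_html(str(tables))
# Index 8 holds the elements and performance criteria table (see the output below)
results = (df[8].to_json(orient='records'))
print(results)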

I get the following single line:

[{"0":"ELEMENT","1":"PERFORMANCE CRITERIA"},{"0":"Elements describe the essential outcomes.","1":"Performance criteria describe the performance needed to demonstrate achievement of the element."},{"0":"1 Assist with determining the legal framework for WHS in the workplace","1":"1.1 Access current WHS legislation and related documentation relevant to the organisation\u2019s operations 1.2 Use knowledge of the relationship between WHS Acts, regulations, codes of practice, standards and guidance material to assist with determining legal requirements in the workplace 1.3 Assist with identifying and confirming the duties, rights and obligations of individuals and parties as specified in legislation 1.4 Assist with seeking advice from legal advisers where necessary"},{"0":"2 Assist with providing advice on WHS compliance","1":"2.1 Assist with providing advice to individuals and parties about their legal duties, rights and obligations, and the location of relevant information in WHS legislation 2.2 Assist with providing advice to individuals and parties about the functions and powers of the WHS regulator and how they are exercised, and the objectives and principles underpinning WHS"},{"0":"3 Assist with WHS legislation compliance measures","1":"3.1 Assist with assessing how the workplace complies with relevant WHS legislation 3.2 Assist with determining the WHS training needs of individuals and parties, and with providing training to meet legal and other requirements 3.3 Assist with developing and implementing changes to workplace policies, procedures, processes and systems that will achieve compliance"}]

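I'm not sure exactly how to use this, but I can at least see that it has worked out which column each value belongs in.

I'm very open to criticism and ideas on how to make this better. I'll be building a UI for entering the module name, but that's a problem for future me. Thanks in advance.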

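【Comments】:

So what exactly is wrong with this output? It's in JSON format, with an array of rows inside. Try pasting the string into any JSON viewer, for example jsoneditoronline.org.
I don't necessarily want it in JSON; that's just what I found to work. I don't know how to get from this JSON to a .csv. Also, looking at it more closely, it lumps some of the data together: 1.1, 1.2 and 1.3 all end up in the same cell. On the website, those are separate rows in the table.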
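For anyone who does want to go from that JSON string to a .csv, here is a minimal sketch using only the standard library. The sample data is a shortened stand-in for the output above, and json_records_to_csv is a hypothetical helper name, not part of any library.

```python
import csv
import io
import json

# Shortened stand-in for the JSON string printed by the question's script.
results = json.dumps([
    {"0": "ELEMENT", "1": "PERFORMANCE CRITERIA"},
    {"0": "1 Assist with determining the legal framework for WHS in the workplace",
     "1": "1.1 Access current WHS legislation 1.2 Use knowledge of WHS Acts"},
])


def json_records_to_csv(json_text):
    """Turn a JSON array of {"0": ..., "1": ...} records into CSV text."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        # Order the cells by their numeric column key: "0", "1", ...
        writer.writerow([record[key] for key in sorted(record, key=int)])
    return buffer.getvalue()


csv_text = json_records_to_csv(results)
print(csv_text)
```

This only changes the container format; the criteria 1.1, 1.2 and so on still share one cell, because they were already merged before the JSON was produced.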

Tags: python-3.x csv web-scraping beautifulsoup

【Solution 1】:

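Instead of

df[8].to_json

use

df[8].to_csv

and you'll get what you want.

To preserve the line breaks you will have to use another library, such as lxml, instead of pandas, because pd.read_html normalizes the content (line breaks inside a cell are collapsed). See this issue on the pandas GitHub.

Here is an example with BeautifulSoup: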

import csv

import requests
from bs4 import BeautifulSoup

website_url = requests.get('https://training.gov.au/Training/Details/BSBWHS402').text
soup = BeautifulSoup(website_url, 'lxml')
# The string argument is new in Beautiful Soup 4.4.0.
# In earlier versions it was called text.
table = (soup.find("h2", string="Elements and Performance Criteria")).find_next('table')

# Collect the text of every cell, one list per table row
output_rows = []
for table_row in table.find_all('tr'):
    columns = table_row.find_all('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('output.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)
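
The loop above takes each cell's text as one string, so several criteria in the same cell run together. A minimal sketch of keeping them on separate lines with get_text follows. The markup here is hypothetical: it assumes each criterion sits in its own <p> inside the cell, which may not match the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical cell markup; the real page's HTML may be structured differently.
html = """
<table><tr>
  <td><p>1 Assist with determining the legal framework</p></td>
  <td><p>1.1 Access current WHS legislation</p><p>1.2 Use knowledge of WHS Acts</p></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# get_text(separator, strip=True) strips each text fragment and joins the
# non-empty ones with the separator, so every <p> ends up on its own line.
criteria_cell = cells[1].get_text("\n", strip=True)
print(criteria_cell)
```

The csv module quotes cells that contain line breaks, so a value like this can still be written with csv.writer and will show as a multi-line cell in a spreadsheet.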

【Comments】:

Thanks, one more thing to help me on my way. The row prints verbatim as 1.1 Access current WHS legislation and related documentation relevant to the organisationâ€™s operations 1.2 Use knowledge of the relationship between WHS Acts, regulations, codes of practice, standards and guidance material to assist with determining legal requirements in the workplace 1.3 Assist with identifying and confirming the duties, rights and obligations of individuals and parties as specified in legislation Is there any way to separate 1.1, 1.2 and 1.3? And to fix the garbled apostrophe?
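A minimal sketch for both parts of that comment, using only the standard library. The "â€™" sequence is what a UTF-8 right single quote looks like when its bytes are decoded as Windows-1252, so the sketch reverses that step; whether that is really how the text got garbled here is an assumption. fix_mojibake and split_criteria are hypothetical helper names, and the sample cell is shortened.

```python
import re

cell = ("1.1 Access current WHS legislation and related documentation relevant to "
        "the organisationâ€™s operations 1.2 Use knowledge of the relationship between "
        "WHS Acts, regulations, codes of practice, standards and guidance material "
        "1.3 Assist with identifying and confirming the duties, rights and obligations")


def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mis-decoded in this particular way, leave it alone


def split_criteria(text):
    """Split on whitespace that is followed by a marker such as '1.2 '."""
    parts = re.split(r"\s+(?=\d+\.\d+\s)", text)
    return [part.strip() for part in parts if part.strip()]


criteria = split_criteria(fix_mojibake(cell))
for criterion in criteria:
    print(criterion)
```

The split is a heuristic: any other number of the form N.N inside a sentence (a version number, for instance) would also start a new item.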
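I feel like I should know this, but should html = open("table.html").read() point directly at the site? I tried putting the website address in the quotes instead of "table.html", but it didn't work.
Updated the example to match your application.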
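That's the problem I ran into earlier. The site is built so that the heading "Elements and Performance Criteria" is not the parent of the table. So when I run the script it fails at table = (soup.find('Elements and Performance Criteria')).find_next('table') with AttributeError: 'NoneType' object has no attribute 'find_next'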
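A minimal sketch of why that call returns None and how the lookup in the answer avoids it. The markup is hypothetical (a heading followed by a sibling table); the real page may use a different heading tag or extra whitespace in the heading text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the heading and the table are siblings.
html = """
<div>
  <h2>Elements and Performance Criteria</h2>
  <table><tr><td>ELEMENT</td><td>PERFORMANCE CRITERIA</td></tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A bare string passed to find() is treated as a tag name, so this searches
# for a tag called "Elements and Performance Criteria" and finds nothing.
by_tag_name = soup.find("Elements and Performance Criteria")

# Searching for an <h2> whose text matches works; check for None before chaining.
heading = soup.find("h2", string="Elements and Performance Criteria")
if heading is None:
    raise ValueError("Heading not found; check the tag name and the exact text")

# find_next() walks forward in document order, so the table does not need
# to be a child of the heading.
table = heading.find_next("table")
first_row = [td.get_text(strip=True) for td in table.find_all("td")]
print(first_row)
```

The string match is exact, so if the heading text on the live page carries surrounding whitespace, a compiled regular expression passed as string= is one way to loosen it.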
One last thing: I keep getting the error TypeError: a bytes-like object is required, not 'str'. I've been trying to fix it myself but find I'm going around in circles. Thanks.
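The commenter's code is not shown, so this is only a guess at the cause: in Python 3 one common source of that TypeError is opening the output file in binary mode ('wb') before handing it to csv.writer, which writes str. A minimal sketch with made-up rows and a temporary file path:

```python
import csv
import os
import tempfile

rows = [["ELEMENT", "PERFORMANCE CRITERIA"],
        ["1 Assist with WHS", "1.1 Access current WHS legislation"]]

path = os.path.join(tempfile.mkdtemp(), "output.csv")

# Binary mode: csv.writer tries to write str to a bytes-only file object.
binary_mode_error = None
try:
    with open(path, "wb") as csvfile:
        csv.writer(csvfile).writerows(rows)
except TypeError as error:
    binary_mode_error = str(error)
print("binary mode failed with:", binary_mode_error)

# Text mode with newline="" is what the csv module documentation recommends.
with open(path, "w", newline="", encoding="utf-8") as csvfile:
    csv.writer(csvfile).writerows(rows)

# Read the file back to confirm the rows survived the round trip.
with open(path, newline="", encoding="utf-8") as csvfile:
    round_trip = list(csv.reader(csvfile))
print(round_trip)
```

If the file is already opened in text mode, the error is coming from somewhere else and the full traceback would be needed to locate it.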