Python Beautiful Soup table scrape: answers

【Question title】: Python Beautiful Soup table scrape
【Posted】: 2018-04-22 09:18:37
【Question】:

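I have been trying to scrape an HTML table from an online oil production SSRS feed. I have managed to learn enough Beautiful Soup and Python to get to where I am now, but I think I need some help to finish it.

The aim is to scrape the table, which is fully tagged, and output JSON data. I get JSON-formatted output for the 10 headers, but the same data row cell value is repeated for every header. I think the problem is in how I iterate through the cells to assign them to the headers. I'm sure it will make sense when you run it.

Any help would be greatly appreciated. I'm trying to understand what I've done wrong, as this is all quite new to me.

Cheers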

import json
from bs4 import BeautifulSoup
import urllib.request
import boto3
import botocore

#Url to scrape

url = ('http://factpages.npd.no/ReportServer?/FactPages/TableView/'
       'field_production_monthly&rs:Command=Render&rc:Toolbar='
       'false&rc:Parameters=f&Top100=True&IpAddress=108.171.128.174&'
       'CultureCode=en')


#Agent detail to prevent scraping bot detection 
user_agent = ('Mozilla/5(Macintosh; Intel Mac OS X 10_9_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 '
              'Safari/537.36')

header = {'User-Agent': user_agent}

#Request url from list above, assign headers from criteria above
req = urllib.request.Request(url, headers = header)

#Open url from the previous request and assign
npddata = urllib.request.urlopen(req, timeout = 20)

#Start soup on url request data
soup = BeautifulSoup(npddata, 'html.parser')

# Scrape the html table variable from selected website 
table = soup.find('table')


headers = {}

col_headers = soup.findAll('tr')[3].findAll('td')

for i in range(len(col_headers)):
    headers[i] = col_headers[i].text.strip()

# print(json.dumps(headers, indent = 4))


cells = {}

rows = soup.findAll('td', {
    'class': ['a61cl', 'a65cr', 'a69cr', 'a73cr', 'a77cr', 'a81cr', 'a85cr', 
    'a89cr', 'a93cr', 'a97cr']})

for row in rows[i]: #remove index!(###ISSUE COULD BE HERE####)

# findall function was original try (replace getText with FindAll to try)

    cells = row.getText('div')


# Attempt to fix, can remove and go back to above
#for i in range(len(rows)): #cells[i] = rows[i].text.strip()


#print(cells)# print(json.dumps(cells, indent = 4))


data = []

item = {}

for index in headers:
    item[headers[index]] = cells#[index]

# if no getText on line 47 then.text() here### ISSUE COULD BE HERE####

data.append(item)


#print(data)
print(json.dumps(data, indent = 4))
# print(item)# 
print(json.dumps(item, indent = 4))
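The symptom described above (every header showing the same cell value) can be reproduced without the network. The sketch below is a simplified illustration, not the original script: `headers` and `cell_texts` are made-up stand-ins for the scraped values. Because `cells` is reassigned on every pass of the loop, it ends up holding only the last cell's text, and that single string is then assigned to every header.

```python
# Hypothetical stand-ins for the scraped header and cell text.
headers = {0: "field", 1: "year", 2: "month"}
cell_texts = ["alva", "2010", "3"]

# As in the question: `cells` is overwritten on every pass,
# so after the loop it holds only the last cell's text.
cells = {}
for text in cell_texts:
    cells = text

# Every header is mapped to that same single value.
broken = {headers[i]: cells for i in headers}

# Pairing each header with the cell at the same position avoids this.
fixed = {headers[i]: cell_texts[i] for i in headers}

print(broken)
print(fixed)
```

Keeping the cell texts in a list and indexing it by position, as in `fixed`, is one way to preserve the header-to-cell pairing.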

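【Question comments】:

Indentation matters in Python; please make sure your code sample is indented correctly.

Tags: python json beautifulsoup

【Solution 1】:

There were a few mistakes in your code. I fixed them and modified your code slightly:

Is this what you want: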

import requests
from bs4 import BeautifulSoup
import json

# Webpage connection
html = "http://factpages.npd.no/ReportServer?/FactPages/TableView/field_production_monthly&rs:Command=Render&rc:Toolbar=false&rc:Parameters=f&Top100=True&IpAddress=108.171.128.174&CultureCode=en"
r=requests.get(html)
c=r.content
soup=BeautifulSoup(c,"html.parser")


rows = soup.findAll('td', {
    'class': ['a61cl', 'a65cr', 'a69cr', 'a73cr', 'a77cr', 'a81cr', 'a85cr',
    'a89cr', 'a93cr', 'a97cr']})

headers = soup.findAll('td', {
    'class': ['a20c','a24c', 'a28c', 'a32c', 'a36c', 'a40c', 'a44c', 'a48c',
    'a52c']})

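# Text of each header cell and each data cell
# (get_text(strip=True) returns the text without surrounding whitespace)
headers_list = [item.get_text(strip=True) for item in headers]

rows_list = [item.get_text(strip=True) for item in rows]

# Split the flat list of cells into chunks of 9 (the number of header classes above)
final = [rows_list[item:item+9] for item in range(0, len(rows_list), 9)]

# Collect the values column by column, keyed by header text
row_header = {}
for item in final:
    for indices in range(0, 9):
        if headers_list[indices] not in row_header:
            row_header[headers_list[indices]] = [item[indices]]
        else:
            row_header[headers_list[indices]].append(item[indices])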



result=json.dumps(row_header,indent=4)
print(result)

Sample output:

{
    "Year": [
        "2009",
        "2009",
        "2009",
        "2009",
        "2009",
        "2009",
        "2010",
        "2010",
        "2010",
        "2010",
        "2010",

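The answer works by cutting the flat list of cell texts into fixed-width rows and then gathering the values column by column. A minimal offline sketch of that idea follows; the sample values and the helper name `chunk` are invented for illustration, and it assumes every row has exactly as many cells as there are headers.

```python
def chunk(values, width):
    """Split a flat list into consecutive rows of `width` items."""
    return [values[i:i + width] for i in range(0, len(values), width)]

# Invented sample data standing in for the scraped text.
headers_list = ["field", "year", "month"]
rows_list = ["alva", "2009", "3", "alva", "2009", "4"]

columns = {}
for row in chunk(rows_list, len(headers_list)):
    for header, value in zip(headers_list, row):
        # setdefault replaces the explicit "if header not in dict" branch.
        columns.setdefault(header, []).append(value)

print(columns)
```

Taking the row width from `len(headers_list)` rather than a hard-coded number keeps the two in step if the set of headers changes.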
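【Comments】:

Awesome, thanks very much @ayodhyankit Paul. I'll look at it later to see what you fixed so I can learn from it.
Is there an easy way to produce one record per row? So each row becomes a JSON record containing the 10 headers, and the next record contains the headers again with the second row of scraped data?
@Chris You want the headers for every row. Can you give me an example?
Something like this if possible, not sure whether it's an easy change: [{ "field" : alva, "year" : "2010", "month" : "3", "Oil" : "3233.", }, { "field" : alva, "year" : "2010", "month" : "3", "Oil" : "4556.", }]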
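One possible way to get a record per row is to zip the header names with each row's cells. This is a sketch under assumptions, not the answerer's code: the header names and values below are copied from the example in the comment, and it assumes every row has exactly as many cells as there are headers.

```python
import json

# Sample values taken from the example in the comment above;
# with real data these would be the scraped header and cell texts.
headers_list = ["field", "year", "month", "Oil"]
rows_list = ["alva", "2010", "3", "3233.",
             "alva", "2010", "3", "4556."]

width = len(headers_list)

# One dict per table row: header name -> cell text.
records = [
    dict(zip(headers_list, rows_list[start:start + width]))
    for start in range(0, len(rows_list), width)
]

print(json.dumps(records, indent=4))
```

Note that `zip` stops at the shorter of its inputs, so a trailing row with too few cells would yield a record with missing keys rather than raise an error.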