【Posted】: 2018-04-22 09:18:37
【Question】:
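I have been trying to scrape an HTML table from an online oil production SSRS feed. I have managed to learn enough Beautiful Soup/Python to get to where I am now, but I think I need some help to finish it off.
The aim is to scrape all of the tagged table cells and output JSON data. I do get JSON-formatted output with the 10 headers, but every header repeats the same data row cell value. I think the problem is in how I iterate over the cells to assign them to the headers. I'm sure it will make sense when you run it.
Any help would be greatly appreciated. I'm trying to understand what I'm doing wrong, as this is all quite new to me.
Cheers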
import json
from bs4 import BeautifulSoup
import urllib.request
import boto3
import botocore
#Url to scrape
url = ('http://factpages.npd.no/ReportServer?/FactPages/TableView/'
       'field_production_monthly&rs:Command=Render&rc:Toolbar='
       'false&rc:Parameters=f&Top100=True&IpAddress=108.171.128.174&'
       'CultureCode=en')
#Agent detail to prevent scraping bot detection
user_agent = ('Mozilla/5(Macintosh; Intel Mac OS X 10_9_3) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 '
              'Safari/537.36')
header = {'User-Agent': user_agent}
#Request url from list above, assign headers from criteria above
req = urllib.request.Request(url, headers = header)
#Open url from the previous request and assign
npddata = urllib.request.urlopen(req, timeout = 20)
#Start soup on url request data
soup = BeautifulSoup(npddata, 'html.parser')
# Scrape the html table variable from selected website
table = soup.find('table')
headers = {}
col_headers = soup.findAll('tr')[3].findAll('td')
for i in range(len(col_headers)):
    headers[i] = col_headers[i].text.strip()
# print(json.dumps(headers, indent = 4))
cells = {}
rows = soup.findAll('td', {
    'class': ['a61cl', 'a65cr', 'a69cr', 'a73cr', 'a77cr', 'a81cr', 'a85cr',
              'a89cr', 'a93cr', 'a97cr']})
for row in rows[i]: #remove index!(###ISSUE COULD BE HERE####)
    # findall function was original try (replace getText with FindAll to try)
    cells = row.getText('div')
# Attempt to fix, can remove and go back to above
#for i in range(len(rows)):
#    cells[i] = rows[i].text.strip()
#print(cells)# print(json.dumps(cells, indent = 4))
data = []
item = {}
for index in headers:
    item[headers[index]] = cells#[index]
    # if no getText on line 47 then.text() here### ISSUE COULD BE HERE####
data.append(item)
#print(data)
print(json.dumps(data, indent = 4))
# print(item)#
print(json.dumps(item, indent = 4))
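For what it's worth, the repeated values look consistent with `cells` ending up as a single string that is then assigned to every header, with only one `item` ever being built. A minimal sketch of one way to pair cells with headers is below. It is not a tested fix for this particular report: it assumes the matched `td` cells come back in document order, row by row, with as many cells per row as there are headers. The function name `cells_to_records` and the sample values are made up for illustration. With the code above, the inputs would presumably be something like `list(headers.values())` and `[td.get_text(strip=True) for td in rows]`.

```python
import json


def cells_to_records(headers, cell_texts):
    """Group a flat list of cell texts into one dict per table row.

    Assumes cell_texts is ordered row by row and that every row has
    len(headers) cells; an incomplete trailing row is dropped.
    """
    width = len(headers)
    if width == 0:
        return []
    records = []
    # Step through the flat list one row (width cells) at a time.
    for start in range(0, len(cell_texts) - width + 1, width):
        row = cell_texts[start:start + width]
        # zip pairs each header with the cell in the same column,
        # and a new dict is built for every row.
        records.append(dict(zip(headers, row)))
    return records


# Made-up sample values standing in for the scraped headers and cells.
sample_headers = ["Field", "Year", "Month"]
sample_cells = ["FIELD A", "2018", "1", "FIELD B", "2018", "2"]

records = cells_to_records(sample_headers, sample_cells)
print(json.dumps(records, indent=4))
```

Building a fresh dict per row matters here: reusing one `item` dict and appending it repeatedly would leave every entry in `data` pointing at the same object.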
【Comments】:
-
Indentation matters in Python; please make sure your code sample is indented correctly.
Tags: python json beautifulsoup