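【Posted】: 2019-02-14 20:22:40
【Question】:
I am building my first web scraper with Python and BS4. I want to analyze the timing data from the 2018 KONA Ironman World Championship. What is the best way to convert the JSON to CSV?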
from bs4 import BeautifulSoup, Comment
from collections import defaultdict
import json
import requests

sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'

r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')


def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')

    for node in my_table.children:
        if isinstance(node, Comment):
            # Get content and strip comment "<!--" and "-->"
            # Wrap the rows in "table" tags as well.
            data = '<table>{}</table>'.format(node[4:-3])
            break

    table = BeautifulSoup(data, 'html.parser')

    for row in table.find_all('tr'):
        # Skip the first cell; the remaining eight cells are unpacked in order.
        (name, _, swim, bike, run, div_rank, gender_rank,
         overall_rank) = [col.text.strip() for col in row.find_all('td')[1:]]

        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
        })

    return result


with open('data.json', 'w') as jsonfile:
    json.dump(parse_table(soup), jsonfile)

print(json.dumps(parse_table(soup), indent=3))
The JSON output contains each athlete's name, followed by their division, gender, and overall ranks and their swim, bike, and run times:
{
   "Avila, Anthony 2470": [
      {
         "div_rank": "138", "gender_rank": "1243", "overall_rank": "1565", "swim": "01:20:11", "bike": "05:27:59", "run": "04:31:56"
      }
   ],
   "Lindgren, Mikael 1050": [
      {
         "div_rank": "151", "gender_rank": "872", "overall_rank": "983", "swim": "01:09:06", "bike": "05:17:51", "run": "03:49:20"
      }
   ],
   "Umezawa, Kazuyoshi 1870": [
      {
         "div_rank": "229", "gender_rank": "1589", "overall_rank": "2186", "swim": "01:17:22", "bike": "06:14:45", "run": "07:16:21"
      }
   ],
   "Maric, Bojan 917": [
      {
         "div_rank": "162", "gender_rank": "923", "overall_rank": "1065", "swim": "01:03:22", "bike": "05:13:56", "run": "04:01:45"
      }
   ],
   "Nishioka, Maki 2340": [
      {
         "div_rank": "6", "gender_rank": "52", "overall_rank": "700", "swim": "00:58:40", "bike": "05:19:10", "run": "03:33:58"
      }...
}
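For reference, one possible way to turn a structure like the one above into CSV is to flatten it into one row per athlete entry and hand the rows to the standard library's csv.DictWriter. This is only a sketch, not a definitive answer: the names FIELDS, flatten_results, and write_csv are made up for illustration, and it assumes every entry has exactly the six keys shown in the output above.

```python
import csv
import io

# Column order for the CSV; "name" is the dictionary key from the JSON output.
FIELDS = ['name', 'div_rank', 'gender_rank', 'overall_rank',
          'swim', 'bike', 'run']


def flatten_results(results):
    """Yield one flat dict per (athlete, entry) pair."""
    for name, entries in results.items():
        for entry in entries:
            row = {'name': name}
            row.update(entry)
            yield row


def write_csv(results, fileobj):
    """Write the flattened rows, with a header line, to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(flatten_results(results))


# Small sample copied from the JSON output above.
sample = {
    "Avila, Anthony 2470": [{
        "div_rank": "138", "gender_rank": "1243", "overall_rank": "1565",
        "swim": "01:20:11", "bike": "05:27:59", "run": "04:31:56",
    }],
}

buffer = io.StringIO()
write_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead of an in-memory buffer, the same function could be called as write_csv(json.load(open('data.json')), open('data.csv', 'w', newline='')). Because the name key contains a comma, the csv module quotes that field automatically.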
【Discussion】:
Tags: json python-3.x csv web-scraping beautifulsoup