[Posted]: 2016-02-15 19:31:44
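[Question]:
The code below worked fine until I added the "trainer" field to the scrape. That field is in the second part of the second sibling in the HTML, which represents line 2; the other fields come from line 1 in the source. I was getting the 189 rows I need, but once I include the code that extracts the trainer I only get the last dog in each race (the other 5 dogs are left out), which is just 18 rows. For some reason BS does not work properly with the loops. Including the trainer field is breaking rows.append. Here is the URL: http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754 and here is the code: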
import csv
from bs4 import BeautifulSoup
import requests

html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754").text
soup = BeautifulSoup(html, 'lxml')
rows = []
for header in soup.find_all("div", class_="resultsBlockHeader"):
    track = header.find("div", class_="track").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    date = header.find("div", class_="date").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    datetime = header.find("div", class_="datetime").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    grade = header.find("div", class_="grade").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    distance = header.find("div", class_="distance").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    prizes = header.find("div", class_="prizes").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    results = header.find_next_sibling("div", class_="resultsBlock").find_all("ul", class_="line1")
    for result in results:
        fin = result.find("li", class_="fin").get_text(strip=True)
        greyhound = result.find("li", class_="greyhound").get_text(strip=True)
        trap = result.find("li", class_="trap").get_text(strip=True)
        sp = result.find("li", class_="sp").get_text(strip=True)
        timeSec = result.find("li", class_="timeSec").get_text(strip=True)
        timeDistance = result.find("li", class_="timeDistance").get_text(strip=True)
    results = header.find_next_sibling("div", class_="resultsBlock").find_all("ul", class_="line2")
    for result in results:
        trainer = result.find("li", class_="trainer").get_text(strip=True)
    rows.append({
        "track": track,
        "date": date,
        "greyhound": greyhound,
        "datetime": datetime,
        "sp": sp,
        "grade": grade,
        "distance": distance,
        "prizes": prizes,
        "timeSec": timeSec,
        "timeDistance": timeDistance,
        "trap": trap,
        "fin": fin,
        "trainer": trainer
    })
with open("greyfile.csv", "w") as f:
    writer = csv.DictWriter(f, ["track","date","trap","fin","greyhound","datetime","sp","grade","distance","prizes","timeSec","timeDistance","trainer"])
    for row in rows:
        writer.writerow(row)
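The symptom described (one row per race, always the last dog) is what Python produces when values are assigned inside an inner loop but the append runs only after that loop has finished. The following is a minimal sketch with made-up data, assuming the `rows.append` call in the code above sits after the two inner loops; the names `races`, `rows_buggy` and `rows_fixed` are hypothetical and do not come from the page.

```python
# Hypothetical data standing in for two races with three dogs each.
races = {
    "race 1": ["dog A", "dog B", "dog C"],
    "race 2": ["dog D", "dog E", "dog F"],
}

rows_buggy = []
for race, dogs in races.items():
    for greyhound in dogs:
        pass  # greyhound is overwritten on every iteration
    # Appending after the inner loop only ever sees the last greyhound.
    rows_buggy.append({"race": race, "greyhound": greyhound})

rows_fixed = []
for race, dogs in races.items():
    for greyhound in dogs:
        # Appending inside the inner loop records every greyhound.
        rows_fixed.append({"race": race, "greyhound": greyhound})

print(len(rows_buggy), len(rows_fixed))
```

If this reading is right, the number of rows written equals the number of races rather than the number of dogs, which would match the small row count reported.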
[Comments]:
-
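I copied your code and it ran without problems, putting everything into the CSV file, including the trainer. (Python 2.7, bs4 with lxml, Ubuntu 15.10) Example first row:
Sheffield,02/02/16,4,6,Unique Boycie,18:39,3/1,A4,500m,"1st 100, Others 30 Race Total 250",4.34,30.22 (1 3/4),(Trainer:J D Davy)
What error are you getting?
-
Hi CasualDemon. How many rows did you get in the CSV file? There should be 186 rows covering all the races at the meeting, but I only get 16. It only includes the details for the last dog in each race (including the trainer).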
-
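Ah, I see, I only got 16 rows as well. Thanks for clarifying; I thought an exception or some other error was being raised. I'll look into it further.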
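One possible way to get a row per dog, sketched here rather than confirmed against the live page, is to walk the `line1` and `line2` lists together with `zip` and append inside that single loop. The sketch assumes every `ul.line1` has exactly one matching `ul.line2` in the same order. `SAMPLE_HTML` and `extract_rows` are hypothetical: the markup is a simplified stand-in that reuses the class names from the question, and only a few of the fields are shown. It uses the stdlib `html.parser` backend so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real page has more fields
# and may be structured differently.
SAMPLE_HTML = """
<div class="resultsBlockHeader"><div class="track">Sheffield</div></div>
<div class="resultsBlock">
  <ul class="line1"><li class="fin">1</li><li class="greyhound">Dog A</li></ul>
  <ul class="line2"><li class="trainer">(Trainer: T One)</li></ul>
  <ul class="line1"><li class="fin">2</li><li class="greyhound">Dog B</li></ul>
  <ul class="line2"><li class="trainer">(Trainer: T Two)</li></ul>
</div>
<div class="resultsBlockHeader"><div class="track">Sheffield</div></div>
<div class="resultsBlock">
  <ul class="line1"><li class="fin">1</li><li class="greyhound">Dog C</li></ul>
  <ul class="line2"><li class="trainer">(Trainer: T Three)</li></ul>
</div>
"""


def extract_rows(html):
    """Return one dict per dog, pairing each line1 with its line2."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for header in soup.find_all("div", class_="resultsBlockHeader"):
        track = header.find("div", class_="track").get_text(strip=True)
        block = header.find_next_sibling("div", class_="resultsBlock")
        line1s = block.find_all("ul", class_="line1")
        line2s = block.find_all("ul", class_="line2")
        # Assumption: line1 and line2 appear in matching order, one pair per dog.
        for line1, line2 in zip(line1s, line2s):
            rows.append({
                "track": track,
                "fin": line1.find("li", class_="fin").get_text(strip=True),
                "greyhound": line1.find("li", class_="greyhound").get_text(strip=True),
                "trainer": line2.find("li", class_="trainer").get_text(strip=True),
            })
    return rows


rows = extract_rows(SAMPLE_HTML)
print(len(rows))
```

Because the append happens once per `(line1, line2)` pair, each dog keeps its own trainer. If the real page ever omits a `line2` for a dog, `zip` would silently misalign or drop rows, so comparing `len(line1s)` with `len(line2s)` before pairing may be worthwhile.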
Tags: python html loops web-scraping beautifulsoup