【Question title】: rows.append not working with loops
【Posted】: 2016-02-15 19:31:44
【Question】:

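The code below worked until I added the "trainer" field to the scrape. This field sits in the second part of the second sibling in the HTML, i.e. Line2. The other fields come from line 1 in the source. I was getting the 189 rows I needed, but once I include the code that extracts the trainer, I only get the last dog of each race (the other 5 dogs are all left out). That is only 18 rows. For some reason BS is not working with the loops. Including the trainer field is breaking rows.append. Here is the URL: http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754 and here is the code: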

import csv
from bs4 import BeautifulSoup
import requests


html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754").text
soup = BeautifulSoup(html,'lxml')

rows = []
for header in soup.find_all("div", class_="resultsBlockHeader"):
    track = header.find("div",    class_="track").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    date = header.find("div",   class_="date").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    datetime = header.find("div", class_="datetime").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    grade = header.find("div", class_="grade").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    distance = header.find("div", class_="distance").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    prizes = header.find("div", class_="prizes").get_text(strip=True).encode('ascii', 'ignore').strip("|")

    results = header.find_next_sibling("div",  class_="resultsBlock").find_all("ul", class_="line1")
    for result in results:
        fin = result.find("li", class_="fin").get_text(strip=True)
        greyhound = result.find("li", class_="greyhound").get_text(strip=True)
        trap = result.find("li", class_="trap").get_text(strip=True)
        sp = result.find("li", class_="sp").get_text(strip=True)
        timeSec = result.find("li", class_="timeSec").get_text(strip=True)
        timeDistance = result.find("li", class_="timeDistance").get_text(strip=True)


    results = header.find_next_sibling("div",  class_="resultsBlock").find_all("ul", class_="line2")
    for result in results:
         trainer = result.find("li",  class_="trainer").get_text(strip=True)



    rows.append({
            "track": track,
            "date": date,
            "greyhound": greyhound,
            "datetime":datetime,
            "sp" :sp,
            "grade":grade,
            "distance":distance,
            "prizes":prizes,
            "timeSec":timeSec,
            "timeDistance":timeDistance,
            "trap":trap,
            "fin":fin,
            "trainer":trainer

        })




with open("greyfile.csv", "w") as f:
    writer = csv.DictWriter(f,      ["track","date","trap","fin","greyhound","datetime","sp","grade","distance","prizes","timeSec","timeDistance","trainer"])

    for row in rows:
      writer.writerow(row)

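【Comments】:

  • I copied your code and it runs without problems, putting everything into the CSV file, including the trainer. (Python 2.7, bs4 with lxml, Ubuntu 15.10) Example first row: Sheffield,02/02/16,4,6,Unique Boycie,18:39,3/1,A4,500m,"1st 100, Others 30 Race Total 250",4.34,30.22 (1 3/4),(Trainer:J D Davy) What error are you getting?
  • Hi CasualDemon. How many rows do you get in the CSV file? There should be 186 rows covering every race at the meeting, but I only get 16. It only includes the details of the last dog in each race (including the trainer).
  • Ah, I see, I only get 16 rows as well. Thanks for clarifying; I assumed an exception or some other error was being raised. I'll look into it further.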

Tags: python html loops web-scraping beautifulsoup


【Solution 1】:

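My best guess is that you previously had rows.append inside the second for loop, where it ran once per dog. Now that it sits after both inner loops, it runs only once per race, and by then the variables hold the values of the last dog. The script in this answer restores the one-row-per-dog behavior while using both loops: the first loop collects the line1 fields for each dog, the second adds the trainer from line2 by index, and every dog is then appended to rows.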
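To make the cause concrete, here is a minimal sketch that needs no network access or BeautifulSoup. The `races` dictionary and the dog names are made-up stand-ins for the parsed page, not data from the site. It contrasts appending after the inner loop (the pattern in the question) with appending inside it.

```python
# Hypothetical stand-in data: two races, three dogs each (names are made up).
races = {
    "race 1": ["Dog A", "Dog B", "Dog C"],
    "race 2": ["Dog D", "Dog E", "Dog F"],
}

# Pattern from the question: append runs after the inner loop has finished.
rows_outside = []
for race, dogs in races.items():
    for dog in dogs:
        greyhound = dog  # overwritten on every pass, so only the last dog survives
    rows_outside.append({"race": race, "greyhound": greyhound})

# Append inside the inner loop: one row per dog.
rows_inside = []
for race, dogs in races.items():
    for dog in dogs:
        rows_inside.append({"race": race, "greyhound": dog})

print(len(rows_outside))  # one row per race
print(len(rows_inside))   # one row per dog
```

With two races of three dogs, the first pattern yields two rows (the last dog of each race) and the second yields six, which matches the symptom described in the question.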

import csv
from bs4 import BeautifulSoup
import requests


html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754").text
soup = BeautifulSoup(html,'lxml')

rows = []
for header in soup.find_all("div", class_="resultsBlockHeader"):
    track = header.find("div",    class_="track").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    date = header.find("div",   class_="date").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    datetime = header.find("div", class_="datetime").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    grade = header.find("div", class_="grade").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    distance = header.find("div", class_="distance").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    prizes = header.find("div", class_="prizes").get_text(strip=True).encode('ascii', 'ignore').strip("|")

    results = header.find_next_sibling("div",  class_="resultsBlock").find_all("ul", class_="line1")
    details = []
    for result in results:
        fin = result.find("li", class_="fin").get_text(strip=True)
        greyhound = result.find("li", class_="greyhound").get_text(strip=True)
        trap = result.find("li", class_="trap").get_text(strip=True)
        sp = result.find("li", class_="sp").get_text(strip=True)
        timeSec = result.find("li", class_="timeSec").get_text(strip=True)
        timeDistance = result.find("li", class_="timeDistance").get_text(strip=True)
        details.append({"greyhound": greyhound, "sp": sp, "fin": fin, "timeSec": timeSec, "timeDistance": timeDistance, "trap": trap, })

    results = header.find_next_sibling("div",  class_="resultsBlock").find_all("ul", class_="line2")
    for index, result in enumerate(results):
        trainer = result.find("li",  class_="trainer").get_text(strip=True)
        details[index]["trainer"] = trainer

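    # Add the race-level fields to every dog's row, then collect the row.
    # "distance" is included so that the matching CSV column is not left empty.
    for detail in details:
        detail.update({"track": track, "date": date, "datetime": datetime, "grade": grade, "distance": distance, "prizes": prizes})
        rows.append(detail)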

with open("greyfile.csv", "w") as f:
    writer = csv.DictWriter(f,      ["track","date","trap","fin","greyhound","datetime","sp","grade","distance","prizes","timeSec","timeDistance","trainer"])

    for row in rows:
      writer.writerow(row)
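
An alternative to indexing into `details` is to walk the line1 and line2 items in lockstep with `zip`. The sketch below is only an illustration under stated assumptions: the plain dictionaries in `line1_items` and `line2_items` are hypothetical stand-ins for the fields extracted from the two `find_all` calls, and it assumes both lists have the same length and order. It writes to an in-memory buffer and adds a header row, which the script above does not do.

```python
import csv
import io

# Hypothetical stand-ins for one race: header fields plus the values that
# would be extracted from the "line1" and "line2" <ul> elements.
header = {"track": "Sheffield", "date": "02/02/16"}
line1_items = [
    {"fin": "1", "greyhound": "Dog A", "trap": "4"},
    {"fin": "2", "greyhound": "Dog B", "trap": "6"},
]
line2_items = [
    {"trainer": "Trainer X"},
    {"trainer": "Trainer Y"},
]

rows = []
# zip pairs the n-th line1 entry with the n-th line2 entry.
# Note: zip stops at the shorter list, so a missing line2 silently drops a dog.
for line1, line2 in zip(line1_items, line2_items):
    row = dict(header)   # copy the race-level fields
    row.update(line1)    # add the per-dog fields from line1
    row.update(line2)    # add the trainer from line2
    rows.append(row)     # append once per dog, inside the loop

fieldnames = ["track", "date", "trap", "fin", "greyhound", "trainer"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file on Python 3, opening it with `open("greyfile.csv", "w", newline="")` avoids blank lines between rows on Windows.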

