Scraping data from a certain date onwards: answers

【Question title】: scrape data from a date onwards
【Posted】: 2021-07-07 07:41:09
【Question description】:

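I only want to scrape data from a table after a certain date. The code below gets the first date in the data (URL attached), but how would I create a for loop that extracts data only from the rows dated 11 October 2020 and all rows before it?

I want to create a for loop that extracts all data before a certain date from this table, 'table table-hover small horsePerformance':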

http://www.harness.org.au/racing/horse-search/?horseId=813476


with requests.Session() as s:
   try:
       webpage_response = s.get(horseurl, headers=headers)
   except requests.exceptions.ConnectionError:
        raise SystemExit("Connection refused")  # `r` was never defined; stop here, there is no response to parse
                            
   soup = bs(webpage_response.content, "html.parser")
   horseresult6 = soup.find('table', class_='table table-hover small horsePerformance')
   daysbetween = horseresult6.find('td', class_='date').get_text().strip()
   daysbetween24 = horseresult6.find('td', class_='date').find_next('td', class_='date').get_text().strip()

I think it should look something like this, though:

for tr in horseresult6.find_all('tr')[1:]: 
     daysbetween = tr.find('td', class_='date').get_text().strip()
     if xdate > daysbetween:
         do something
     else:
         continue

It doesn't seem to work when I try this.

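【Question comments】:

soup.find returns only the first tag that matches your arguments. Use soup.findAll (spelled find_all in current Beautiful Soup) and it will give you a list of tag objects. Then iterate over that list with a for loop and check the date in each tag.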
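
To illustrate the comment above, here is a minimal sketch of the difference between find and find_all. The HTML snippet is a made-up two-row stand-in for the real table, so the markup is an assumption rather than a copy of the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, shaped like the one in the question.
html = """
<table class="table table-hover small horsePerformance">
  <tr><th>Date</th></tr>
  <tr><td class="date"><a href="#">05 Apr 2021</a></td></tr>
  <tr><td class="date"><a href="#">29 Mar 2021</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() stops at the first matching tag.
first_date = soup.find("td", class_="date").get_text(strip=True)

# find_all() returns every matching tag, so it can be looped over.
all_dates = [td.get_text(strip=True) for td in soup.find_all("td", class_="date")]

print(first_date)
print(all_dates)
```

With find, only "05 Apr 2021" is reachable; find_all exposes both cells to the loop.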

Tags: python beautifulsoup

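【Solution 1】:

You can compare dates with the < and > operators, provided the date strings are parsed into date values first.

Here's how: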

import time

import requests
from bs4 import BeautifulSoup

horse_url = "http://www.harness.org.au/racing/horse-search/?horseId=813476"

with requests.Session() as s:
    try:
        webpage_response = s.get(horse_url)
    except requests.exceptions.ConnectionError:
        raise SystemExit("Connection refused")  # webpage_response is unbound if the request failed

    table = BeautifulSoup(
        webpage_response.content,
        "html.parser",
    ).find('table', class_='table table-hover small horsePerformance')

    target_date = "11 Oct 2020"

    for row in table.find_all("tr")[1:]:  # skipping the header
        date = row.find("td", class_="date").find("a").getText()  # table date
        if time.strptime(date, "%d %b %Y") >= time.strptime(target_date, "%d %b %Y"):  # comparing the dates
            # do your parsing here, this is just an example
            print(f'{date} - {row.find("td", class_="stake").getText(strip=True)}')

Output:

05 Apr 2021 - $4,484
29 Mar 2021 - $595
23 Mar 2021 - $4,484
12 Mar 2021 - $220
08 Mar 2021 - $181
02 Mar 2021 - $263
19 Feb 2021 - $180
12 Feb 2021 - $1,200
26 Jan 2021 - $4,484
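
The loop in the question compares the raw date strings, which is probably why it "doesn't seem to work". The sketch below shows the difference using only the standard library. The helper name parse_date and the constant DATE_FORMAT are illustrative, and the "%d %b %Y" format is assumed to match the table (note that %b depends on an English locale for "Apr", "Oct" and so on).

```python
from datetime import datetime

DATE_FORMAT = "%d %b %Y"  # e.g. "11 Oct 2020"; assumed to match the table


def parse_date(text):
    """Turn a string such as '11 Oct 2020' into a datetime (hypothetical helper)."""
    return datetime.strptime(text.strip(), DATE_FORMAT)


row_date = "05 Apr 2021"
target_date = "11 Oct 2020"

# Plain string comparison goes character by character: "0" < "1",
# so the 2021 date wrongly looks "smaller" than the 2020 one.
string_result = row_date >= target_date

# Comparing parsed values respects the calendar.
parsed_result = parse_date(row_date) >= parse_date(target_date)

print(string_result, parsed_result)
```

datetime.strptime is used here instead of time.strptime only because datetime objects are convenient for later arithmetic; either one compares correctly once parsed.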

Going back in time:

    target_date = "26 Jan 2021"

    for row in table.find_all("tr")[1:]:  # skipping the header
        date = row.find("td", class_="date").find("a").getText()  # table date
        if time.strptime(date, "%d %b %Y") <= time.strptime(target_date, "%d %b %Y"):  # comparing the dates
            # do your parsing here, this is just an example
            print(f'{date} - {row.find("td", class_="stake").getText(strip=True)}')

Output:

26 Jan 2021 - $4,484
14 Sep 2020 - $100
11 Sep 2020 - $616
04 Sep 2020 - $180
21 Aug 2020 - $180
17 Aug 2020 - $595
28 Jul 2020 - $4,291
21 Jul 2020 - $3,523
13 Jul 2020 - $300
30 Jun 2020 - $1,173
15 Jun 2020 - $100
30 May 2020 - $3,523
22 May 2020 - $500
12 May 2020 - $963
05 May 2020 - $3,523
02 May 2020 - $1,986
24 Apr 2020 - $144
09 Apr 2020 - $144
30 Mar 2020 - $1,225
10 Mar 2020 - $100
09 Dec 2019 - $595
02 Dec 2019 - $4,484
26 Nov 2019 - $4,484
19 Nov 2019 - $100
02 Nov 2019 - $4,484
27 Oct 2019 - $2,562
13 Oct 2019 - $700
31 May 2019 - $1,000
21 May 2019 - $4,484
07 May 2019 - $1,225
27 Apr 2019 - $595
21 Apr 2019 - $0
14 Apr 2019 - $0
07 Apr 2019 - $0
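
Since the two loops differ only in the comparison operator, one possible way to avoid duplicating them is to pass the operator in. This is a sketch, not part of the original answer: filter_rows is a hypothetical helper, and the rows are plain (date, stake) tuples copied from the outputs above rather than live Beautiful Soup rows.

```python
import operator
from datetime import datetime

DATE_FORMAT = "%d %b %Y"  # assumed to match the table's date cells


def filter_rows(rows, target, compare=operator.ge):
    """Keep (date_text, stake) rows whose date satisfies compare(date, target)."""
    target_dt = datetime.strptime(target, DATE_FORMAT)  # parse the target once
    return [
        (date_text, stake)
        for date_text, stake in rows
        if compare(datetime.strptime(date_text, DATE_FORMAT), target_dt)
    ]


# A few rows copied from the outputs above, newest first.
rows = [
    ("12 Feb 2021", "$1,200"),
    ("26 Jan 2021", "$4,484"),
    ("14 Sep 2020", "$100"),
    ("11 Sep 2020", "$616"),
]

onwards = filter_rows(rows, "26 Jan 2021", operator.ge)    # target date and later
backwards = filter_rows(rows, "26 Jan 2021", operator.le)  # target date and earlier
print(onwards)
print(backwards)
```

Parsing the target date once, outside the loop, also avoids re-parsing the same string for every row.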

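【Comments】:

What if you wanted to go backwards from 26 January instead of forwards?
Well, then you change your target_date and switch the comparison from >= to <= to go back in time.
What if you wanted to pick out the second row of the output? Would you need to use enumerate in the for loop to get at the second value in the code?
Yes, that's one way to do it.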
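
A minimal sketch of the enumerate idea from the last two comments, assuming the rows have already been reduced to (date, stake) tuples; the sample values are copied from the outputs above, and the variable names are illustrative.

```python
from datetime import datetime

DATE_FORMAT = "%d %b %Y"  # assumed to match the table's date cells
target = datetime.strptime("26 Jan 2021", DATE_FORMAT)

rows = [
    ("12 Feb 2021", "$1,200"),
    ("26 Jan 2021", "$4,484"),
    ("14 Sep 2020", "$100"),
    ("11 Sep 2020", "$616"),
]

# Lazily yield only the rows on or before the target date.
matches = (
    row for row in rows
    if datetime.strptime(row[0], DATE_FORMAT) <= target
)

# Count the matching rows from 1 and stop at the second one.
second_match = None
for position, row in enumerate(matches, start=1):
    if position == 2:
        second_match = row
        break

print(second_match)
```

Counting the filtered rows rather than the raw table rows matters here, because the second row of the table is not necessarily the second row of the output. The same result could also be obtained without an explicit loop via `next(itertools.islice(matches, 1, 2), None)`.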