【问题标题】:How do I resolve list index out of range error?如何解决列表索引超出范围错误?
【发布时间】:2022-01-20 20:35:56
【问题描述】:

我有一个将数据抓取到数据框中的代码

import os
import re
import threading
from multiprocessing.pool import ThreadPool
import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver



class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # Un-comment next line to supress logging:
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        # print('The driver has been "quitted".')


threadLocal = threading.local()


def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver


class GameData:
    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []
        self.country = []
        self.league = []


def generate_matches(table):
    tr_tags = table.findAll('tr')
    for tr_tag in tr_tags:
        if 'class' in tr_tag.attrs and 'dark' in tr_tag['class']:
            th_tag = tr_tag.find('th', {'class': 'first2 tl'})
            a_tags = th_tag.findAll('a')
            country = a_tags[0].text
            league = a_tags[1].text
        else:
            td_tags = tr_tag.findAll('td')
            if len(td_tags) > 0:  # or just if td_tags
                yield [td_tags[0].text, td_tags[1].text, td_tags[2].text, td_tags[3].text,
                       td_tags[4].text, td_tags[5].text, country, league]


def parse_data(url, return_urls=False):
    browser = create_driver()
    browser.get(url)
    soup = bs(browser.page_source, "lxml")
    div = soup.find('div', {'id': 'col-content'})
    table = div.find('table', {'class': 'table-main'})
    h1 = soup.find('h1').text
    print(h1)
    m = re.search(r'\d+ \w+ \d{4}$', h1)
    game_date = m[0]
    game_data = GameData()
    for row in generate_matches(table):
        game_data.date.append(game_date)
        game_data.time.append(row[0])
        game_data.game.append(row[1])
        # Score present?
        if ':' not in row[2]:
            # No, shift a few columns right:
            row[5], row[4], row[3], row[2] = row[4], row[3], row[2], nan
        game_data.score.append(row[2])
        game_data.home_odds.append(nan if row[3] == '-' else row[3])
        game_data.draw_odds.append(nan if row[4] == '-' else row[4])
        game_data.away_odds.append(nan if row[5] == '-' else row[5])
        game_data.country.append(row[6])
        game_data.league.append(row[7])

    if return_urls:
        span = soup.find('span', {'class': 'next-games-date'})
        a_tags = span.findAll('a')
        urls = ['https://www.oddsportal.com' + a_tag['href'] for a_tag in a_tags]
        return game_data, urls
    return game_data


if __name__ == '__main__':
    results = None
    pool = ThreadPool(5)  # We will be getting, however, 7 URLs
    # Get today's data and the Urls for the other days:
    game_data_today, urls = pool.apply(parse_data, args=('https://www.oddsportal.com/matches/soccer', True))
    urls.pop(1)  # Remove url for today: We already have the data for that
    game_data_results = pool.imap(parse_data, urls)
    for i in range(8):
        try:
            game_data = game_data_today if i == 1 else next(game_data_results)
            result = pd.DataFrame(game_data.__dict__)
            if results is None:
                results = result
            else:
                results = results.append(result, ignore_index=True)
    print(results)
    # ensure all the drivers are "quitted":
    del threadLocal
    import gc

    gc.collect()  # a little extra insurance

但是我在以下位置得到部分输出:

h1 = soup.find('h1').text
print(h1)

Next Soccer Matches: Today, 18 Dec 2021
Next Soccer Matches: Wednesday, 22 Dec 2021
Next Soccer Matches: Thursday, 23 Dec 2021
Next Soccer Matches: Friday, 24 Dec 2021
Next Soccer Matches: Tuesday, 21 Dec 2021
Next Soccer Matches: Monday, 20 Dec 2021
Next Soccer Matches: Yesterday, 17 Dec 2021
Next Soccer Matches: Tomorrow, 19 Dec 2021

当我检查时,对于论点

当我进一步探索时

        td_tags = tr_tag.findAll('td')
        print(td_tags)
        if len(td_tags) > 0:  # or just if td_tags
            print(len(td_tags))

我收到了 2 个案例

[<td class="table-time datet t1639839600-1-1-0-0">15:00</td>, <td class="name table-participant"><a href="/soccer/england/npl-premier-division/south-shields-witton-albion-COXLb3wr/">South Shields - Witton</a><span class="ico-event-info" onmouseout="allowHideTootip(true);delayHideTip(200);" onmouseover="toolTip('Postponed due to Covid-19.', this, event, '4');allowHideTootip(false);delayHideTip(200);return false;"> </span></td>, <td class="center bold table-odds table-score">postp.</td>, <td class="odds-nowrp" xodd="1.36" xoid="E-4v5i6xv464x0xd4ur7"><a href="" onclick="globals.ch.togle(this , 'E-4v5i6xv464x0xd4ur7');return false;" xparam="odds_text">1.36</a></td>, <td class="odds-nowrp" xodd="4.84" xoid="E-4v5i6xv498x0x0"><a href="" onclick="globals.ch.togle(this , 'E-4v5i6xv498x0x0');return false;" xparam="odds_text">4.84</a></td>, <td class="odds-nowrp" xodd="7.05" xoid="E-4v5i6xv464x0xd4ur8"><a href="" onclick="globals.ch.togle(this , 'E-4v5i6xv464x0xd4ur8');return false;" xparam="odds_text">7.05</a></td>, <td class="center info-value">11</td>]
7
[<td class="table-time datet t1639839600-1-1-0-0">15:00</td>, <td class="name table-participant" colspan="2"><a href="/soccer/england/npl-premier-division/warrington-town-scarborough-athletic-ObyQcNhl/">Warrington - Scarborough</a></td>, <td class="odds-nowrp" xodd="1.8" xoid="E-4v5i7xv464x0xd4ur9"><a href="" onclick="globals.ch.togle(this , 'E-4v5i7xv464x0xd4ur9');return false;" xparam="odds_text">1.80</a></td>, <td class="odds-nowrp" xodd="3.59" xoid="E-4v5i7xv498x0x0"><a href="" onclick="globals.ch.togle(this , 'E-4v5i7xv498x0x0');return false;" xparam="odds_text">3.59</a></td>, <td class="odds-nowrp" xodd="3.91" xoid="E-4v5i7xv464x0xd4ura"><a href="" onclick="globals.ch.togle(this , 'E-4v5i7xv464x0xd4ura');return false;" xparam="odds_text">3.91</a></td>, <td class="center info-value">12</td>]
6

我在以下位置收到 IndexError:

line 67, in generate_matches
    yield [td_tags[0].text, td_tags[1].text, td_tags[2].text, td_tags[3].text,
IndexError: list index out of range

我该如何解决这个问题?

Lorem ipsum dolor sit amet,consectetur adipiscing elit。 Quisque pellentesque, ipsum vel tempor suscipit, turpis mauris venenatis leo, nec vestibulum arcu urna et quam

【问题讨论】:

  • len 3 =4 个对象?
  • 如果 len(td_tags) = 3 则只能获取 0、1、2 索引的值,并且 td_tags[3] 和接下来的其他索引将失败。也许您需要使用 try/except 子句来捕获错误并使用您得到的其余结果。
  • 那个子句怎么写?
  • 如果 len(td_tags) = 3 则访问 `td_tags[3]` 将导致 IndexError 并且您正在访问具有更大索引的 td_tags。
  • len = 7时如何检查,那些len的内容是什么?

标签: python selenium beautifulsoup


【解决方案1】:

访问td_tags 时获得IndexError 的事实暗示len(td_tags)并不总是 6 或7。它至少有一次

看看这段代码:

if len(td_tags) > 0:  # or just if td_tags
    yield [td_tags[0].text, td_tags[1].text, td_tags[2].text, td_tags[3].text,
           td_tags[4].text, td_tags[5].text, country, league]

您正在使用从 0 到 5 的列表索引。 要获得至少 6 个 IndexError 需要 len(td_tags)! 所以把第一行改成:

if len(td_tags) > 5:

那应该去掉IndexError

【讨论】:

  • 当我进一步探索时,我得到了 2 个案例,更新了问题并附上了解释
  • 对!我一直在寻找超过 6 个哈哈的案例,而解决方案则在不到。是的,一个值是 len 1
猜你喜欢
  • 2015-04-27
  • 2020-09-01
  • 1970-01-01
  • 2021-10-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
相关资源
最近更新 更多