【Question title】: Beautiful Soup throws `IndexError`
【Posted】: 2013-09-30 03:34:52
【Question description】:

I am scraping a website with Python 2.7 and Beautiful Soup 3.2. I am new to both, but I have made a start with the help of the documentation.

I have been reading the following documents: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#contents http://thepcspy.com/read/scraping-websites-with-python/

This is what I have so far (including the part that fails):

# Import the classes that are needed
import urllib2
from BeautifulSoup import BeautifulSoup

# URL to scrape and open it with the urllib2
url = 'http://www.wiziwig.tv/competition.php?competitionid=92&part=sports&discipline=football'
source = urllib2.urlopen(url)

# Turn the fetched source into a BeautifulSoup object
soup = BeautifulSoup(source)

# From the source HTML page, search and store all <td class="home">..</td> and their content
hometeamsTd = soup.findAll('td', { "class" : "home" })
# Loop through the tag and store only the needed information, being the home team
hometeams = [tag.contents[1] for tag in hometeamsTd]

# From the source HTML page, search and store all <td class="away">..</td> and their content
awayteamsTd = soup.findAll('td', { "class" : "away" })
# Loop through the tag and store only the needed information, being the away team
awayteams = [tag.contents[1] for tag in awayteamsTd]

The tag.contents of each tag in hometeamsTd looks like this:

[
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Harkemase Boys', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6077" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'RKC Waalwijk', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-427" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Dutch KNVB Beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'PSV', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-3" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Ajax', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-2" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Dutch KNVB Beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'SC Heerenveen', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-14" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Feyenoord', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-9" />],
    [<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Dutch KNVB Beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />]
]

The tag.contents of each tag in awayteamsTd looks like this:

[
    [u'Away-team'], 
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-13" />, u'NEC', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />], 
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-11" />, u'Heracles', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />], 
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-428" />, u'Stormvogels Telstar', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />], 
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-419" />, u'FC Volendam', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-7" />, u'FC Twente', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
    [<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-415" />, u'FC Dordrecht', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />]
]

The problems I have tried to solve, but have not fully solved, are:

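  • The line awayteams = [tag.contents[1] for tag in awayteamsTd] raises IndexError: list index out of range. That makes sense: as the awayteamsTd output shows, the first entry is [u'Away-team'], which has only one element, so index 1 does not exist. How can I remove or skip that entry?
  • The hometeams output works fine, but I want to exclude the entries whose text is Dutch KNVB Beker.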

【Question comments】:

    Tags: python beautifulsoup


    【Solution 1】:

    The problem is that the "Away-team" header cell (the column name) is itself a td with the class "away":

    <thead class="title">
        ...
        <tr class="sub">
          ...  
          <td>Home-team</td>
          <td></td>
          <td class="away">Away-team</td>
          <td class="broadcast">Broadcast</td>
        </tr>
    </thead>
    

    Use a slice to skip it:

    awayteamsTd = soup.findAll('td', { "class" : "away" })[1:]
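
To see why the slice helps, here is a minimal sketch that runs without Beautiful Soup. Plain Python lists stand in for each tag.contents, and the '<img>' strings are placeholders for the real img tags; the names away_contents and failed are made up for this illustration.

```python
# Plain lists stand in for tag.contents so this runs without Beautiful Soup.
# The '<img>' strings are placeholders for the real <img> tags.
away_contents = [
    [u'Away-team'],                   # header cell: only one child
    ['<img>', u'NEC', '<img>'],
    ['<img>', u'Heracles', '<img>'],
]

# Indexing [1] on the one-element header row fails, as in the question.
try:
    [contents[1] for contents in away_contents]
    failed = False
except IndexError:
    failed = True

# Skipping the first entry with a slice avoids the error.
awayteams = [contents[1] for contents in away_contents[1:]]
print(failed, awayteams)
```

Note that the slice assumes the header cell is always the first match. If the page layout changes, a length check on each cell is the safer option.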
    

    Also, if you want to exclude Dutch KNVB Beker from the home team list, add a condition to the list comprehension:

    hometeams = [tag.contents[1] for tag in hometeamsTd if tag.contents[1] != 'Dutch KNVB Beker']
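
If more than one name needs to be excluded, the same condition can test membership in a set. The sketch below again uses plain lists as stand-ins for tag.contents; the EXCLUDED set and the strip() call are assumptions for illustration (strip() only guards against stray whitespace around the scraped text, which the question's output does not show).

```python
# Plain lists stand in for tag.contents; '<img>' is a placeholder for a tag.
home_contents = [
    ['<img>', u'Harkemase Boys', '<img>'],
    ['<img>', u'Dutch KNVB Beker', '<img>'],
    ['<img>', u' PSV ', '<img>'],     # padded on purpose to show strip()
]

# Hypothetical set of names to leave out of the result.
EXCLUDED = {u'Dutch KNVB Beker'}

# Keep the cleaned name of every cell whose name is not excluded.
hometeams = [
    contents[1].strip()
    for contents in home_contents
    if contents[1].strip() not in EXCLUDED
]
print(hometeams)
```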
    

    【Comments】:

      【Solution 2】:
      awayteams = []
      for tag in awayteamsTd:
          if len(tag.contents) > 1:
              awayteams.append(tag.contents[1])
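
The length check above and the exclusion from the other answer could be combined in one helper. This is only a sketch: extract_names and its excluded parameter are hypothetical names, and plain lists stand in for tag.contents so the example runs without Beautiful Soup.

```python
def extract_names(contents_list, excluded=()):
    """Return the second child of each cell, skipping cells that are
    too short (such as a header cell) and any excluded names."""
    names = []
    for contents in contents_list:
        if len(contents) > 1 and contents[1] not in excluded:
            names.append(contents[1])
    return names


# '<img>' is a placeholder for the real <img> tags.
cells = [
    [u'Away-team'],                           # header cell, skipped by the length check
    ['<img>', u'NEC', '<img>'],
    ['<img>', u'Dutch KNVB Beker', '<img>'],  # skipped by the exclusion
    ['<img>', u'FC Twente', '<img>'],
]
names = extract_names(cells, excluded=(u'Dutch KNVB Beker',))
print(names)
```

With the real page, the call would presumably look like extract_names([tag.contents for tag in awayteamsTd]), which does not depend on the header cell being first.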
      

      【Comments】:
