【Question title】:How to scrape data from a table using a loop to get all td data using python
【Posted】:2016-04-22 06:50:37
【Question】:

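I am trying to get some data from a website and am having a hard time with it. I can get the players' names, but that is all so far, and I have been trying different things. Below is a sample of the HTML I am trying to parse, followed by my Python script. Note that there are two tables (one for each team), and that the class of each player row alternates between "odd" and "even". I have marked the parts I want. I am using Python 2.7.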

`<table id="nbaGITeamStats" cellpadding="0" cellspacing="0">
      <thead class="nbaGIClippers">
         <tr>
            <th colspan="17">Los Angeles Clippers (1-0)</th> <!-- I want team name  -->
         </tr>
      </thead>
      <tbody><tr colspan="17">
         <td colspan="17" class="nbaGIBoxCat"><span>field goals</span><span>rebounds</span></td>
      </tr>
      <tr>
     <td class="nbaGITeamHdrStatsNoBord" colspan="1">&nbsp;</td>
     <td class="nbaGITeamHdrStats">pos</td>
     <td class="nbaGITeamHdrStats">min</td>
     <td class="nbaGITeamHdrStats">fgm-a</td>
     <td class="nbaGITeamHdrStats">3pm-a</td>
     <td class="nbaGITeamHdrStats">ftm-a</td>
     <td class="nbaGITeamHdrStats">+/-</td>
     <td class="nbaGITeamHdrStats">off</td>
     <td class="nbaGITeamHdrStats">def</td>
     <td class="nbaGITeamHdrStats">tot</td>
     <td class="nbaGITeamHdrStats">ast</td>
     <td class="nbaGITeamHdrStats">pf</td>
     <td class="nbaGITeamHdrStats">st</td>
     <td class="nbaGITeamHdrStats">to</td>
     <td class="nbaGITeamHdrStats">bs</td>
     <td class="nbaGITeamHdrStats">ba</td>
     <td class="nbaGITeamHdrStats">pts</td>
  </tr>
  <tr class="odd">
     <td id="nbaGIBoxNme" class="b"><a href="/playerfile/paul_pierce/index.html">P. Pierce</a></td> <!-- I want player name  -->
     <td class="nbaGIPosition">F</td> <!-- I want position name  -->
     <td>14:16</td> <!-- I want this  -->
     <td>1-4</td>  <!-- I want this  -->
     <td>1-2</td>  <!-- I want this  -->
     <td>2-2</td>  <!-- I want this  -->
     <td>+12</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>3</td>  <!-- I want this  -->
     <td>2</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>5</td>  <!-- I want this  -->
  </tr>

  <tr class="even">
     <td id="nbaGIBoxNme" class="b"><a href="/playerfile/blake_griffin/index.html">B. Griffin</a></td>  <!-- I want this  -->
     <td class="nbaGIPosition">F</td>  <!-- I want this  -->
     <td>26:19</td>  <!-- I want this  -->
     <td>5-14</td>  <!-- I want this  -->
     <td>0-1</td>  <!-- I want this  -->
     <td>1-1</td>  <!-- I want this  -->
     <td>+14</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>5</td>  <!-- I want this  -->
     <td>5</td>  <!-- I want this  -->
     <td>2</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>11</td>  <!-- I want this  -->
  </tr>
  <tr class="odd">
     <td id="nbaGIBoxNme" class="b"><a href="/playerfile/deandre_jordan/index.html">D. Jordan</a></td>  <!-- I want this  -->
     <td class="nbaGIPosition">C</td>  <!-- I want this  -->
     <td>26:27</td>  <!-- I want this  -->
     <td>6-7</td>  <!-- I want this  -->
     <td>0-0</td>  <!-- I want this  -->
     <td>3-5</td>  <!-- I want this  -->
     <td>+19</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>11</td>  <!-- I want this  -->
     <td>12</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>1</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>2</td>  <!-- I want this  -->
     <td>3</td>  <!-- I want this  -->
     <td>0</td>  <!-- I want this  -->
     <td>15</td>  <!-- I want this  -->
  </tr>
   <!-- And so on; the class keeps alternating from odd to even and even to odd  -->
    <!-- Also note there are two tables, one for each team  -->
   <!-- This is the table id: <table id="nbaGITeamStats" cellpadding="0" cellspacing="0"> -->`

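That was long, but I wanted to show an example of the alternating classes. Here is my Python script. Once I can actually scrape the data, I plan to store it in a dictionary.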

import urllib
import urllib2
from bs4 import BeautifulSoup
import re
gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
for game in gamesForDay:
   url =  "http://www.nba.com/"+game
   page = urllib2.urlopen(url).read()
   soup = BeautifulSoup(page)
   for tr in soup.find_all('table id="nbaGITeamStats'):
    tds = tr.find_all('td')
    print tds

【Question comments】:

    Tags: python html web-scraping


    【Solution 1】:

    The correct way to write it is:

    for tr in soup.find_all('table', id='nbaGITeamStats'):
    

    This works fine for me (Python 3.4):

    >>> import requests
    >>> from bs4 import BeautifulSoup
    >>> gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
    >>> 
    >>> for game in gamesForDay:
    ...    url =  "http://www.nba.com/"+game
    ...    page = requests.get(url).content
    ...    soup = BeautifulSoup(page, 'html.parser')
    ...    for tr in soup.find_all('table', id='nbaGITeamStats'):
    ...        tds = tr.find_all('td')
    ...        print(tds)
    

    To access the content inside a td tag, use .text, like this:

    for td in tds:
       print(td.text)
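
A minimal sketch of picking one cell by position, as an illustration of the `.text` approach above. The HTML is a trimmed copy of one player row from the question, inlined so the snippet runs without a network call. The names `row_html` and `cells` are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one player row from the question (inlined; no network needed).
row_html = """
<table id="nbaGITeamStats">
  <tr class="odd">
    <td id="nbaGIBoxNme" class="b"><a href="/playerfile/paul_pierce/index.html">P. Pierce</a></td>
    <td class="nbaGIPosition">F</td>
    <td>14:16</td>
    <td>1-4</td>
  </tr>
</table>
"""

soup = BeautifulSoup(row_html, "html.parser")
row = soup.find("tr", class_="odd")

# find_all returns the cells in document order, so they can be indexed like a list.
cells = [td.get_text(strip=True) for td in row.find_all("td")]

player_name = cells[0]  # first cell: player name
position = cells[1]     # second cell: position
minutes = cells[2]      # third cell: minutes played

print(player_name, position, minutes)
```

Indexing by position is simple but assumes every row has the same column layout; if a column can be missing, matching on `id` or `class` is likely to be more robust.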
    

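    【Comments】:

    • Thank you, this works for the tds. I am trying to figure out how to get the td containing 14:16. Is there a way to pick out a td by number?
    • Yes, you can get 14:16 by calling .text on the td you need. Just count which one it is, or set a condition to select it.
    【Solution 2】: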

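    Here is my solution. Note that I have a slightly different version of BeautifulSoup, not the one from bs4, but the logic should not be too far off. This is still on Python 2.7 (on Windows, in my case).

    You may need to fix some nuances that differ from the player section shown above, but I think you will be able to handle that part :-)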

    import urllib
    import urllib2
    # from bs4 import BeautifulSoup
    from BeautifulSoup import BeautifulSoup
    import re
    gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
    for game in gamesForDay:
       url =  "http://www.nba.com/"+game
       page = urllib2.urlopen(url).read()
       soup = BeautifulSoup(page)
    
       # fetch the tables you are interested in
       tables = soup.findAll(id="nbaGITeamStats")
       for table in tables:
           team_name = table.thead.tr.th.text
           # odd/even class rows (tr)
           rows = [ x for x in table.findAll('tr') if x.get('class',None) in ['odd','even'] ]
           for player in rows:
               # search the row cols based on 'id'
               player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text
    
               # search the row cols based on 'class'
               player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
    
               # search for all td where the class is not defined
               player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
    
               print player_name, player_position, player_numbers
    

    For bs4 (BeautifulSoup4, as far as I know) some modifications are needed. You still have to handle a few things, but this extracts most of the data you want:

    import urllib
    import urllib2
    from bs4 import BeautifulSoup
    import re
    gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
    for game in gamesForDay:
       url =  "http://www.nba.com/"+game
       page = urllib2.urlopen(url).read()
       soup = BeautifulSoup(page, "html.parser")
    
       # fetch the tables you are interested in
       tables = soup.findAll(id="nbaGITeamStats")
       for table in tables:
           team_name = table.thead.tr.th.text
           # odd/even class rows (tr)
           rows = table.find_all(attrs={'class':'odd'})
           rows.extend(table.find_all(attrs={'class':'even'}))
    
           for player in rows:
               # search the row cols based on 'id'
               player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text
    
               # search the row cols based on 'class'
               player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
    
               # search for all td where the class is not defined
               player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
    
               print player_name, player_position, player_numbers
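
One likely reason the row filter from the first listing has to change under bs4: as far as I know, BeautifulSoup 3 returns the `class` attribute as a plain string, while bs4 treats it as multi-valued and returns a list. The sketch below uses a tiny inline table (illustrative markup, not the real page) to show the difference and a set-based filter that works with the list form.

```python
from bs4 import BeautifulSoup

# Tiny illustrative table: one row without a class, one "odd", one "even".
html = """
<table>
  <tr><td>header row, no class</td></tr>
  <tr class="odd"><td>P. Pierce</td></tr>
  <tr class="even"><td>B. Griffin</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# bs4 returns class as a list of names, or None when the attribute is absent.
classes = [tr.get("class") for tr in rows]
print(classes)

# A list such as ['odd'] is never equal to the string 'odd', so this stays empty.
string_filter = [tr for tr in rows if tr.get("class", None) in ["odd", "even"]]
print(len(string_filter))

# Intersecting sets of class names works with the list form.
wanted = {"odd", "even"}
set_filter = [tr for tr in rows if wanted & set(tr.get("class") or [])]
print([tr.td.get_text() for tr in set_filter])
```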
    

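    【Comments】:

    • This looks like it pinpoints the data the way I want, but I don't think it works with my BeautifulSoup version. I'll try tweaking it a bit, but thanks for the reply.
    • If it helps, I installed Beautiful Soup via pip install BeautifulSoup. I'm on Windows 10, Python 2.7.
    • For some reason this doesn't print anything. I used the second part you provided, but nothing gets printed.
    • I can print the tables and the data shows up. Then I can print team_name. But when I get to the rows, it shows an empty list. If I move team_name down next to player_name and everything else, for some reason nothing is printed.
    • That's weird. I copied the code verbatim and it ran fine until it broke (I mentioned you would need to fix some things), but it did print this much: pastebin.com/15HjtH5Q
    【Solution 3】: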

    This is what I ended up doing. Of course I still have to clean up the code from here; I got a lot of help from sal.

    import urllib2
    from bs4 import BeautifulSoup
    import re
    gamesForDay = ['/games/20151002/DENLAC/gameinfo.html']
    for game in gamesForDay:
       url =  "http://www.nba.com/"+game
       page = urllib2.urlopen(url).read()
       soup = BeautifulSoup(page, "html.parser")
    
       # fetch the tables you are interested in
       tables = soup.findAll(id="nbaGITeamStats")
       for table in tables:
            team_name = table.thead.tr.th.text
            # odd/even class rows (tr)
            rowsodd = table.find_all(attrs={'class':'odd'})
            rowseven =table.find_all(attrs={'class':'even'})
    
            for player in rowsodd:
                # search the row cols based on 'id'
                player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text
    
                # search the row cols based on 'class'
                #player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
                #^THERE ARE ONLY POSITIONS PUT ON PLAYERS AFTER THEY ARE PUT IN THE GAME.
                # search for all td where the class is not defined
                player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
    
                print player_name, player_numbers
            for player in rowseven:
                # search the row cols based on 'id'
                player_name = player.find('td', attrs={'id':'nbaGIBoxNme'}).text
    
                # search the row cols based on 'class'
                #player_position = player.find('td', attrs={'class':'nbaGIPosition'}).text
                 #^THERE ARE ONLY POSITIONS PUT ON PLAYERS AFTER THEY ARE PUT IN THE GAME.
                # search for all td where the class is not defined
                player_numbers = [ x.text for x in player.findAll('td', attrs={'class':None})]
                print player_name, player_numbers
    

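    Now everything shows up. I have to clean it up a bit more, but the data is much cleaner. As you can tell from the question, I had never actually used Beautiful Soup before. It needs two loops over the rows (odd and even), or maybe someone knows a better way; this was the easiest way for me to get the data, and I am always looking to improve. I hope others can learn from this.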
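
One possible way to avoid the two loops, sketched under assumptions rather than tested against the live page: bs4's `find_all` accepts a list as an attribute value, so `class_=["odd", "even"]` should match both row classes in a single pass and keep the rows in page order. The HTML below is a trimmed copy of the table from the question, and the `stats` dictionary layout is a hypothetical choice for illustration. Cells without a class are collected with `has_attr`, a plain-Python stand-in for the `attrs={'class':None}` filter used above.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (inlined; no network needed).
html = """
<table id="nbaGITeamStats">
  <thead class="nbaGIClippers">
    <tr><th colspan="17">Los Angeles Clippers (1-0)</th></tr>
  </thead>
  <tbody>
    <tr class="odd">
      <td id="nbaGIBoxNme" class="b"><a href="/playerfile/paul_pierce/index.html">P. Pierce</a></td>
      <td class="nbaGIPosition">F</td><td>14:16</td><td>1-4</td>
    </tr>
    <tr class="even">
      <td id="nbaGIBoxNme" class="b"><a href="/playerfile/blake_griffin/index.html">B. Griffin</a></td>
      <td class="nbaGIPosition">F</td><td>26:19</td><td>5-14</td>
    </tr>
    <tr class="odd">
      <td id="nbaGIBoxNme" class="b"><a href="/playerfile/deandre_jordan/index.html">D. Jordan</a></td>
      <td class="nbaGIPosition">C</td><td>26:27</td><td>6-7</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
stats = {}  # hypothetical layout: team name -> list of player dicts, in page order

for table in soup.find_all("table", id="nbaGITeamStats"):
    team_name = table.find("th").get_text(strip=True)
    players = []
    # A list value matches either class; rows come back in document order.
    for row in table.find_all("tr", class_=["odd", "even"]):
        name = row.find("td", id="nbaGIBoxNme").get_text(strip=True)
        # The position cell may be absent for players who have not entered the game.
        pos_cell = row.find("td", class_="nbaGIPosition")
        position = pos_cell.get_text(strip=True) if pos_cell else None
        # Stat cells are the ones that carry no class attribute.
        numbers = [td.get_text(strip=True)
                   for td in row.find_all("td") if not td.has_attr("class")]
        players.append({"name": name, "position": position, "numbers": numbers})
    stats[team_name] = players

for team, players in stats.items():
    for player in players:
        print(team, player["name"], player["position"], player["numbers"])
```

Collecting odd and even rows separately lists all the odd rows before the even ones; a single pass keeps the players in the order they appear on the page.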

    【Comments】:
