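【Question Title】:How to scrape websites with Python and Beautiful Soup
【Posted】:2015-08-27 16:32:39
【Question Description】:

I'm trying to scrape results from the BBC Sport website. I already get the scores, but when I try to add the team names, the program prints None 1-0 None (for example). Here is the code: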

from bs4 import BeautifulSoup
import urllib.request
import csv 

url = 'http://www.bbc.co.uk/sport/football/teams/derby-county/results'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page)
for match in soup.select('table.table-stats tr.report'):
    team1 = match.find('span', class_='team-home')
    team2 = match.find('span', class_='team-away')
    score = match.abbr

    print(team1.string, score.string, team2.string)

【Question Discussion】:

    Tags: python python-3.x web-scraping beautifulsoup


    【Solution 1】:

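    It looks like you are searching for tags that don't exist. For example, class_="team-home teams" is in the HTML, but class_='team-home' is not. The following code prints the first team name: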

    tables = soup.find_all("table", class_="table-stats")
    
    tables[0].find("span", class_="team-home teams").text
    # u' Birmingham '
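
A related detail may be worth checking here. As far as the Beautiful Soup documentation describes it, class_='team-home' matches any tag that has team-home among its classes, so the find call in the question would still locate the span; the None would then come from .string, which is None whenever a tag has more than one child. The sketch below uses a made-up HTML fragment (an assumption modelled on the u' Birmingham ' output above, not the real BBC markup) to show the difference between .string and .text.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real page markup may differ.
html = '<span class="team-home teams"> <a href="#">Birmingham</a> </span>'
soup = BeautifulSoup(html, "html.parser")

# class_ is matched against each individual CSS class, so this finds the span.
span = soup.find("span", class_="team-home")

# The span has three children (whitespace, <a>, whitespace), so .string is None.
string_value = span.string

# get_text() concatenates all nested text; strip=True trims the padding.
text_value = span.get_text(strip=True)

print(string_value, repr(span.text), text_value)
```

If this holds for the real page, switching from .string to .text or get_text(strip=True) would be enough, whichever class string is used for the lookup.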
    

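    【Discussion】:

    • Where should I put this code? I put the first line before the for loop and the last line inside it, but that gives me an index error.
    • Your variables don't match the HTML, so you need to change them to find anything. Loop over what I wrote for the tables rather than what you have, and change team1 to my second line. That should get you started. If a variable doesn't find anything, make sure you are matching the HTML exactly.
    • To put it another way, view the source of the linked page and press ctrl-f for 'team-home'. There are zero matches, which is why your code returns nothing. For 'team-home teams', however, you get plenty of matches.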
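
To make the advice in these comments concrete, here is a minimal sketch of the question's loop with the text lookups changed. It runs against a small invented HTML table (the row structure and the parse_results helper are assumptions for illustration, not the actual BBC page), and it uses get_text(strip=True) so that nested tags and padding whitespace do not produce None.

```python
from bs4 import BeautifulSoup

# Invented sample; class names follow the ones quoted in this thread.
SAMPLE_HTML = """
<table class="table-stats">
  <tr class="report">
    <td><span class="team-home teams"> <a href="#">Birmingham</a> </span></td>
    <td><span class="score"><abbr title="Score"> 1-1 </abbr></span></td>
    <td><span class="team-away teams"> <a href="#">Derby</a> </span></td>
  </tr>
</table>
"""

def parse_results(html):
    """Return a list of (home, score, away) tuples, one per report row."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for match in soup.select("table.table-stats tr.report"):
        home = match.find("span", class_="team-home")
        away = match.find("span", class_="team-away")
        score = match.abbr
        # Skip rows that lack any of the three pieces instead of crashing.
        if home is None or away is None or score is None:
            continue
        results.append((
            home.get_text(strip=True),
            score.get_text(strip=True),
            away.get_text(strip=True),
        ))
    return results

for home, score, away in parse_results(SAMPLE_HTML):
    print(home, score, away)
```

Keeping every lookup inside the loop, relative to the current row, avoids indexing into a separate list of tables and so sidesteps the index error mentioned above.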
    【Solution 2】:

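    Here is one possible solution. It uses BeautifulSoup to get the home and away team names, the final score, the match date and the competition name, and puts them into a DataFrame.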

    import requests
    import pandas as pd
    from bs4 import BeautifulSoup
    
    # Get the relevant webpage and set the data up for parsing
    url = "http://www.bbc.co.uk/sport/football/teams/derby-county/results"
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    
    # Parse the soup for one category of information and return it as a one-column DataFrame
    def get_match_info(soup, tag, class_name, column_name):
        info_array = []
        for info in soup.find_all(tag, attrs={'class': class_name}):
            info_array.append({column_name: info.text})
        return pd.DataFrame(info_array)
    
    #for each category pass the above function the relevant information i.e. tag names
    date        = get_match_info(soup,"td","match-date","Date")
    home_team   = get_match_info(soup,"span","team-home teams","Home Team")
    score       = get_match_info(soup,"span","score","Score")
    away_team   = get_match_info(soup,"span","team-away teams","Away Team")
    competition = get_match_info(soup,"td","match-competition","Competition")
    
    #Concatenate the DataFrames to present a final table of all the above info 
    match_info = pd.concat([date,home_team,score,away_team,competition],ignore_index=False,axis=1)
    
    print(match_info)
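
Because get_match_info collects each column independently, the concatenated table lines up only if every class matches the same number of elements. As an alternative sketch (not the answerer's method), the rows can be walked one at a time and written out with the standard csv module that the question already imports. The sample HTML, the tr.report row selector applied to it, and the rows_to_csv helper are assumptions for illustration.

```python
import csv
import io
from bs4 import BeautifulSoup

# Invented sample row; the real page layout may differ.
SAMPLE_HTML = """
<table class="table-stats">
  <tr class="report">
    <td class="match-date">Sat 22 Aug</td>
    <td><span class="team-home teams"> <a href="#">Birmingham</a> </span></td>
    <td><span class="score"><abbr> 1-1 </abbr></span></td>
    <td><span class="team-away teams"> <a href="#">Derby</a> </span></td>
    <td class="match-competition">Championship</td>
  </tr>
</table>
"""

# (column name, CSS selector) pairs, evaluated relative to each row.
FIELDS = [
    ("Date", "td.match-date"),
    ("Home Team", "span.team-home"),
    ("Score", "span.score"),
    ("Away Team", "span.team-away"),
    ("Competition", "td.match-competition"),
]

def rows_to_csv(html, out):
    """Write one CSV line per report row; missing cells become empty strings."""
    soup = BeautifulSoup(html, "html.parser")
    writer = csv.writer(out)
    writer.writerow([name for name, _ in FIELDS])
    for row in soup.select("table.table-stats tr.report"):
        cells = []
        for _, selector in FIELDS:
            tag = row.select_one(selector)
            cells.append(tag.get_text(strip=True) if tag is not None else "")
        writer.writerow(cells)

buffer = io.StringIO()
rows_to_csv(SAMPLE_HTML, buffer)
print(buffer.getvalue())
```

Passing an open file (opened with newline="") instead of the StringIO buffer would write the same rows to disk.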
    

    【Discussion】:
