【Question title】: How can I parse specific section data from a webpage?
【Posted】: 2018-08-20 02:39:57
【Question description】:

I am trying to parse specific content from a page using Beautiful Soup. Could you tell me how I can do that? Code:

import re
import pytz
import requests
import datetime
from flask import url_for
from bs4 import BeautifulSoup
from urllib.parse import urljoin


link = "http://www.espncricinfo.com/series/_/id/8038/season/2018/icc-world-cup-qualifiers/"

r = requests.get(link)
bigbash_article_html = r.text

soup = BeautifulSoup(bigbash_article_html, "html.parser")


details = soup.find("div",{"class":"module-list performers"})
bigbash_article_dict = {}


for div in details:
    image_div = div.find("div", {"class": "img-container player"})

I don't know how to continue from here. I would like the output to look like the following.

Expected output:

Top Run Scorers:

[{'playerimage':'http://a.espncdn.com/combiner/i?img=/i/headshots/cricket/players/default-player-logo-500.png&h=55&w=40&scale=crop&transparent=true','playername':'TP Ura','player-details':'PNG, Right-hand bat','runs':'188','innings':'2','Average':'94.00'},..............................................................................................}]

And the same for the other column, Top Wicket Takers:

[{'playerimage':'http://a.espncdn.com/combiner/i?img=/i/headshots/cricket/players/default-player-logo-500.png&h=55&w=40&scale=crop&transparent=true','playername':'Ehsan Khan','player-details':'HKG, Right-arm offbreak','wickets':'9','innings':'3','Average':'12.55'},..............................................................................................}]

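【Comments】:

  • There doesn't seem to be any div with that class on the page. When you simply fetch and save the HTML (with Python, curl, or anything else) and open it in an editor, do you see such an element? If not, BeautifulSoup obviously won't see it either.
  • If the page is generated dynamically (for example, some JavaScript runs after the page loads and adds a bunch of "modules" full of new divs), then you won't be able to do anything this way. (You could run a JS engine from Python, or drive a browser. Or you could work out by hand what the JS code does and do the same in Python.) But first: have you checked whether ESPN has an API for this before trying to scrape? (And, if they don't have an API, does their ToS prohibit scraping?)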
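
The check suggested in these comments can be sketched with the standard library alone. This is only an illustration, not the commenter's code: `ClassFinder`, `tags_with_classes` and `sample_html` are hypothetical names, and the sample markup stands in for the page source you would save from `r.text`. It reports which tags in the static HTML carry all of the given classes, so an empty result suggests the element is either absent or added later by JavaScript.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Record every tag whose class attribute contains all wanted classes."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.matches = []  # names of tags that carry all wanted classes

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.matches.append(tag)


def tags_with_classes(html, *wanted):
    """Return the tag names in html that have every class in wanted."""
    finder = ClassFinder(wanted)
    finder.feed(html)
    return finder.matches


# Hypothetical snippet standing in for the saved page source.
sample_html = '<ul class="module-list performers"><li>TP Ura</li></ul>'
print(tags_with_classes(sample_html, "module-list", "performers"))
```

If the list contains a tag name other than the one passed to `soup.find`, the lookup is targeting the wrong tag rather than a missing class.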

Tags: python python-3.x parsing web-scraping beautifulsoup


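【Solution 1】:

Select all list items inside the elements that carry both the sub-module and performers classes, then parse the player details from each list item. For example: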

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.espncricinfo.com/series/_/id/8038/season/2018/icc-world-cup-qualifiers/")

soup = BeautifulSoup(r.text, "html.parser")

# Every <li> inside an element that has both the sub-module and performers classes
toprunners = soup.select(".sub-module.performers li")

def player(li):
    name_and_details = li.select_one('p')
    name = name_and_details.a
    # Text node that follows the link, e.g. ", PNG, Right-hand bat"
    details = name.next_sibling
    stats = li.select_one('.overall-stats p')
    img = li.select_one('.focus-image')

    return {
        'player_name': name.text,
        'player_details': details.strip(', '),
        'player_image': img.attrs['src'],
        # Holds runs for the batsmen and wickets for the bowlers
        'runs': name_and_details.next_sibling.text,
        'innings': stats.span.text,
        'average': stats.next_sibling.span.text,
    }

players = [player(li) for li in toprunners]

In[2]: print(players)

[{'player_name': 'TP Ura', 'player_details': 'PNG, Right-hand bat', 'player_image': 'http://a.espncdn.com/combiner/i?img=/i/headshots/cricket/players/default-player-logo-500.png&h=55&w=40&scale=crop&transparent=true', 'runs': '188', 'innings': '2', 'average': '94.00'}, {'player_name': 'Mohammad Nabi', 'player_details': 'AFG, Right-hand bat', 'player_image': 'http://a.espncdn.com/combiner/i?img=/i/headshots/cricket/players/25913.png&h=55&w=40&scale=crop&transparent=true', 'runs': '181', 'innings': '3', 'average': '60.33'}, {'player_name': 'SO Hetmyer', 'player_details': 'WI, Left-hand bat', 'player_image': 'http://a.espncdn.com/combiner/i?img=/i/headshots/cricket/players/default-player-logo-500.png&h=55&w=40&scale=crop&transparent=true', 'runs': '171', 'innings': '3', 'average': '57.00'}, {'player_name': 'Ehsan Khan', 'player_details': 'HKG, Right-arm offbreak', 'player_image': 'http://a.espncdn.com/combiner/i?img=/i/headshots/cricket/players/default-player-logo-500.png&h=55&w=40&scale=crop&transparent=true', 'runs': '9', 'innings': '3', 'average': '12.55'}, {'player_name': 'Mujeeb Ur Rahman', 'player_details': 'AFG, Right-arm offbreak', 'player_image': 'http://a.espncdn.com/combiner/i?img=/i/headshots/cricket/players/default-player-logo-500.png&h=55&w=40&scale=crop&transparent=true', 'runs': '8', 'innings': '3', 'average': '15.25'}, {'player_name': 'JO Holder', 'player_details': 'WI, Right-arm medium-fast', 'player_image': 'http://a.espncdn.com/combiner/i?img=/i/headshots/cricket/players/391485.png&h=55&w=40&scale=crop&transparent=true', 'runs': '7', 'innings': '3', 'average': '21.28'}]

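【Comments】:

  • How can I split the list of player dicts into groups of 3?
  • You can use itertools.zip_longest to produce groups of 3 players each. splitter = [iter(players)] * 3; threes = itertools.zip_longest(*splitter); print(list(threes))
  • I used a counter to do that. Your approach is interesting; I had never tried it before.
  • @steve, you can split the list with a = players[:3]; b = players[3:].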
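
Both suggestions from these comments can be tried side by side. The sketch below is a minimal illustration that assumes the six entries come back in the order shown in the output above; to keep it short, the `players` list holds only the names rather than the full dicts.

```python
from itertools import zip_longest

# Stand-in for the scraped list: names only, in the order shown in the output.
players = [
    "TP Ura", "Mohammad Nabi", "SO Hetmyer",
    "Ehsan Khan", "Mujeeb Ur Rahman", "JO Holder",
]

# Slicing: the first three are the run scorers, the rest the wicket takers.
top_run_scorers = players[:3]
top_wicket_takers = players[3:]

# zip_longest: three references to the SAME iterator, so each output tuple
# consumes three consecutive items; a short final group is padded with None.
splitter = [iter(players)] * 3
threes = list(zip_longest(*splitter))

print(top_run_scorers)
print(top_wicket_takers)
print(threes)
```

Slicing is the simpler choice when there are exactly two known groups; the `zip_longest` form generalizes to any number of groups of three.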
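【Solution 2】:

First, you are searching for the wrong tag. The content you want is inside <ul class="module-list performers">, not inside a div tag with the same class name.

The Top Run Scorers table is available inside the <div id="r-0"> tag. Each player sits inside an li tag, and you can get all of a player's details from that li tag.

I'll show you how to get the image, name and player details for the Top Run Scorers.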

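import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.espncricinfo.com/series/_/id/8038/season/2018/icc-world-cup-qualifiers')
soup = BeautifulSoup(r.text, 'lxml')  # the 'lxml' parser requires the lxml package to be installed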

top_run_scorers = []
for player in soup.find('div', id='r-0').find_all('li'):
    image = player.find('img')['src']
    info = player.find('div', class_='content-meta')
    name = info.find('a').text
    details = info.p.contents[-1]
    top_run_scorers.append({'playerimage': image, 'playername': name, 'player-details': details})

print(top_run_scorers)

Output:

[{'player-details': ', PNG, Right-hand bat',
  'playerimage': 'http://a.espncdn.com/combiner/i?img=/i/headshots/cricket/players/default-player-logo-500.png&h=55&w=40&scale=crop&transparent=true',
  'playername': 'TP Ura'},
 {'player-details': ', AFG, Right-hand bat',
  'playerimage': 'http://a.espncdn.com/combiner/i?img=/i/headshots/cricket/players/25913.png&h=55&w=40&scale=crop&transparent=true',
  'playername': 'Mohammad Nabi'},
 {'player-details': ', WI, Left-hand bat',
  'playerimage': 'http://a.espncdn.com/combiner/i?img=/i/headshots/cricket/players/default-player-logo-500.png&h=55&w=40&scale=crop&transparent=true',
  'playername': 'SO Hetmyer'}]

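【Comments】:

  • @steve, SO is not a code-writing service, so I won't show you how to get all the details. Using the code above, try to get the rest of the details. I think I've shown you how to proceed.
  • Can you explain what this line does: details = info.p.contents[-1]? Specifically the contents[-1] part.
  • Try printing the contents of any element with .contents. You will see a list of all of its children. The details you want are at the last index of that list. That's why [-1].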
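
To see what `.contents[-1]` picks out, it may help to run it on a small fragment. The markup in `snippet` is a hypothetical stand-in modelled on the player entries in the output above, not the real page source, and the built-in `html.parser` is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on one player entry; not copied from the page.
snippet = (
    '<div class="content-meta">'
    '<p><a href="#">TP Ura</a>, PNG, Right-hand bat</p>'
    '</div>'
)
info = BeautifulSoup(snippet, "html.parser").find("div", class_="content-meta")

# .contents is the list of the <p> tag's direct children:
# here an <a> tag followed by a text node.
children = info.p.contents
link, trailing_text = children[0], children[-1]

name = link.text
# Strip the leading ", " that separates the link from the details text.
details = str(trailing_text).strip(", ")

print(len(children), name, details)
```

The same `strip(", ")` call would also remove the leading comma visible in the 'player-details' values of the output above.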