【Question title】: Beautiful soup - Find children tag attribute content
【Posted】: 2018-07-06 18:08:24
【Question】:

HTML source:

<div class="wrapper">
    <div id="mask" style="display: none;"></div>
    <div id="video">
        <span id="pid" hidden="">2</span>
        <div poster="https://thumbs.vodgc.net/57377706F7D28069F41A23A14DC5CC64.jpg?673333" autoplay="true" data-setup="{ &quot;techOrder&quot;: [&quot;html5&quot;]}"
            preload="none" class="video-js vjs-default-skin vjs-controls-enabled vjs-workinghover vjs-has-started media_player-dimensions vjs-paused vjs-user-inactive"
            id="media_player" role="region" aria-label="video player">
            <video id="media_player_html5_api" class="vjs-tech" preload="none" data-setup="{ &quot;techOrder&quot;: [&quot;html5&quot;]}"
                autoplay="" src="blob:https://api.vodgc.net/5bb5a7a7-6c9b-49f1-883b-784871f95d8b">
                <source src="https://vod.vodgc.net/manifest/57377706F7D28069F41A23A14DC5CC64.m3u8" type="application/x-mpegURL">
            </video>
            <div>

I am trying to read the content of the "src" attribute of the "source" tag, but the result is either None or an empty list.

Here is my code:

from urllib import request
from bs4 import BeautifulSoup

hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

url = 'https://www.eltrecetv.com.ar/programas/simona/capitulos-completos/capitulo-4_099474'
req = request.Request(url, headers=hdr)

page = request.urlopen(req)

soup = BeautifulSoup(page,'lxml')


sources = soup.find('div', class_ ='wrapper')

for tag in sources:
    video = tag.find_next_siblings('video')
    print(video)

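【Comments】:

  • Why not just use requests? Sorry, that is a bit off-topic.
  • I would suggest using a dict rather than the id directly to look it up. That might solve the problem.

Tags: python html web-scraping beautifulsoup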


【Solution 1】:

Access the src attribute by passing the source tag name to the find_all method:

from bs4 import BeautifulSoup as soup

s = """
<div class="wrapper">
<div id="mask" style="display: none;"></div>
<div id="video">
    <span id="pid" hidden="">2</span>
    <div poster="https://thumbs.vodgc.net/57377706F7D28069F41A23A14DC5CC64.jpg?673333" autoplay="true" data-setup="{ &quot;techOrder&quot;: [&quot;html5&quot;]}"
        preload="none" class="video-js vjs-default-skin vjs-controls-enabled vjs-workinghover vjs-has-started media_player-dimensions vjs-paused vjs-user-inactive"
        id="media_player" role="region" aria-label="video player">
        <video id="media_player_html5_api" class="vjs-tech" preload="none" data-setup="{ &quot;techOrder&quot;: [&quot;html5&quot;]}"
            autoplay="" src="blob:https://api.vodgc.net/5bb5a7a7-6c9b-49f1-883b-784871f95d8b">
            <source src="https://vod.vodgc.net/manifest/57377706F7D28069F41A23A14DC5CC64.m3u8" type="application/x-mpegURL">
        </video>
        <div>
"""
d = soup(s, 'lxml')
print([i['src'] for i in d.find_all('source')])

Output:

['https://vod.vodgc.net/manifest/57377706F7D28069F41A23A14DC5CC64.m3u8']
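A likely reason the loop in the question prints empty lists: find_next_siblings only looks at nodes that follow at the same nesting level, whereas video and source are nested deeper inside the wrapper's children. The following is a minimal sketch of that difference, using a trimmed copy of the markup and the built-in html.parser (an assumption made here so that lxml is not required). It mirrors the question's loop, then searches descendants with find_all and with a CSS selector.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (most attributes omitted).
html = """
<div class="wrapper">
  <div id="mask"></div>
  <div id="video">
    <video id="media_player_html5_api">
      <source src="https://vod.vodgc.net/manifest/57377706F7D28069F41A23A14DC5CC64.m3u8" type="application/x-mpegURL">
    </video>
  </div>
</div>
"""

doc = BeautifulSoup(html, "html.parser")
wrapper = doc.find("div", class_="wrapper")

# Same shape as the question's loop: iterating a Tag yields its direct
# children, and none of them has a <video> sibling at its own level.
sibling_hits = [child.find_next_siblings("video") for child in wrapper]

# Descendant search: find_all walks the whole subtree under the wrapper.
# .get returns None instead of raising if a tag has no src attribute.
srcs = [tag.get("src") for tag in wrapper.find_all("source")]

# Equivalent CSS selector form, restricted to <source> tags that have src.
css_srcs = [tag["src"] for tag in doc.select("div.wrapper video > source[src]")]

print(srcs)
```

Either descendant form works here; the selector version simply states the expected nesting explicitly.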

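【Comments】:

  • I think the OP expects to get the same result directly from the web page.
  • I know this is off-topic, but is importing BeautifulSoup as soup good practice?
  • @KeyurPotdar It just creates an alias for the BeautifulSoup class in the current namespace. It is entirely optional, but I find it shorter and easier when creating BeautifulSoup objects.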
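
On the point raised in the comments about fetching from the network: the same extraction can be wrapped in a small function that takes whatever HTML was downloaded and returns an empty list when no source tag is present. This is only a sketch; extract_source_urls and fetch_html are hypothetical names that do not appear in the original posts, and html.parser is assumed instead of lxml. If the function returns an empty list for the live page, one possible explanation (not confirmed in this thread) is that the player markup is added by JavaScript after load and is therefore absent from the raw response.

```python
from urllib import request
from bs4 import BeautifulSoup


def extract_source_urls(html):
    """Return the src of every <source> tag that has one (hypothetical helper)."""
    doc = BeautifulSoup(html, "html.parser")
    # src=True keeps only tags that actually carry a src attribute.
    return [tag["src"] for tag in doc.find_all("source", src=True)]


def fetch_html(url, headers):
    """Download a page with urllib and return the raw bytes (not called here)."""
    req = request.Request(url, headers=headers)
    with request.urlopen(req) as page:
        return page.read()


# Offline checks on literal strings, so no network access is needed.
print(extract_source_urls('<video><source src="a.m3u8"></video>'))  # ['a.m3u8']
print(extract_source_urls('<div id="video"></div>'))                # []
```

With the question's own url and hdr values, the call would be extract_source_urls(fetch_html(url, hdr)).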