【Question title】: How to get attribute data using Python Beautiful Soup
【Posted】: 2020-07-05 20:12:15
【Question】:

Hello, I am trying to scrape data from IMDb with Python and Beautiful Soup. Following the online documentation, I was able to retrieve all of the data with this code:

from requests import get
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
response = get(url)

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

movie_containers = html_soup.find_all('div', class_ = 'image')
print(movie_containers)

With the code above I get a list of every div with the class image, like this:

<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
<div>S1, Ep2</div>
</div>
</a> </div>

However, I only want the value of the data-const attribute from each result, not the whole HTML. Expected result: tt1486497, tt1485650

【Comments】:

    Tags: python beautifulsoup pycharm web-crawler


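    【Solution 1】:

    Search for the class names of the inner div instead, since that is the element carrying the data-const attribute.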

    from bs4 import BeautifulSoup
    
    html = """<div class="image">
    <a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
    <img alt="Pilot" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BNTExMDIwNTUyNF5BMl5BanBnXkFtZTcwNzU5MDg1Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
    <div>S1, Ep1</div>
    </div>
    </a> </div>
    <div class="image">
    <a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
    <img alt="The Night of the Comet" class="zero-z-index" height="126" src="https://m.media-amazon.com/images/M/MV5BMjIyNDczNDYzNV5BMl5BanBnXkFtZTcwNDk1MDQ4Mg@@._V1_UX224_CR0,0,224,126_AL_.jpg" width="224"/>
    <div>S1, Ep2</div>
    </div>
    </a> </div>"""
    
    soup = BeautifulSoup(html, "lxml")
    
    for div in soup.find_all("div", attrs={"class":"hover-over-image zero-z-index"}):
        print(div["data-const"])
    

    Output:

    tt1486497
    tt1485650
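
A variation on the same idea, offered only as a sketch: rather than matching the exact class string, select every div that has a data-const attribute at all. The helper name `extract_data_consts` is hypothetical, the markup is a trimmed copy of the sample from the question, and the built-in `html.parser` is assumed so that lxml does not need to be installed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the question (img tags omitted).
html = """
<div class="image">
<a href="/title/tt1486497/" itemprop="url" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" itemprop="url" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<div>S1, Ep2</div>
</div>
</a> </div>
"""


def extract_data_consts(markup):
    """Return the data-const value of every div that carries the attribute."""
    soup = BeautifulSoup(markup, "html.parser")
    # attrs={"data-const": True} matches any div where the attribute is present,
    # regardless of which classes the div has.
    divs = soup.find_all("div", attrs={"data-const": True})
    return [div["data-const"] for div in divs]


ids = extract_data_consts(html)
print(", ".join(ids))
```

Matching on the attribute itself keeps working if the class list on the div changes, at the cost of also picking up any other div on the page that happens to define data-const.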
    

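    【Comments】:

      【Solution 2】:

      Try it this way:

      # find_all() returns a ResultSet (a list), which has no select();
      # call select() on the soup object from the question instead.
      for dc in html_soup.select('div.hover-over-image'):
          print(dc['data-const'])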
      

      Output:

      tt1486497
      tt1485650
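
If you would rather keep the `movie_containers` list from the question, a minimal sketch is to loop over it and search inside each container. The helper name `data_consts_from` is hypothetical, and the third container in the sample markup is made up to show what happens when the inner div is missing.

```python
from bs4 import BeautifulSoup

# Two containers modeled on the question, plus one invented container
# that has no inner div, to exercise the "missing" branch.
html = """
<div class="image">
<a href="/title/tt1486497/" title="Pilot"> <div class="hover-over-image zero-z-index" data-const="tt1486497">
<div>S1, Ep1</div>
</div>
</a> </div>
<div class="image">
<a href="/title/tt1485650/" title="The Night of the Comet"> <div class="hover-over-image zero-z-index" data-const="tt1485650">
<div>S1, Ep2</div>
</div>
</a> </div>
<div class="image"><a href="/title/placeholder/"></a></div>
"""

soup = BeautifulSoup(html, "html.parser")
movie_containers = soup.find_all("div", class_="image")


def data_consts_from(containers):
    """Collect data-const from the first matching inner div of each container."""
    values = []
    for container in containers:
        # Each item in the ResultSet is a Tag, so select_one() is available here.
        inner = container.select_one("div.hover-over-image")
        if inner is None:
            continue  # container without the inner div
        value = inner.get("data-const")  # .get() returns None instead of raising
        if value is not None:
            values.append(value)
    return values


print(data_consts_from(movie_containers))
```

Using `.get("data-const")` instead of `["data-const"]` avoids a KeyError on elements that lack the attribute.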
      

      【Comments】:

        【Solution 3】:

        I suggest using requests-html. It is more intuitive than using Beautiful Soup alone.

        Example:

        from requests_html import HTMLSession
        
        url = 'https://www.imdb.com/title/tt1405406/episodes?season=1&ref_=tt_eps_sn_1'
        session = HTMLSession()
        
        response = session.get(url)
        html = response.html
        
        # requests-html provides find() (not find_all()); it returns a list of elements.
        # data-const sits on the inner div, not on the <a> tag, so select that div directly.
        hoverDivs = html.find("div.hover-over-image")
        
        dataConsts = [el.attrs["data-const"] for el in hoverDivs]
        print(dataConsts)
        

        This should do what you need, but I was not able to test it.

        Good luck!

        【Comments】:
