【Question title】: Extracting image caption and image url using BeautifulSoup
【Posted】: 2017-09-17 11:58:06
【Question description】:

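I am trying to use BeautifulSoup to extract the image URL and image caption from an article. I can isolate the article's image URL and caption from the surrounding HTML, but I can't figure out how to separate those two values from their HTML tags. Here is my code: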

from bs4 import BeautifulSoup
import requests

# The URL has to be one unbroken string literal; it cannot wrap across lines.
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
# Collect every <div class="image"> element on the page.
links = soup.find_all('div', {'class': 'image'})

The two pieces I am trying to extract are the src= and title= attributes. Any ideas on how to parse out these two values would be greatly appreciated.

【Question discussion】:

    Tags: python html parsing beautifulsoup


    【Solution 1】:

    Try the following to extract all of the image tags:

    imgs = soup.find_all('img')
    # find_all returns a list of tags, so loop over it to read each tag's attributes
    for img in imgs:
        src = img.get('src')
        title = img.get('title')
        print(src, title)
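
The looping mentioned in the comment of the snippet above can be sketched as follows. This is only an illustration under stated assumptions: `SAMPLE_HTML` and `extract_images` are hypothetical names, the markup is made up so the example runs without a network request, and the built-in `html.parser` is used so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page (an assumption).
SAMPLE_HTML = """
<div class="image"><img src="http://example.com/a.jpg" title="First caption"></div>
<div class="image"><img src="http://example.com/b.jpg"></div>
"""

def extract_images(html):
    """Return a list of (src, title) pairs, one per <img> tag in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    # Tag.get returns None when an attribute is missing instead of raising KeyError.
    return [(img.get("src"), img.get("title")) for img in soup.find_all("img")]

pairs = extract_images(SAMPLE_HTML)
for src, title in pairs:
    print(src, title)
```

Using `.get()` rather than `img['title']` keeps the loop from stopping on an image that has no `title` attribute.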
    

    【Discussion】:

      【Solution 2】:
      from bs4 import BeautifulSoup
      import requests
      url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
      r = requests.get(url)
      html = r.text
      soup = BeautifulSoup(html, 'lxml')
      links = soup.find_all('div', {'class': 'image'})
      # print is a function in Python 3, so the list comprehensions go inside parentheses
      print([i.find('img')['src'] for i in links])
      print([i.find('img')['title'] for i in links])
      

      【Discussion】:

      • The correct parser is html5lib rather than lxml, which is for XML
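
On the parser point raised in the comment above: Beautiful Soup accepts `'html.parser'`, `'lxml'`, and `'html5lib'` as HTML parsers, while `'lxml-xml'` (or `'xml'`) is the XML mode, so either choice may work for this page. A rough way to check on your own setup is sketched below. `SAMPLE_HTML` and `src_per_parser` are hypothetical names, and the markup is an assumption that only mimics the div/img shape from the question.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup with the same div/img shape as in the question.
SAMPLE_HTML = '<div class="image"><img src="http://example.com/a.jpg" title="A caption"></div>'

def src_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each installed parser name to the first image src it finds."""
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser library is not installed here, so skip it.
            continue
        img = soup.find("img")
        results[name] = img.get("src") if img is not None else None
    return results

results = src_per_parser(SAMPLE_HTML)
print(results)
```

If every installed parser reports the same `src`, the parser choice is unlikely to be the cause of a failed extraction.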
      【Solution 3】:

      A late answer, but you can use:

      from bs4 import BeautifulSoup
      import requests
      url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
      r = requests.get(url)
      html = r.text
      soup = BeautifulSoup(html, "html5lib")
      links = soup.find_all('div', {'class': 'image'})
      if links:
          print(links[0].find('img')['src'])
          print(links[0].find('img')['title'])
      

      Output:

      http://mma.prnewswire.com/media/491859/Koert_van_Mensvoort.jpg?w=950

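      Dutch philosopher Koert van Mensvoort – founder of the Next Nature Network and fellow at the Eindhoven University of Technology – wrote a "Letter to Humanity" in support of International Earth Day. (PRNewsfoto/Next Nature Network)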
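
The answer above prints only the first matching div. If a page has several, or if some `div.image` blocks contain no `<img>` at all, a slightly more defensive variant might look like the sketch below. `SAMPLE_HTML` and `images_in_image_divs` are hypothetical names, the markup is invented for illustration, and `html.parser` is used only so the example runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption): one image div has no <img>,
# and one <img> sits outside any div with class "image".
SAMPLE_HTML = """
<div class="image"><img src="http://example.com/a.jpg" title="First caption"></div>
<div class="image"><p>no image here</p></div>
<div class="other"><img src="http://example.com/skip.jpg" title="Outside"></div>
<div class="image"><img src="http://example.com/b.jpg" title="Second caption"></div>
"""

def images_in_image_divs(html):
    """Return src/title dicts for the first <img> inside each div.image."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for div in soup.find_all("div", class_="image"):
        img = div.find("img")
        if img is None:
            # find returns None when the div holds no <img>; skip such divs.
            continue
        found.append({"src": img.get("src"), "title": img.get("title")})
    return found

found = images_in_image_divs(SAMPLE_HTML)
for item in found:
    print(item["src"], item["title"])
```

The `None` check matters because indexing the result of `find` directly, as in `div.find('img')['src']`, raises a `TypeError` when no image is present.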

      【Discussion】:
