Can't scrape "span" string element with beautifulsoup4

[Question title]: Can't scrape "span" string element beautifulsoup4
[Posted]: 2020-11-11 08:10:25
[Question description]:

This is the HTML of the website I want to scrape:

I want to scrape the string Megapolitan from the span itemprop tag, but I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-7-ceb296a3cb50> in <module>()
     15     soup = BeautifulSoup(r.text)
     16 
---> 17     cat = soup.find('span', {'itemprop','name'}).text
     18     content = soup.find('div', {'class','read__content'}).text
     19     times = soup.find('div', {'class', 'read__time'}).text

AttributeError: 'NoneType' object has no attribute 'text'

As far as I can tell, the string is not being scraped at all, because even when I remove .text the result is None. Here is my code:

kompasurl = ('https://www.kompas.com/', 'https://news.kompas.com/', 'https://www.kompas.com/hype', 'https://www.kompas.com/food')
arti = []
for url in kompasurl:
  kompas = requests.get(url)
  beau = BeautifulSoup(kompas.content)
  news = beau.find_all('div', {'class','most__list clearfix'})
  
  for each in news:
    nu = each.find('div', {'class','most__count'}).text
    title = each.find('h4', {'class','most__title'}).text
    lnk = each.a.get('href')

    rcount = each.find('div', {'class','most__read'}).text
    r = requests.get(lnk)
    soup = BeautifulSoup(r.text)

    cat = soup.find('span', {'itemprop','name'}).text
    content = soup.find('div', {'class','read__content'}).text
    times = soup.find('div', {'class', 'read__time'}).text

    print(nu)
    print(title)
    print(lnk)
    print(rcount)
    print(times)
    print("")

    arti.append({
      'Top Number': nu,
      'Headline': title,
      'Content':content,
      'Category' : cat,
      'Link': lnk,
      'Date': times,
      'Read Count': rcount
      
    })


df = pd.DataFrame(arti)
df.to_csv('kompas.csv', index=False)

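I need help solving this problem. Any help would be appreciated.

Edit: Here is one of their news article pages, which contains the HTML element I shared above. All of their news article pages have the same HTML elements.

All of the URLs in kompasurl are home pages of the news site. The individual news articles are not on those home pages; each one is on a separate page that the home page links to.

[Comments]:

Does kompas.content contain the span element? If not, you need to render the JavaScript first and then scrape.
The span element I want to scrape is on the news article page, whose link is scraped from kompas.content and whose response is stored in the r variable. All of the URLs in kompas.content are main sites that contain links to each published news article. Here is one of their news article pages, which contains the HTML element I shared above.
Please try the following experiment: import requests; res = requests.get("https://www.kompas.com/"); print(res.content). Now search for span itemprop. It is not there, hence my comment. If it is not present, you need to render the JavaScript first.
That is because it is not on kompas.com; it is on the news article page, whose URL comes from lnk = each.a.get('href'). I then request the article with r = requests.get(lnk) and parse it with soup = BeautifulSoup(r.text). The problem is not with kompas.com or any of the URLs in kompasurl, but with the news article pages.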

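Tags: python html web-scraping beautifulsoup

[Solution 1]:

The information is a bit confusing, but I think the following is the key point:

{'itemprop','name'}

The attribute name and value should be separated by : rather than , because with a comma the braces create a set of two strings, not a dict that maps itemprop to name.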
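To illustrate the difference, here is a minimal sketch that runs offline. The HTML string is a hypothetical stand-in, since the real markup is not shown in the question. As far as I can tell, BeautifulSoup treats a non-dict attrs argument as a filter on the class attribute, which would also explain why the {'class','read__content'} lookups in the question appear to work by accident while the itemprop lookup returns None.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article markup (the real HTML is not shown)
html = '<div class="read__content"><span itemprop="name">Megapolitan</span></div>'
soup = BeautifulSoup(html, "html.parser")

as_set = {'itemprop', 'name'}    # comma: a set containing two strings
as_dict = {'itemprop': 'name'}   # colon: a dict mapping attribute -> value

set_is_set = isinstance(as_set, set)

# With the set, no <span> matches, so find() returns None
miss = soup.find('span', as_set)

# With the dict, the span whose itemprop is "name" is found
hit = soup.find('span', as_dict)
hit_text = hit.text

print(type(as_set).__name__, miss)
print(type(as_dict).__name__, hit_text)
```

Calling .text on the None returned by the first lookup is what raises the AttributeError in the traceback.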

This should work (it returns only the first matching span):

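from bs4 import BeautifulSoup
import requests

# Fetch one article page directly (URL taken from the question)
res = requests.get("https://megapolitan.kompas.com/read/2020/11/11/05080981/selasa-malam-anies-temui-rizieq-shihab-di-petamburan")
# Name the parser explicitly to avoid the "no parser specified" warning
bs = BeautifulSoup(res.content, "html.parser")
# Pass attrs as a dict (key: value), not a set (key, value)
cat = bs.find('span', {'itemprop': 'name'}).text
print(cat)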

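To get all of the breadcrumb spans, or a specific one such as the third, adjust your find call accordingly (for example, use find_all and index into the result).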
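A rough sketch of that adjustment follows. The breadcrumb markup and its entries (Home, News, Megapolitan) are hypothetical, chosen only for illustration; the real page may order or nest its spans differently, so check the index against the actual HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical breadcrumb markup; the real page structure may differ
html = """
<ul class="breadcrumb">
  <li><a href="#"><span itemprop="name">Home</span></a></li>
  <li><a href="#"><span itemprop="name">News</span></a></li>
  <li><a href="#"><span itemprop="name">Megapolitan</span></a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching span instead of only the first one
spans = soup.find_all('span', {'itemprop': 'name'})
names = [s.get_text(strip=True) for s in spans]

# Pick the third entry, guarding against pages with fewer spans
third = names[2] if len(names) > 2 else None

print(names)
print(third)
```

Guarding the index keeps the scraper from crashing on a page whose breadcrumb is shorter than expected.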
