【Question title】: How to fix "AttributeError: 'NoneType' object has no attribute 'text'" for <span>
【Posted】: 2019-07-02 15:28:09
【Question】:

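I am building a web scraper with Python 3.7 and BeautifulSoup. From the HTML below I can extract the data for "posting-name", "sort-by-location posting-category small-category-label", and "sort-by-team posting-category small-category-label", but I can't extract "sort-by-commitment posting-category small-category-label" (full-time or not), even though the HTML structure appears to be the same as for the others: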

<div class="posting" data-qa-posting-id="13f9db2f-7a80-4b50-9a61-005ad322ea2d">
   <div class="posting-apply" data-qa="btn-apply">
      <a href="https://jobs.lever.co/twitch/13f9db2f-7a80-4b50-9a61-005ad322ea2d" class="posting-btn-submit template-btn-submit hex-color">Apply</a>
   </div>
   <a class="posting-title" href="https://jobs.lever.co/twitch/13f9db2f-7a80-4b50-9a61-005ad322ea2d">
      <h5 data-qa="posting-name">Account Director - DACH</h5>
      <div class="posting-categories">
         <span href="#" class="sort-by-location posting-category small-category-label">Hamburg, Germany</span>
         <span href="#" class="sort-by-team posting-category small-category-label">Business Operations &amp; Go-To-Market – Advertising</span>
         <span href="#" class="sort-by-commitment posting-category small-category-label">Full-time</span>
      </div>
   </a>
</div>

I tried creating a separate soup for "posting-categories", but that didn't work.

import requests
from bs4 import BeautifulSoup
from csv import writer

response = requests.get('https://jobs.lever.co/twitch')

soup = BeautifulSoup(response.text, 'html.parser')

posts = soup.findAll('div', {'class':'posting'})

with open('twitch.csv', 'w') as csv_file:
    csv_writer = writer(csv_file)

    headers = ['Position', 'Link', 'Location', 'Team', 'Commitment']

    csv_writer.writerow(headers)

    for post in posts:
        position = post.find('h5',{'data-qa':'posting-name'}).text
        link = post.find('a')['href']
        location = post.find('span',{'class':'sort-by-location posting-category small-category-label'}).text
        team = post.find('span',{'class':'sort-by-team posting-category small-category-label'}).text
        commitment = post.find('span',{'class':'sort-by-commitment posting-category small-category-label'}).text
        csv_writer.writerow([position, link, location, team, commitment])

The expected result in the CSV is the position title, link (URL), location, team, and commitment.

I now get the following error:

 commitment = post.find('span',{'class':'sort-by-commitment posting-category small-category-label'}).text
AttributeError: 'NoneType' object has no attribute 'text'

*Edit: the last row is missing from the dataset, and I don't know why:

<a class="posting-title" href="https://jobs.lever.co/twitch/c8cc56e7-75f6-4cac-9983-e0769db9dd2e">
   <h5 data-qa="posting-name">Applied Scientist Intern</h5>
   <div class="posting-categories">
      <span href="#" class="sort-by-location posting-category small-category-label">San Francisco, CA</span>
      <span href="#" class="sort-by-team posting-category small-category-label">University (Internships) – Engineering</span>
      <span href="#" class="sort-by-commitment posting-category small-category-label">Intern</span>

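【Comments】:

  • commitment = category.find( Shouldn't you use post.find instead? Your error and your code don't match.
  • It looks like commitment = post.find('span',{'class':'sort-by-commitment posting-category small-category-label'}).text raises the error because nothing was found during the find, so it returned None
  • Your sample HTML has no div 'posting' tag, so posts = soup.findAll('div', {'class':'posting'}) produces an empty list. Please provide a minimal reproducible example
  • @abdusco Thanks for pointing that out. That was the error message from one of my iterations while I was trying to solve the problem. I have fixed it in the body.
  • @wwii I have now fixed the sample HTML.

Tags: python python-3.x web-scraping beautifulsoup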


【Solution 1】:

If you inspect the HTML, the commitment span is missing for some postings, so you need an if condition to handle that case. Try the code below.

for post in posts:
    position = post.find('h5',{'data-qa':'posting-name'}).text
    link = post.find('a')['href']
    location = post.find('span',{'class':'sort-by-location posting-category small-category-label'}).text
    team = post.find('span',{'class':'sort-by-team posting-category small-category-label'}).text
    # find() returns None when the span is missing, so check before reading .text
    commitment_tag = post.find('span',{'class':'sort-by-commitment posting-category small-category-label'})
    # Use an empty string for postings without a commitment so the row is still written
    commitment = commitment_tag.text if commitment_tag else ''
    csv_writer.writerow([position, link, location, team, commitment])
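The None check above can be factored into a small helper. The following is only a sketch: `text_or_default` and the inline `SAMPLE_HTML` are hypothetical names and data made up for illustration, not part of the original page or of BeautifulSoup itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second posting has no commitment span.
SAMPLE_HTML = """
<div class="posting">
  <h5 data-qa="posting-name">Account Director - DACH</h5>
  <span class="sort-by-commitment posting-category small-category-label">Full-time</span>
</div>
<div class="posting">
  <h5 data-qa="posting-name">Applied Scientist Intern</h5>
</div>
"""


def text_or_default(parent, name, attrs, default=''):
    """Return the stripped text of the first matching tag, or `default` when find() returns None."""
    tag = parent.find(name, attrs)
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = []
for post in soup.find_all('div', {'class': 'posting'}):
    position = text_or_default(post, 'h5', {'data-qa': 'posting-name'})
    # A single class name is enough: BeautifulSoup matches it against each class of the tag.
    commitment = text_or_default(post, 'span', {'class': 'sort-by-commitment'})
    rows.append([position, commitment])

print(rows)
```

With this approach a posting without a commitment still produces a row, just with an empty last column.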

I would rather suggest using CSS selectors instead of find:

import requests
from bs4 import BeautifulSoup
from csv import writer

response = requests.get('https://jobs.lever.co/twitch')
soup = BeautifulSoup(response.text, 'html.parser')
posts = soup.select('div.posting')

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('twitch.csv', 'w', newline='') as csv_file:
    csv_writer = writer(csv_file)

    headers = ['Position', 'Link', 'Location', 'Team', 'Commitment']

    csv_writer.writerow(headers)


    for post in posts:
        position = post.select_one('h5[data-qa="posting-name"]').text
        link = post.select_one('a')['href']
        location = post.select_one('.sort-by-location').text
        team = post.select_one('.sort-by-team').text
        # select_one() returns None when nothing matches; fall back to an empty
        # string so no undefined or stale value from a previous posting is written.
        commitment_tag = post.select_one('.sort-by-commitment')
        commitment = commitment_tag.text if commitment_tag else ''
        csv_writer.writerow([position, link, location, team, commitment])
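To try the selector approach without hitting the live site, here is a minimal offline sketch. It assumes a made-up `SAMPLE_HTML` string with example.com links and a hypothetical `select_text` helper, and it writes the CSV to an in-memory buffer instead of twitch.csv.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup; the second posting has no commitment span.
SAMPLE_HTML = """
<div class="posting">
  <a class="posting-title" href="https://example.com/jobs/1">
    <h5 data-qa="posting-name">Account Director - DACH</h5>
    <span class="sort-by-location posting-category small-category-label">Hamburg, Germany</span>
    <span class="sort-by-team posting-category small-category-label">Advertising</span>
    <span class="sort-by-commitment posting-category small-category-label">Full-time</span>
  </a>
</div>
<div class="posting">
  <a class="posting-title" href="https://example.com/jobs/2">
    <h5 data-qa="posting-name">Applied Scientist Intern</h5>
    <span class="sort-by-location posting-category small-category-label">San Francisco, CA</span>
    <span class="sort-by-team posting-category small-category-label">Engineering</span>
  </a>
</div>
"""


def select_text(parent, selector, default=''):
    """Return the stripped text of the first CSS match, or `default` if there is none."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
buffer = io.StringIO()
csv_writer = csv.writer(buffer)
csv_writer.writerow(['Position', 'Link', 'Location', 'Team', 'Commitment'])

for post in soup.select('div.posting'):
    csv_writer.writerow([
        select_text(post, 'h5[data-qa="posting-name"]'),
        post.select_one('a.posting-title')['href'],
        select_text(post, '.sort-by-location'),
        select_text(post, '.sort-by-team'),
        select_text(post, '.sort-by-commitment'),  # '' when the span is absent
    ])

csv_text = buffer.getvalue()
print(csv_text)
```

A selector such as `.sort-by-commitment` matches on one class only, so it keeps working if the site adds or reorders the other classes on the span.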

【Comments】:

    【Solution 2】:

    You can also use try/except:

    for post in posts:
        try:
            position = post.find('h5',{'data-qa':'posting-name'}).text
            link = post.find('a')['href']
            location = post.find('span',{'class':'sort-by-location posting-category small-category-label'}).text
            team = post.find('span',{'class':'sort-by-team posting-category small-category-label'}).text
            commitment = post.find('span',{'class':'sort-by-commitment posting-category small-category-label'}).text
            csv_writer.writerow([position, link, location, team, commitment])
        except AttributeError:
            # find() returned None for one of the fields; skip this posting
            continue
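Wrapping the whole loop body in one try block drops every posting that lacks any single field. A possible variation, sketched here under the assumption that only the commitment is optional, is to guard each lookup separately. `SAMPLE_HTML`, `rows`, and `skipped` are hypothetical names used only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one complete posting, one without a
# commitment span, and one without a title.
SAMPLE_HTML = """
<div class="posting">
  <h5 data-qa="posting-name">Account Director - DACH</h5>
  <span class="sort-by-commitment posting-category small-category-label">Full-time</span>
</div>
<div class="posting">
  <h5 data-qa="posting-name">Applied Scientist Intern</h5>
</div>
<div class="posting">
  <span class="sort-by-commitment posting-category small-category-label">Intern</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = []
skipped = 0

for post in soup.find_all('div', {'class': 'posting'}):
    try:
        position = post.find('h5', {'data-qa': 'posting-name'}).text
    except AttributeError:
        # Assumption: a posting without a title is not useful, so skip it.
        skipped += 1
        continue

    try:
        commitment = post.find('span', {'class': 'sort-by-commitment'}).text
    except AttributeError:
        # The optional field is missing: keep the row with an empty value.
        commitment = ''

    rows.append([position, commitment])

print(rows, skipped)
```

Catching only AttributeError also keeps unrelated problems, such as a KeyboardInterrupt or a typo in a variable name, from being silently swallowed.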
    

    【Comments】:
