Attribute error in web scraping in Python: answers

【Question title】: Attribute error in web scraping in Python
【Posted】: 2022-01-21 20:50:27
【Question】:

I wrote some code to scrape the website https://books.toscrape.com/catalogue/page-1.html but I get this error message:

'NoneType' object has no attribute 'text'

I can't find a solution. How can I resolve this error?

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    
    all_books=[]
    
    url='https://books.toscrape.com/catalogue/page-1.html'
    headers=('https://developers.whatismybrowser.com/useragents/parse/22526098chrome-windows-blink')
    def get_page(url):
        page=requests.get(url,headers)
        status=page.status_code
        soup=BeautifulSoup(page.text,'html.parser')
        return [soup,status]
    
    #get all books links
    def get_links(soup):
        links=[]
        listings=soup.find_all(class_='product_pod')
        for listing in listings:
            bk_link=listing.find("h3").a.get("href")
            base_url='https://books.toscrape.com/catalogue/page-1.html'
            cmplt_link=base_url+bk_link
            links.append(cmplt_link)
        return links
        
    #extraxt info from each link
    def extract_info(links):
        for link in links:
            r=requests.get(link).text
            book_soup=BeautifulSoup(r,'html.parser')
    
            name=book_soup.find(class_='col-sm-6 product_main').text.strip()
            price=book_soup.find(class_='col-sm-6 product_main').text.strip()
            desc=book_soup.find(class_='sub-header').text.strip()
            cat=book_soup.find('"../category/books/poetry_23/index.html">Poetry').text.strip()
            book={'name':name,'price':price,'desc':desc,'cat':cat}
            all_books.append(book)
    
    pg=48
    while True:
        url=f'https://books.toscrape.com/catalogue/page-{pg}.html'
        soup_status=get_page(url)
        if soup_status[1]==200:
            print(f"scrapping page{pg}")
            extract_info(get_links(soup_status[0]))
            pg+=1
        else:
            print("The End")
            break
    
    df=pd.DataFrame(all_books)
    print(df)

【Comments】:

Please add the full error details.

Tags: python pandas csv web-scraping beautifulsoup

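【Solution 1】:

Note: First of all, always take a look at your soup, because that is the truth. Its content can differ slightly from what you see in the browser's dev tools.

What happens?

There are several separate issues to keep in mind:

base_url='https://books.toscrape.com/catalogue/page-1.html' leads to 404 errors, because the book link is appended after page-1.html. The 404 page does not contain the elements you are looking for, which is the first cause of your "'NoneType' object has no attribute 'text'".
You try to find the category with cat=book_soup.find('"../category/books/poetry_23/index.html">Poetry').text.strip(). This does not work, because find() treats that string as a tag name, matches nothing, and returns None, which causes the same error.
Some of the other selections also do not give the expected result. Take a look at my example, where I edited them to give you a clue how to reach your goal.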
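
To see where the message itself comes from: a lookup that matches nothing returns None, and reading .text on None raises the AttributeError. This is a minimal sketch that reproduces the message without any network access or parsing. The name `missing` is only a stand-in for the result of a failed lookup and does not appear in the original code.

```python
# Stand-in for what book_soup.find(...) gives back when no element matches.
missing = None

try:
    missing.text  # same attribute access as in the question's code
except AttributeError as exc:
    message = str(exc)

print(message)
```

Checking the lookup result for None before touching .text avoids the exception.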

How to fix it?

Change base_url='https://books.toscrape.com/catalogue/page-1.html' to base_url='https://books.toscrape.com/catalogue/'
Select the category more specifically. It is the last <a> in the breadcrumb:
```
cat=book_soup.select('.breadcrumb a')[-1].text.strip()
```
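
As an alternative to hard-coding the base URL, the relative href can be resolved against the page it was found on with the standard library's urljoin. This is a different technique from the one used in this answer, shown only as a sketch. The href value below is an assumed example of what a product_pod link looks like, and `page_url`, `broken`, and `fixed` are illustrative names.

```python
from urllib.parse import urljoin

page_url = 'https://books.toscrape.com/catalogue/page-1.html'
# Assumed example of a relative href taken from a product_pod listing.
bk_link = 'a-light-in-the-attic_1000/index.html'

# Plain concatenation keeps "page-1.html" in the path, so the URL does not exist.
broken = page_url + bk_link
# urljoin resolves the href relative to the page URL and drops "page-1.html".
fixed = urljoin(page_url, bk_link)

print(broken)
print(fixed)
```

With this approach the same code keeps working for any listing page, since the base is derived from the page URL instead of being typed in by hand.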

Example

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd


    all_books=[]

    url='https://books.toscrape.com/catalogue/page-1.html'
    # requests expects headers as a dict; this User-Agent value is only a placeholder
    headers={'User-Agent':'Mozilla/5.0'}
    def get_page(url):
        # pass headers by keyword: the second positional argument of requests.get is params
        page=requests.get(url,headers=headers)
        status=page.status_code
        soup=BeautifulSoup(page.text,'html.parser')
        return [soup,status]

    #get all book links
    def get_links(soup):
        links=[]
        listings=soup.find_all(class_='product_pod')
        for listing in listings:
            bk_link=listing.find("h3").a.get("href")
            base_url='https://books.toscrape.com/catalogue/'
            cmplt_link=base_url+bk_link
            links.append(cmplt_link)
        return links

    #extract info from each link
    def extract_info(links):
        for link in links:
            r=requests.get(link).text
            book_soup=BeautifulSoup(r,'html.parser')
            # each field falls back to None when its selector matches nothing
            name= name.text.strip() if (name := book_soup.h1) else None
            price= price.text.strip() if (price := book_soup.select_one('h1 + p')) else None
            desc= desc.text.strip() if (desc := book_soup.select_one('#product_description + p')) else None
            # check the list before indexing, so an empty breadcrumb does not raise IndexError
            cat= crumbs[-1].text.strip() if (crumbs := book_soup.select('.breadcrumb a')) else None
            book={'name':name,'price':price,'desc':desc,'cat':cat}
            all_books.append(book)

    pg=48
    while True:
        url=f'https://books.toscrape.com/catalogue/page-{pg}.html'
        soup_status=get_page(url)
        if soup_status[1]==200:
            print(f"scraping page {pg}")
            extract_info(get_links(soup_status[0]))
            pg+=1
        else:
            print("The End")
            break

    print(all_books)

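【Comments】:

It still raises the error 'NoneType' object has no attribute 'text'. Not for name and price, though: the error comes from desc.
My bad, I added the wrong version. I edited the example to catch this case: desc= desc.text.strip() if (desc := book_soup.select_one('#product_description + p')) else None
Thanks, that helped me a lot. My code is running now.
Can you help me run this code on Heroku to produce a CSV file?
That would be better asked as a new question focused on Heroku, and we will take a look. Each question should address a single problem, to keep questions and answers concise and to get the best answers. Thanks a lot.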

【Solution 2】:

When you need to grab the text of an element, use the function below.
It protects you from elements that are None.

    def get_text(book_soup,clazz):
        ele = book_soup.find(class_=clazz)
        return ele.text.strip() if ele is not None else ''

Example. Instead of

    name=book_soup.find(class_='col-sm-6 product_main').text.strip()

do

    name=get_text(book_soup,'col-sm-6 product_main')
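
Returning an empty string makes a missing element look the same as an element with no text. The sketch below is a variant that guards the element itself and returns a configurable default, so missing fields can be listed afterwards. The name `text_or_default` is hypothetical, and SimpleNamespace objects stand in for parsed elements so the snippet runs without a network request.

```python
from types import SimpleNamespace


def text_or_default(element, default=None):
    """Return the stripped text of element, or default when the lookup found nothing."""
    return element.text.strip() if element is not None else default


# Stand-ins for lookup results: one element that was found, one failed lookup.
found = SimpleNamespace(text='  A Light in the Attic \n')
not_found = None

book = {
    'name': text_or_default(found),
    'desc': text_or_default(not_found),
}
# Fields that are still None point at selectors that matched nothing.
missing_fields = [key for key, value in book.items() if value is None]
print(book, missing_fields)
```

With a real soup the call would look like text_or_default(book_soup.select_one('h1')), and a non-empty missing_fields list shows which selectors need another look.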

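【Comments】:

It helps me get rid of the None error, but I am not getting any output.
Since some elements cannot be found, the function returns an empty string for them.
So how does that help my output?
None of the elements are found; all of them are empty.
print(requests.get(link).text) and check: do you see the data you are looking for?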