【Question title】: Problems extracting text with the BeautifulSoup function
【Posted】: 2021-10-22 10:20:59
【Question】:

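I'm working through some simple web-scraping tutorials, but I'm finding it hard to make progress.

In particular, 'title' is the only element whose text I can extract. For the remaining ones, 'price' and 'status', I always get the same error: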

AttributeError: 'NoneType' object has no attribute 'text'

import requests
from bs4 import BeautifulSoup
import pandas as pd
  
url = 'https://www.ebay.it/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=monitor&_sacat=0'
   
def get_data(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def parse(soup):
    productlist = []
    results = soup.find_all('div', {'class' : 's-item__info clearfix'})
    for item in results:   
        product = {
            'title': item.find('h3', {'class': 's-item__title'}).text,
            'price': float(item.find('span', {'class': 's-item__price'}).text.replace('EUR','').strip()),
            'status': item.find('span',{'class':'SECONDARY_INFO'}).text,
        }
        productlist.append(product)
    return productlist



def output(productlist):
    productsdf = pd.DataFrame(productlist)
    productsdf.to_csv('output.csv', index = False)
    print('Saved to CSV')
    return  productsdf

soup = get_data(url)
productlist = parse(soup)
ug = output(productlist)

Thanks to anyone willing to help.

【Comments】:

    Tags: python-3.x web-scraping beautifulsoup


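    【Solution 1】:

    Change the selector that picks all the items, so that only containers inside the element with id srp-river-results are matched. find() returns None when a container has no matching child, and calling .text on None is what raises the AttributeError. The price string also needs its decimal comma replaced with a dot before float() can parse it: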

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    url = "https://www.ebay.it/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=monitor&_sacat=0"
    
    
    def get_data(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "html.parser")
        return soup
    
    
    def parse(soup):
        productlist = []
        results = soup.select("#srp-river-results .s-item__info")  # <-- change here
        for item in results:
            product = {
                "title": item.find("h3", {"class": "s-item__title"}).text,
                "price": float(
                    item.find("span", {"class": "s-item__price"})
                    .text.replace("EUR", "")
                    .replace(",", ".")
                    .strip()
                    .split()[0]
                ),
                "status": item.find("span", {"class": "SECONDARY_INFO"}).text,
            }
            productlist.append(product)
        return productlist
    
    
    def output(productlist):
        productsdf = pd.DataFrame(productlist)
        # productsdf.to_csv("output.csv", index=False)
        # print("Saved to CSV")
        return productsdf
    
    
    soup = get_data(url)
    productlist = parse(soup)
    ug = output(productlist)
    print(ug)
    

    Prints:

                                                                                                title   price              status
    0                                                 FASCIO a due monitor 2 x 17" Dual stand incluso   65.26      Ricondizionato
    1                MONITOR USATO RICONDIZIONATO DA 17" 19" 22" SCHERMO LCD PER PC O DVR VARI MARCHI   35.00      Ricondizionato
    2                                Terra LCD/LED monitor 27" 2760w, Earphone, audio, HDMI, DVI, VGA   20.00     Di seconda mano
    3                   Nuova inserzione22" LG Business monitor LED TFT 55,9 cm Nero USB ALTOPARLANTI   45.90      Ricondizionato
    4                    LG 24mb56hq-b 60cm 24" IPS MONITOR LED HDMI VGA 5ms altezza regolabile, VESA   25.50     Di seconda mano
    5                    MONITOR PC HP 22" ELITEDISPLAY E222 1920X1080 LED HD HDMI VGA DP USB GRADO A   80.00     Di seconda mano
    6                Lenovo ThinkCentre tio24gen3 23,8 pollici Full HD IPS Monitor Led-Nero Nuovo OVP   66.00       Nuovo (Altro)
    7                                        DELL E2216H 22" LED-LCD (TFT) TN FHD (1080p) del monitor   39.55      Ricondizionato
    
    ...
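
If some containers may still lack a price or status, one defensive option is to check for None before reading the text. The following is only a minimal sketch, not part of the answer above: the HTML snippet is made up to stand in for a results page, and text_or_none and parse_price are hypothetical helper names introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a results page. The second block has no
# price or status, like the items that triggered the AttributeError.
HTML = """
<div class="s-item__info">
  <h3 class="s-item__title">Monitor A</h3>
  <span class="s-item__price">EUR 65,26</span>
  <span class="SECONDARY_INFO">Ricondizionato</span>
</div>
<div class="s-item__info">
  <h3 class="s-item__title">Shop on eBay</h3>
</div>
"""


def text_or_none(item, name, css_class):
    """Return the stripped text of the first matching tag, or None."""
    tag = item.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


def parse_price(raw):
    """Convert a string such as 'EUR 65,26' to a float, or return None."""
    if raw is None:
        return None
    cleaned = raw.replace("EUR", "").replace(",", ".").strip()
    try:
        # Keep only the first token, as the answer's code does.
        return float(cleaned.split()[0])
    except (IndexError, ValueError):
        return None


def parse(soup):
    products = []
    for item in soup.select(".s-item__info"):
        products.append(
            {
                "title": text_or_none(item, "h3", "s-item__title"),
                "price": parse_price(text_or_none(item, "span", "s-item__price")),
                "status": text_or_none(item, "span", "SECONDARY_INFO"),
            }
        )
    return products


products = parse(BeautifulSoup(HTML, "html.parser"))
for product in products:
    print(product)
```

With this approach incomplete items end up with None values instead of stopping the whole run, and they can be filtered out afterwards, for example with DataFrame.dropna.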
    

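    【Discussion】:

    • Thanks for the reply, @Andrej Kesely. It works fine, but I'd like to understand how you managed to select the right 'results' on the page. I'm using Selector Gadget to help me pick the elements, but I can't get it right. For example, if I wanted to try the same thing on Amazon, how would I go about it?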
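
One possible way to investigate a selector, offered as a sketch rather than as the method the answerer actually used: count how many matched containers lack an expected child, then narrow the selector until that count reaches zero. The HTML below is invented for illustration (it assumes one placeholder item sits outside the results container), and count_missing is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up page layout: one placeholder item outside the results container,
# two real listings inside it.
HTML = """
<div class="s-item__info"><h3 class="s-item__title">Shop on eBay</h3></div>
<div id="srp-river-results">
  <div class="s-item__info">
    <h3 class="s-item__title">Monitor A</h3>
    <span class="s-item__price">EUR 65,26</span>
  </div>
  <div class="s-item__info">
    <h3 class="s-item__title">Monitor B</h3>
    <span class="s-item__price">EUR 35,00</span>
  </div>
</div>
"""


def count_missing(soup, container_selector, child_selector):
    """Return (containers matched, containers without the expected child)."""
    containers = soup.select(container_selector)
    missing = [c for c in containers if c.select_one(child_selector) is None]
    return len(containers), len(missing)


soup = BeautifulSoup(HTML, "html.parser")

# The broad selector also matches the placeholder, which has no price.
broad = count_missing(soup, ".s-item__info", ".s-item__price")
# Scoping to the results container leaves only complete listings.
narrow = count_missing(soup, "#srp-river-results .s-item__info", ".s-item__price")

print(broad)   # (3, 1)
print(narrow)  # (2, 0)
```

The same check can be run against any other site's markup by swapping in its container and child selectors, which would have to be found by inspecting that page's HTML.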