【Question title】: How do I web scrape properly to get all data easily?
【Posted】: 2021-09-05 08:26:24
【Question】:

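I am new to web scraping. I am trying to get some pub ratings, and I would also like to collect as much data as possible from the Yelp results page.

Here is my code: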

from time import sleep

import requests
from bs4 import BeautifulSoup

pub_ratings = []
pub_reviews = []
pub_names = []
num_reviews = []

#for loop for all pages

for i in range(0,240,10):       
    url = "https://www.yelp.ie/search?find_desc=Pubs+%26+Bars&find_loc=london&ns=1&start={}".format(i)
    r = requests.get(url)
    soup_240 = BeautifulSoup(r.content, 'html.parser')
    sleep(1)
    
    all_data = soup_240.findAll('div', class_="container__09f24__21w3G hoverable__09f24__2nTf3 margin-t3__09f24__5bM2Z margin-b3__09f24__1DQ9x padding-t3__09f24__-R_5x padding-r3__09f24__1pBFG padding-b3__09f24__1vW6j padding-l3__09f24__1yCJf border--top__09f24__8W8ca border--right__09f24__1u7Gt border--bottom__09f24__xdij8 border--left__09f24__rwKIa border-color--default__09f24__1eOdn")



#filling them with data

    for data in all_data:
        
        pub_names.append(data.find('a', class_='css-166la90').get_text(separator=' '))  
        num_reviews.append(data.find('span',class_='reviewCount__09f24__EUXPN css-e81eai').get_text(separator=' '))
        pub_ratings.append(data.find('div', aria_label="").get_text(separator=' '))

This is the error I get:

AttributeError: 'NoneType' object has no attribute 'get_text'

【Question discussion】:

    Tags: web web-scraping


    【Solution 1】:

    The data is embedded in the page as JSON. To parse it, you can use the following example:

    import json
    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.yelp.ie/search?find_desc=Pubs+%26+Bars&find_loc=london&ns=1"
    
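    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # The script's content is parsed a second time to unwrap the JSON text
    # (it appears to be wrapped in an HTML comment inside the <script> tag).
    data = BeautifulSoup(
        soup.select_one('script[type="application/json"]').contents[0],
        "html.parser",
    ).contents[0]
    data = json.loads(data)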
    
    # uncomment to print all data:
    # print(json.dumps(data, indent=4))
    
    
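    # Recursively walk the nested JSON and yield every business entry,
    # i.e. the "searchResultBusiness" value of each dict that has a "bizId" key.
    def search_biz(d):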
        if isinstance(d, dict):
            if "bizId" in d:
                yield d["searchResultBusiness"]
            else:
                for v in d.values():
                    yield from search_biz(v)
        elif isinstance(d, list):
            for v in d:
                yield from search_biz(v)
    
    
    for b in search_biz(data):
        print(b["name"])
        print(
            "Rating: {}\nAddress: {}\nPhone: {}\n".format(
                b["rating"], b["formattedAddress"], b["phone"]
            )
        )
    

    Prints:

    The Harp
    Rating: 4.5
    Address: 47 Chandos Place
    Phone: 020 7836 0291
    
    Cahoots Bar
    Rating: 4.5
    Address: 13 Kingly Court
    Phone: 020 7352 6200
    
    The Monkey Puzzle
    Rating: 4.5
    Address: 30 Southwick Street
    Phone: 020 7723 0143
    
    The Crobar
    Rating: 4.5
    Address: 17 Manette Street
    Phone: 020 7439 0831
    
    The Queen’s Head
    Rating: 4
    Address: 15 Denman Street
    Phone: 020 7437 1540
    
    The Queens Arms
    Rating: 4.5
    Address: 11 Warwick Way
    Phone: 020 7834 3313
    
    The Cauldron
    Rating: 4.5
    Address: 79 Stoke Newignton Road
    Phone: 0117 456 2442
    
    Coach and Horses
    Rating: 4
    Address: 5 Bruton Street
    Phone: 020 7629 4123
    
    The Victoria
    Rating: 4.5
    Address: 10a Strathearn Place
    Phone: 020 7724 1191
    
    The Ordnance
    Rating: 4
    Address: 29 Ordnance Hill
    Phone: 020 7722 0278
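
As for the AttributeError in the question: `find()` returns `None` when nothing matches, so calling `.get_text()` on the result fails for any card that lacks the element. Note also that `aria_label=""` looks for an attribute literally named `aria_label`; to match `aria-label` the name has to be passed through `attrs`. Below is a minimal sketch of a None-safe version. The helper names `text_or_none` and `parse_card` are hypothetical, the class names are copied from the question and may have changed on the site since, and reading the rating from the `aria-label` attribute is an assumption about Yelp's markup.

```python
def text_or_none(tag, separator=" "):
    """Return the tag's stripped text, or None when find() matched nothing."""
    if tag is None:
        return None
    return tag.get_text(separator=separator).strip()


def parse_card(card):
    """Extract fields from one result card (a bs4 Tag); missing fields become None."""
    # attrs={"aria-label": True} matches any <div> that has an aria-label
    # attribute; the keyword form aria_label=... would not match it.
    rating_tag = card.find("div", attrs={"aria-label": True})
    return {
        # Class names taken from the question (assumed, likely to change).
        "name": text_or_none(card.find("a", class_="css-166la90")),
        "num_reviews": text_or_none(
            card.find("span", class_="reviewCount__09f24__EUXPN")
        ),
        # Assumption: the rating text lives in the attribute, not in the tag body.
        "rating": rating_tag["aria-label"] if rating_tag is not None else None,
    }
```

In the question's loop this would be used as `rows = [parse_card(data) for data in all_data]`, which keeps the three fields of each pub together instead of appending to separate lists that can drift out of step when one lookup fails.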
    
    

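    【Discussion】:

    • Thanks for your help, but I still get an error: name 'headers' is not defined. Also, I want to pack this data into a DataFrame and get data from all 24 pages, so can I use a for loop with range (like the one at the top of my code), or would you suggest something else?
    • @Wizard You can remove headers=headers, I copied it by mistake. It works without it.
    • @Wizard You can use the &start={} parameter in the URL to iterate over the results and store them in a DataFrame.
    • Does this code work on your machine? I am getting another error: AttributeError: 'Response' object has no attribute 'contents'.
    • It finally worked. I was getting that error because I had been blocked from accessing the site :)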
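
The pagination idea from the comments could be sketched roughly as follows. It reuses `search_biz` from the answer and the `start` offsets from the question (0, 10, ..., 230 for 24 pages). The names `page_urls`, `to_row`, `scrape_page` and `scrape_all` are hypothetical, the JSON keys are the ones printed in the answer, and returning an empty list when the script tag is missing is an assumption about what a blocked response looks like. The third-party imports are kept inside the functions that need them so the pure helpers can be tried without network access.

```python
import json
import time

BASE_URL = (
    "https://www.yelp.ie/search?find_desc=Pubs+%26+Bars"
    "&find_loc=london&ns=1&start={}"
)


def page_urls(total=240, step=10):
    """Build one URL per results page: start=0, 10, ..., total - step."""
    return [BASE_URL.format(start) for start in range(0, total, step)]


def search_biz(d):
    """Recursively yield every business entry in the nested JSON."""
    if isinstance(d, dict):
        if "bizId" in d:
            yield d["searchResultBusiness"]
        else:
            for v in d.values():
                yield from search_biz(v)
    elif isinstance(d, list):
        for v in d:
            yield from search_biz(v)


def to_row(b):
    """Flatten one business dict into a row; absent keys become None."""
    return {
        "name": b.get("name"),
        "rating": b.get("rating"),
        "address": b.get("formattedAddress"),
        "phone": b.get("phone"),
    }


def scrape_page(url):
    """Download one results page and return its rows (empty if no JSON found)."""
    import requests
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    script = soup.select_one('script[type="application/json"]')
    if script is None:  # assumed to happen when the request is blocked
        return []
    raw = BeautifulSoup(script.contents[0], "html.parser").contents[0]
    return [to_row(b) for b in search_biz(json.loads(raw))]


def scrape_all(total=240, step=10, delay=1.0):
    """Scrape every page, pausing between requests, and return a DataFrame."""
    import pandas as pd

    rows = []
    for url in page_urls(total, step):
        rows.extend(scrape_page(url))
        time.sleep(delay)  # be polite to the server
    return pd.DataFrame(rows)
```

Calling `df = scrape_all()` would then give one row per pub with the columns `name`, `rating`, `address` and `phone`. Collecting plain dicts first and building the DataFrame once at the end tends to be simpler than growing a DataFrame inside the loop.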