How do I web scrape properly to get all data easily? Answers

【Question title】: How do I web scrape properly to get all data easily?
【Posted】: 2021-09-05 08:26:24
【Question】:

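I am new to web scraping. I am trying to get some pub ratings (pub_ratings). I would also like to get as much data as possible from the Yelp page.

Here is my code: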

pub_ratings = []
pub_reviews = []
pub_names = []
num_reviews = []

#for loop for all pages

for i in range(0,240,10):       
    url = "https://www.yelp.ie/search?find_desc=Pubs+%26+Bars&find_loc=london&ns=1&start={}".format(i)
    r = requests.get(url)
    soup_240 = BeautifulSoup(r.content, 'html.parser')
    sleep(1)
    
    all_data = soup_240.findAll('div', class_="container__09f24__21w3G hoverable__09f24__2nTf3 margin-t3__09f24__5bM2Z margin-b3__09f24__1DQ9x padding-t3__09f24__-R_5x padding-r3__09f24__1pBFG padding-b3__09f24__1vW6j padding-l3__09f24__1yCJf border--top__09f24__8W8ca border--right__09f24__1u7Gt border--bottom__09f24__xdij8 border--left__09f24__rwKIa border-color--default__09f24__1eOdn")



#filling them with data

    for data in all_data:
        
        pub_names.append(data.find('a', class_='css-166la90').get_text(separator=' '))  
        num_reviews.append(data.find('span',class_='reviewCount__09f24__EUXPN css-e81eai').get_text(separator=' '))
        pub_ratings.append(data.find('div', aria_label="").get_text(separator=' '))

This is the error I get:

AttributeError: 'NoneType' object has no attribute 'get_text'

【Comments】:

Tags: web web-scraping

【Solution 1】:

The data is embedded in the page as JSON. To parse it, you can use the following example:

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.yelp.ie/search?find_desc=Pubs+%26+Bars&find_loc=london&ns=1"

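soup = BeautifulSoup(requests.get(url).content, "html.parser")

# The <script type="application/json"> tag holds the search results.
# Its text is parsed a second time so that .contents[0] yields the bare
# JSON string (presumably because the payload is wrapped in an HTML comment).
data = BeautifulSoup(
    soup.select_one('script[type="application/json"]').contents[0],
    "html.parser",
).contents[0]
data = json.loads(data)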

# uncomment to print all data:
# print(json.dumps(data, indent=4))


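def search_biz(d):
    """Walk the decoded JSON recursively and yield each business record."""
    if isinstance(d, dict):
        if "bizId" in d:
            # a dict carrying "bizId" is one search result
            yield d["searchResultBusiness"]
        else:
            for v in d.values():
                yield from search_biz(v)
    elif isinstance(d, list):
        for v in d:
            yield from search_biz(v)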


for b in search_biz(data):
    print(b["name"])
    print(
        "Rating: {}\nAddress: {}\nPhone: {}\n".format(
            b["rating"], b["formattedAddress"], b["phone"]
        )
    )

Prints:

The Harp
Rating: 4.5
Address: 47 Chandos Place
Phone: 020 7836 0291

Cahoots Bar
Rating: 4.5
Address: 13 Kingly Court
Phone: 020 7352 6200

The Monkey Puzzle
Rating: 4.5
Address: 30 Southwick Street
Phone: 020 7723 0143

The Crobar
Rating: 4.5
Address: 17 Manette Street
Phone: 020 7439 0831

The Queen’s Head
Rating: 4
Address: 15 Denman Street
Phone: 020 7437 1540

The Queens Arms
Rating: 4.5
Address: 11 Warwick Way
Phone: 020 7834 3313

The Cauldron
Rating: 4.5
Address: 79 Stoke Newignton Road
Phone: 0117 456 2442

Coach and Horses
Rating: 4
Address: 5 Bruton Street
Phone: 020 7629 4123

The Victoria
Rating: 4.5
Address: 10a Strathearn Place
Phone: 020 7724 1191

The Ordnance
Rating: 4
Address: 29 Ordnance Hill
Phone: 020 7722 0278
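
A side note on the AttributeError in the question, offered as a sketch rather than as part of the answer above: find() returns None when nothing matches, so calling .get_text() on the result fails. Two likely causes in the question's code are that the generated class names no longer match the page, and that aria_label="" is read as an attribute literally named aria_label; hyphenated names such as aria-label have to be passed through attrs. The markup and the helper text_or_none below are hypothetical and only stand in for a real result card.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for search results; the real Yelp HTML differs.
SAMPLE_HTML = """
<div class="result">
  <a class="name">The Harp</a>
  <div aria-label="4.5 star rating"></div>
</div>
<div class="result">
  <a class="name">No Rating Pub</a>
</div>
"""


def text_or_none(tag):
    """Return the tag's text, or None when find() matched nothing."""
    return tag.get_text(separator=" ", strip=True) if tag is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for result in soup.find_all("div", class_="result"):
    name = text_or_none(result.find("a", class_="name"))
    # Hyphenated attribute names go through attrs; True matches any value.
    rating_tag = result.find("div", attrs={"aria-label": True})
    rating = rating_tag["aria-label"] if rating_tag is not None else None
    rows.append((name, rating))

print(rows)
```

With this pattern a missing element produces None in the row instead of stopping the whole loop.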

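【Comments】:

Thanks for your help, but I still get an error: name 'headers' is not defined. Also, I would like to put this data into a DataFrame and get the data from all 24 pages, so can I use a for loop with range (like the one at the top of my code), or would you suggest something else?
@Wizard You can remove headers=headers, I copied it by mistake. It works without it.
@Wizard You can use the &start={} parameter in the URL to iterate over the results and store them in a DataFrame.
Does this code work on your PC? I am getting another error: AttributeError: 'Response' object has no attribute 'contents'.
It finally worked. I was getting this error because I was blocked from accessing the site :)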
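
The loop suggested in the comments could be sketched as follows. This is only an outline under assumptions: fetch_businesses is a hypothetical callable that wraps the answer's request, json.loads and search_biz steps for one URL, the column names are the keys printed in the answer, and the 240-result range and one-second pause are taken from the question's code.

```python
import time

import pandas as pd

BASE_URL = (
    "https://www.yelp.ie/search"
    "?find_desc=Pubs+%26+Bars&find_loc=london&ns=1&start={}"
)


def page_urls(total=240, step=10):
    """Build one search URL per results page (start=0, 10, ..., total - step)."""
    return [BASE_URL.format(start) for start in range(0, total, step)]


def to_dataframe(businesses):
    """Keep the fields printed in the answer; missing keys are left empty."""
    columns = ["name", "rating", "formattedAddress", "phone"]
    rows = [{c: b.get(c) for c in columns} for b in businesses]
    return pd.DataFrame(rows, columns=columns)


def scrape_all(fetch_businesses, total=240, step=10, delay=1.0):
    """Collect the businesses from every page into a single DataFrame.

    fetch_businesses(url) is a hypothetical callable returning an iterable
    of business dicts for one results page.
    """
    found = []
    for url in page_urls(total, step):
        found.extend(fetch_businesses(url))
        time.sleep(delay)  # pause between requests, as in the question's loop
    return to_dataframe(found)
```

Calling scrape_all with a function built from the answer's code would return one DataFrame covering all 24 pages; df.to_csv("pubs.csv", index=False) would then save it.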