【Question title】: Zillow scraper: Why can't I scrape the full listing from a Zillow search?
【Posted】: 2021-09-26 12:45:35
【Question description】:

I am trying to explore Zillow housing data for analysis, but the data I scrape from Zillow contains far fewer records than the site lists.

Here is an example:

I tried to pull the homes for sale in ZIP code 35216: https://www.zillow.com/birmingham-al-35216/?searchQueryState=%7B%22usersSearchTerm%22%3A%2235216%22%2C%22mapBounds%22%3A%7B%22west%22%3A-86.93997505787829%2C%22east%22%3A-86.62926796559313%2C%22south%22%3A33.33562772711966%2C%22north%22%3A33.51819716059094%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A73386%2C%22regionType%22%3A7%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22sort%22%3A%7B%22value%22%3A%22globalrelevanceex%22%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A13%2C%22pagination%22%3A%7B%7D%7D

The page shows 76 records. With the Google Chrome extension Zillow-to-excel, all 76 homes in the list can be scraped. https://chrome.google.com/webstore/detail/zillow-to-excel/aecdekdgjlncaadbdiciepplaobhcjgi/related

But when I scrape the Zillow data with Python requests, I only get 18-20 records. Here is my code:

import requests
import json
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np

cnt=0
stop_check=0
ele=[]
url='https://www.zillow.com/birmingham-al-35216/'
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-TW;q=0.6',
    'upgrade-insecure-requests': '1',
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
for i in range(1,2):
    params = {
    'searchQueryState':'{"pagination":{"currentPage":'+str(i)+'},"usersSearchTerm":"35216","mapBounds":{"west":-86.83314614582643,"east":-86.73781685417354,"south":33.32843303639682,"north":33.511017584543204},"regionSelection":[{"regionId":73386,"regionType":7}],"isMapVisible":true,"filterState":{"sort":{"value":"globalrelevanceex"},"ah":{"value":true}},"isListVisible":true,"mapZoom":13}'
    }
    page=requests.get(url, headers=headers,params=params,timeout=2)
    sp=soup(page.content, 'lxml')
    lst=sp.find_all('address',{'class':'list-card-addr'})
    ele.extend(lst)
    print(i, len(lst))
    if len(lst)==0:
        stop_check+=1
    if stop_check>=3:
        print('stop on three empty')
        break

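The headers and params were copied from the site using Chrome DevTools. I also tried other searches and found that I can only scrape the first 9-11 records on each page.

I know there is a Zillow API, but it cannot be used for a general search such as all homes in a ZIP code. So I want to try web scraping.

Could I get some advice on how to fix my code?

Thank you very much!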

【Question comments】:

    Tags: python web-scraping zillow


    【Solution 1】:

    You can try this:

    import requests
    import json
    
    url = 'https://www.zillow.com/search/GetSearchPageState.htm'
    
    headers = {
        'Accept': '*/*',
        'Accept-Encoding': 'gzip, deflate',  # 'br' is only decoded by requests if the optional brotli package is installed
        'Accept-Language': 'en-US,en;q=0.9',
        'upgrade-insecure-requests': '1',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
    }
    
    houses = []
    for page in range(1, 3):
        params = {
            "searchQueryState": json.dumps({
                "pagination": {"currentPage": page},
                "usersSearchTerm": "35216",
                "mapBounds": {
                    "west": -86.97413567189196,
                    "east": -86.57244804982165,
                    "south": 33.346263857015515,
                    "north": 33.48754107532057
                },
                "mapZoom": 12,
                "regionSelection": [
                    {
                        "regionId": 73386, "regionType": 7
                    }
                ],
                "isMapVisible": True,
                "filterState": {
                    "isAllHomes": {
                        "value": True
                    },
                    "sortSelection": {
                        "value": "globalrelevanceex"
                    }
                },
                "isListVisible": True
            }),
            "wants": json.dumps(
                {
                    "cat1": ["listResults", "mapResults"],
                    "cat2": ["total"]
                }
            ),
            "requestId": 3
        }
    
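        # send request (named `response` so it does not reuse the loop variable `page`)
        response = requests.get(url, headers=headers, params=params, timeout=10)
        response.raise_for_status()
    
        # get json data
        json_data = response.json()
    
        # loop over the results on this page
        for house in json_data['cat1']['searchResults']['listResults']:
            houses.append(house)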
    
    
    # show data
    print('Total houses - {}'.format(len(houses)))
    
    # show info in houses
    for house in houses:
        if 'brokerName' in house:
            print('{}: {}'.format(house['brokerName'], house['price']))
        else:
            print('No broker: {}'.format(house['price']))
    
    Total houses - 76
    RealtySouth-MB-Crestline: $424,900
    eXp Realty, LLC Central: $259,900
    ARC Realty Mountain Brook: $849,000
    Ray & Poynor Properties: $499,900
    Hinge Realty: $1,550,000
    ...
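
Since the goal in the question is analysis, the price strings printed above usually need to become numbers first. The following is a minimal sketch, assuming prices look like the ones in the output (`$424,900`); the helper name `parse_price` and the sample records are hypothetical, and any other format (for example abbreviated prices) is deliberately returned as `None` rather than guessed.

```python
import re
from statistics import median

# Accepts strings like "$424,900" or "424900"; anything else is rejected.
_PRICE_PATTERN = re.compile(r"\$?\s*(\d[\d,]*)")


def parse_price(text):
    """Convert a price string such as '$424,900' to an int, or return None."""
    match = _PRICE_PATTERN.fullmatch((text or "").strip())
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# Hypothetical sample shaped like the records printed above (two fields only).
sample = [
    {"brokerName": "RealtySouth-MB-Crestline", "price": "$424,900"},
    {"brokerName": "eXp Realty, LLC Central", "price": "$259,900"},
    {"brokerName": "Hinge Realty", "price": "$1,550,000"},
]

# Keep only the prices that could be parsed.
prices = [
    value
    for value in (parse_price(house.get("price")) for house in sample)
    if value is not None
]
print("Median price:", median(prices))
```

With the real data, `sample` would be replaced by the `houses` list collected by the loop in the answer.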
    

    P.S. If this helped you, please don't forget to mark the answer as correct :)
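
The loop in the answer hardcodes `range(1, 3)`, which happens to cover this search. A more general approach is to keep requesting pages until one comes back with no list results. The sketch below shows only that control flow: `collect_all_pages`, `fetch_page` and `fake_fetch` are hypothetical names, the response shape (`cat1` / `searchResults` / `listResults`) is taken from the answer's code, and the assumption that the endpoint returns an empty list past the last page is not confirmed by the thread, which is why a `max_pages` cap is kept.

```python
def collect_all_pages(fetch_page, max_pages=20):
    """Call fetch_page(1), fetch_page(2), ... and stop at the first empty page.

    fetch_page is any callable that takes a page number and returns the
    decoded JSON dict for that page.
    """
    houses = []
    for page_number in range(1, max_pages + 1):
        json_data = fetch_page(page_number)
        results = (
            json_data.get('cat1', {})
            .get('searchResults', {})
            .get('listResults', [])
        )
        if not results:
            break  # no more listings (assumed behaviour past the last page)
        houses.extend(results)
    return houses


def fake_fetch(page_number):
    """Offline stand-in for the real HTTP request, used for illustration."""
    pages = {
        1: [{'price': '$424,900'}, {'price': '$259,900'}],
        2: [{'price': '$849,000'}],
    }
    return {'cat1': {'searchResults': {'listResults': pages.get(page_number, [])}}}


all_houses = collect_all_pages(fake_fetch)
print('Total houses - {}'.format(len(all_houses)))
```

In practice `fetch_page` would wrap the `requests.get(...)` call from the answer and return the decoded JSON; passing it in as a parameter keeps the stopping logic testable without network access.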

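    【Comments】:

    • Thank you very much, @t4kq! Is there a tutorial on the detailed search query parameters? If so, that would be great!
    • Just use DevTools and you will find the answers to all your questions :)
    • Thanks for your help!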
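
As a complement to inspecting requests in DevTools, the query parameters can also be read straight out of a search URL copied from the browser, because `searchQueryState` is URL-encoded JSON. This is a minimal sketch using only the standard library; `extract_search_query_state` is a hypothetical helper, and `example_url` is a shortened version of the URL from the question, kept short for illustration.

```python
import json
from urllib.parse import urlsplit, parse_qs


def extract_search_query_state(url):
    """Return the decoded searchQueryState dict from a search URL, or None."""
    query = parse_qs(urlsplit(url).query)  # parse_qs also undoes the %xx encoding
    raw_values = query.get("searchQueryState")
    if not raw_values:
        return None
    return json.loads(raw_values[0])


# Shortened example: only a few of the keys from the question's URL are kept.
example_url = (
    "https://www.zillow.com/birmingham-al-35216/?searchQueryState="
    "%7B%22usersSearchTerm%22%3A%2235216%22%2C"
    "%22regionSelection%22%3A%5B%7B%22regionId%22%3A73386%2C%22regionType%22%3A7%7D%5D%2C"
    "%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22globalrelevanceex%22%7D%7D%2C"
    "%22mapZoom%22%3A13%7D"
)

state = extract_search_query_state(example_url)
print(json.dumps(state, indent=2))
```

The resulting dict can be edited (for example, by setting a `pagination` entry as the answer does) and passed back through `json.dumps` when building the request parameters.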