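【Question title】: Scraping elements with the same tag and without class and id attributes
【Posted】: 2021-05-22 03:35:42
【Question】:

I want to get the number of bedrooms, the number of bathrooms, and the land area separately for each property on a real-estate web page. However, all of these values sit in the same tag, <strong>, with no class and no id. So when I write the following code: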

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
url = "https://www.realestate.co.nz/residential/sale/auckland?oad=true&pm=1"
response = requests.get(url, headers=headers)
content = BeautifulSoup(response.content, "lxml")

# Every <strong> on the page that has neither a class nor an id
rooms = content.find_all('strong', class_=False, id=False)
for room in rooms:
    print(room.text)

I get the following output:

Sign up
2
2
2
2
3
2
4
3
2.4ha
2
1
2
2
4
3
465m2
1
1
3
2
1
1
5
3
10.1ha
3
2
5
5
600m2
600m2
4
2
138m2
2
1
2
1
2
2
3
2
675m2
2
1

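As you can see, all the values come out mixed together because they share the same tag. Can someone help me get them separately? Thanks!

【Comments】:

  • Can you share a bit of the HTML? These values are probably all inside one div; try targeting that.

Tags: python web-scraping beautifulsoup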


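【Solution 1】:

Find the main tile, i.e. the div tag that holds the information about one property. Some tiles are missing data such as the area or the bathroom count, so you can try this approach: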

from bs4 import BeautifulSoup
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
url = "https://www.realestate.co.nz/residential/sale/auckland?oad=true&pm=1"
response = requests.get(url, headers=headers)
content = BeautifulSoup(response.content, "lxml")

# Each listing is wrapped in a <div data-test="tile">
rooms = content.find_all('div', attrs={'data-test': "tile"})
for room in rooms:
    # Only the class-less <strong> tags inside this tile
    apart = room.find_all('strong', class_=False)
    # Start from "NA" so values from a previous tile never carry over
    dict1 = {'bedroom': "NA", 'bathroom': "NA", 'area': "NA"}
    if len(apart) == 3:
        dict1['bedroom'] = apart[0].text
        dict1['bathroom'] = apart[1].text
        dict1['area'] = apart[2].text
    elif len(apart) == 2:
        # Assumes the two values are bedrooms and bathrooms
        dict1['bedroom'] = apart[0].text
        dict1['bathroom'] = apart[1].text
    elif apart:
        # Otherwise the first value is taken to be the area
        dict1['area'] = apart[0].text
    print(dict1)

Output:

{'bedroom': '2', 'bathroom': '2', 'area': 'NA'}
{'bedroom': '2', 'bathroom': '2', 'area': 'NA'}
{'bedroom': '3', 'bathroom': '2', 'area': 'NA'}
{'bedroom': '4', 'bathroom': '3', 'area': '2.4ha'}
{'bedroom': '2', 'bathroom': '1', 'area': 'NA'}
...
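
The length-based branches above assume that two values always mean bedrooms and bathrooms. As a hedged alternative that is not part of the original answer, the values of one tile could instead be classified by their text: anything ending in a unit is treated as the area. The function name classify_values and the unit pattern are assumptions for illustration, based only on the sample output shown in the question (values such as 465m2 and 2.4ha).

```python
import re

# Assumption: areas look like "465m2" or "2.4ha", as in the question's output.
AREA_PATTERN = re.compile(r"^[\d.,]+\s*(?:m2|m²|ha)$", re.IGNORECASE)


def classify_values(values):
    """Split the <strong> texts of one tile into bedroom, bathroom and area."""
    record = {"bedroom": "NA", "bathroom": "NA", "area": "NA"}
    counts = []
    for value in values:
        value = value.strip()
        if AREA_PATTERN.match(value):
            record["area"] = value
        elif value.isdigit():
            counts.append(value)
    # Assumption: when present, the bedroom count precedes the bathroom count.
    for key, value in zip(("bedroom", "bathroom"), counts):
        record[key] = value
    return record


print(classify_values(["4", "3", "2.4ha"]))
print(classify_values(["2", "2"]))
print(classify_values(["3", "675m2"]))
```

Inside the loop over tiles, this could be called as classify_values([s.text for s in apart]). A tile that lists only one room count is still ambiguous, so the icon-based selection in the next answer is the more reliable route when the markup offers it.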

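【Comments】:

  • Hi Bhavya, thank you very much for the time and effort you put into solving my problem. It works and is easy to understand!
  • Oh great, you accepted my answer, thanks!
【Solution 2】: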

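I would loop over the main tiles and select each target node within a tile, e.g. by the unique class of the icon that precedes it in that tile's HTML. Where a value is missing, an if/else test for "is not None" supplies a default. To handle different sort orders I also added a try/except. I used sort by latest, but also tested with your sort order.

I added a few more items to provide context. Extending this to loop over pages is straightforward, but that is beyond the scope of your question; once you have tried extending it (if needed), it would be a candidate for a new question.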

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np

# URL from the question (its own sort order), also tested:
# 'https://www.realestate.co.nz/residential/sale/auckland?oad=true&pm=1'

r = requests.get('https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&pm=1',
                  headers = {'User-Agent':'Mozilla/5.0'}).text
soup = bs(r, 'lxml')
main_listings = soup.select('.listing-tile')
base = 'https://www.realestate.co.nz/4016546/residential/sale/'
results = {}

for listing in main_listings:
    
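    try:
        # The date is the text node that follows the <span> inside .listed-date
        date = listing.select_one('.listed-date > span').next_sibling.strip()
    except (AttributeError, TypeError):
        # With other sort orders that <span> is absent; use the whole text instead
        date = listing.select_one('.listed-date').text.strip()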

    title = listing.select_one('h3').text.strip()
    listing_id = listing.select_one('a')['id']
    url = base + listing_id
    
    bedrooms = listing.select_one('.icon-bedroom + strong')
    
    if bedrooms is not None:
        bedrooms = int(bedrooms.text)
    else:
        bedrooms = np.nan
    
    bathrooms = listing.select_one('.icon-bathroom + strong')
    
    if bathrooms is not None:
        bathrooms = int(bathrooms.text)
    else:
        bathrooms = np.nan
    
    # Leading dot added: without it the selector looks for an <icon-land-area> tag
    land_area = listing.select_one('.icon-land-area + strong')
    
    if land_area is not None:
        land_area = land_area.text
    else:
        land_area = "Not specified"
    
    price = listing.select_one('.text-right').text
    
    results[listing_id] = [date, title,  url, bedrooms, bathrooms, land_area, price]
    
df = pd.DataFrame(results).T
df.columns = ['Listing Date', 'Title', 'Url', '#Bedroom', '#Bathrooms', 'Land Area', 'Price']
print(df)
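
The three if/else blocks above share one pattern: select a node, then fall back to a default when it is missing. Below is a minimal sketch of that pattern as a helper, run against a small hand-written HTML snippet so that it works offline. The helper name text_or_default and the sample markup are hypothetical; only the class names .listing-tile, .icon-bedroom, .icon-bathroom and .icon-land-area come from the answer, and the real page may be structured differently. None stands in for np.nan here to keep the sketch free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking two listing tiles; the second has no land area.
SAMPLE_HTML = """
<div class="listing-tile">
  <span class="icon-bedroom"></span><strong>4</strong>
  <span class="icon-bathroom"></span><strong>3</strong>
  <span class="icon-land-area"></span><strong>2.4ha</strong>
</div>
<div class="listing-tile">
  <span class="icon-bedroom"></span><strong>2</strong>
  <span class="icon-bathroom"></span><strong>1</strong>
</div>
"""


def text_or_default(tile, selector, default, cast=str):
    """Return the cast text of the first match, or default if nothing matches."""
    node = tile.select_one(selector)
    if node is None:
        return default
    return cast(node.get_text(strip=True))


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for tile in soup.select(".listing-tile"):
    rows.append({
        # "icon + strong" picks the <strong> directly after the matching icon
        "bedrooms": text_or_default(tile, ".icon-bedroom + strong", None, int),
        "bathrooms": text_or_default(tile, ".icon-bathroom + strong", None, int),
        "land_area": text_or_default(tile, ".icon-land-area + strong", "Not specified"),
    })

for row in rows:
    print(row)
```

Because each value is located through the icon next to it, a tile that omits one value does not shift the others, which is the weakness of selecting the <strong> tags by position.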

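【Comments】:

  • Thank you for the great solution! It works at this stage, but I find your code a bit hard to digest because I am new to web scraping (and to Python... though not that bad). Thanks for your time and effort!
  • You are most welcome, and thank you for taking the time to comment. Appreciated.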