【Question title】: Scrape an element preceded by a specific element using BeautifulSoup and CSS selectors instead of lxml and XPath
【Posted】: 2020-08-24 08:23:18
【Question description】:

I want to scrape the "Services/Products" section from this page: https://www.yellowpages.com/deland-fl/mip/ryan-wells-pumps-20533306?lid=1001782175490

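The text is inside a dd element, which always follows this dt element:

<dt>Services/Products</dt>
I wrote the following code with lxml and XPath to scrape this text: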
import requests
from lxml import html

url = ""
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
session = requests.Session()

r = session.get(url, timeout=30, headers=headers)
t = html.fromstring(r.content)

# First text node of the <dd> that follows the "Services/Products" <dt>, or '' if absent
matches = t.xpath('//dd[preceding-sibling::dt[contains(.,"Services/Products")]]/text()[1]')
products = matches[0] if matches else ''

Is there a way to get the same text with BeautifulSoup (and, if possible, CSS selectors) instead of lxml and XPath?

【Comments】:

    Tags: python web-scraping beautifulsoup lxml


    【Solution 1】:

    Try BeautifulSoup with Requests; it is much simpler. Here is some code:

    # BeautifulSoup is an HTML parser. You can find specific elements in a BeautifulSoup object
    from bs4 import BeautifulSoup
    from requests import get
    
    url = "https://www.yellowpages.com/deland-fl/mip/ryan-wells-pumps-20533306?lid=1001782175490"
    
    
    obj = BeautifulSoup(get(url).content, "html.parser")
    
    # Get the section that holds the business details
    business_info = obj.find("section", {"id": "business-info"})
    
    # Get all <dd> elements, so the one you need can be picked from the list
    all_dd = business_info.find_all("dd")
    
    # Pick the <dd> by position (index 2 on this page)
    services_and_products = all_dd[2]
    
    # Gets the text
    text = services_and_products.text
    
    # All Done
    print(text)
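
Since `all_dd[2]` depends on where the element happens to sit on the page, a lookup keyed on the dt label may hold up better across listings. The following is only a minimal sketch: `SAMPLE_HTML` is made-up markup that imitates the dt/dd layout described in the question, and `dd_text_for_label` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Made-up HTML (an assumption) that mimics the dt/dd layout described in the question
SAMPLE_HTML = """
<section id="business-info">
  <dl>
    <dt>Hours</dt>
    <dd>Mon - Fri: 8:00 am - 5:00 pm</dd>
    <dt>Services/Products</dt>
    <dd>Pumps|Well Pumps|Water Tanks</dd>
  </dl>
</section>
"""


def dd_text_for_label(soup, label):
    """Return the text of the <dd> following the first <dt> containing `label`, or ''."""
    for dt in soup.find_all("dt"):
        if label in dt.get_text():
            # find_next_sibling skips whitespace text nodes between the tags
            dd = dt.find_next_sibling("dd")
            return dd.get_text(strip=True) if dd else ""
    return ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
products = dd_text_for_label(soup, "Services/Products")
print(products)
```

On a real page, `soup` would be built from the downloaded response content instead of `SAMPLE_HTML`; the helper returns an empty string when the label is missing, which mirrors the fallback in the lxml version.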
    

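    【Comments】:

    • I don't want to select the element by position; all_dd[2] won't work on other pages because the position varies from page to page.
    【Solution 2】:

    Try something like this on your page: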

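    # Assumes soup = BeautifulSoup(response.content, "html.parser") for the page above
    inf = soup.select_one('section#business-info dl')
    # Locate the <dt> by its label, then step to the <dd> element that follows it
    target = inf.find("dt", string='Services/Products').find_next_sibling("dd")
    for t in target.stripped_strings:
        print(t)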
    

    Output:

    Pumps|Well Pumps|Residential Pumps|Water Pumps|Residential Pumps|Well Pumps|Residential Pumps|Commercial Pumps|Well Pumps|Pumps & Water Tanks|Residential & Commercial|Residential & Commercial|Water Tanks|Pump Maintenance|Pump Maintenance|Free Estimates|Service & Repair|Emergency Service Avail|Residential & Commercial|Service & Repair|Residential & Commercial|Pumps|Bonded|Insured|Water Tanks|Deep Wells|4 Wells|Pumps & Water Tanks 4'' Wells|2' - 12' Diameter Wells
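
The question also asks about CSS selectors. One possible approach, sketched here under assumptions, is the `:-soup-contains()` pseudo-class from Soup Sieve (the selector engine behind `select` and `select_one`) combined with the adjacent-sibling combinator `+`. This needs a reasonably recent Soup Sieve; older releases spell the pseudo-class `:contains()`. `SAMPLE_HTML` below is made-up markup standing in for the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML (an assumption) that mimics the dt/dd layout described in the question
SAMPLE_HTML = """
<section id="business-info">
  <dl>
    <dt>Hours</dt>
    <dd>Mon - Fri: 8:00 am - 5:00 pm</dd>
    <dt>Services/Products</dt>
    <dd>Pumps|Well Pumps|Water Tanks</dd>
  </dl>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "dt:-soup-contains(...) + dd" selects the <dd> immediately after the matching <dt>
selector = 'section#business-info dt:-soup-contains("Services/Products") + dd'
dd = soup.select_one(selector)

# Fall back to an empty string when the label is not on the page
products_css = dd.get_text(strip=True) if dd else ""
print(products_css)
```

Replacing `+` with `~` would match any later dd sibling rather than only the adjacent one, which is closer to what the `preceding-sibling` XPath in the question selects.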
    

    【Comments】:
