【Question title】: BeautifulSoup: get text in a tag
【Posted】: 2019-02-16 22:41:45
【Question description】:

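I am trying to scrape data from Yellow Pages, but I only need the numbered plumbers. I can't get the text, including the number, from h2 class='n'. I can get the class="business-name" text, but I only want the numbered plumbers, not the ads. What is my mistake? Thank you very much.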

This is the HTML:

<div class="info">
   <h2 class="n">1.&nbsp;<a class="business-name" href="/austin-tx/mip/johnny-rooter-11404675?lid=171372530" rel="" data-impressed="1"><span>Johnny Rooter</span></a></h2>
</div>

This is my Python code:

import requests
from bs4 import BeautifulSoup as bs

url = "https://www.yellowpages.com/austin-tx/plumbers"
req = requests.get(url)
data = req.content
soup = bs(data, "lxml")
links = soup.findAll("div", {"class": "info"})

for link in links:
        for content in link.contents:
            try:
                print(content.find("h2", {"class": "n"}).text)
            except:
                pass

【Question discussion】:

    Tags: python-3.x web-scraping beautifulsoup


    【Solution 1】:

    You need a different class selector to restrict the match to that section:

    import requests
    from bs4 import BeautifulSoup as bs
    
    url = "https://www.yellowpages.com/austin-tx/plumbers"
    req = requests.get(url)
    data = req.content
    soup = bs(data, "lxml")
    links = [item.text.replace('\xa0','') for item in soup.select('.organic h2')]
    print(links)
    

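    .organic is a single class taken from a compound class attribute. Selecting on it restricts the match to the parent element of all the numbered plumbers, so the matched region starts only after the ads.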
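
To illustrate the selector without hitting the live site, here is a minimal offline sketch. The markup below is a hypothetical, simplified stand-in: the `search-results paid` and `search-results organic` wrappers, the sample business names, and the helper name `numbered_listings` are assumptions for illustration, and the real page may be structured differently. It uses the built-in `html.parser` instead of `lxml` only so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real Yellow Pages structure may differ.
SAMPLE_HTML = """
<div class="search-results paid">
  <div class="info">
    <h2 class="n"><a class="business-name" href="/ad"><span>Sponsored Plumbing Co</span></a></h2>
  </div>
</div>
<div class="search-results organic">
  <div class="info">
    <h2 class="n">1.&nbsp;<a class="business-name" href="/a"><span>Johnny Rooter</span></a></h2>
  </div>
  <div class="info">
    <h2 class="n">2.&nbsp;<a class="business-name" href="/b"><span>Example Plumbing</span></a></h2>
  </div>
</div>
"""


def numbered_listings(html):
    """Return the h2 text of every listing under an element with class 'organic'."""
    soup = BeautifulSoup(html, "html.parser")
    # ".organic h2" matches h2 descendants of any element whose class list
    # contains "organic", so the h2 inside the "paid" wrapper is skipped.
    # &nbsp; is decoded to "\xa0"; replace it with a regular space.
    return [h2.get_text().replace("\xa0", " ").strip()
            for h2 in soup.select(".organic h2")]


names = numbered_listings(SAMPLE_HTML)
print(names)
```

Because a class selector matches any one class in a multi-class attribute, `.organic` matches the element with `class="search-results organic"` but not the one with `class="search-results paid"`.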



    【Discussion】:
