Unable to grab phone number and address

【Question title】: Unable to grab phone number and address
【Posted】: 2022-01-07 15:28:22
【Question】:

I have already scraped the title and the website link, but I am unable to extract the phone number and address. How can I get them?

This is what I have so far:

import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.constructionplacements.com/top-construction-companies-in-india/'
req = requests.get(url)

soup = BeautifulSoup(req.content, 'lxml')

# Company headings are <h4> tags whose text starts with a number such as "1."
for h4 in soup.find_all(lambda tag: tag.name == 'h4' and re.search(r'^\d+\.', tag.text)):
    title = h4.text
    # The first link after the heading is taken as the company website
    website = h4.find_next('a')['href']

【Question comments】:

Tags: python web-scraping beautifulsoup

【Solution 1】:

You might want to try this:

Note: not all companies have a phone number.

import requests
from bs4 import BeautifulSoup


def extractor(search_for: str) -> list:
    # `soup` is the module-level list of <p> tags built below
    return [
        p.getText() for p in soup if p.getText(strip=True).startswith(search_for)
    ]


url = 'https://www.constructionplacements.com/top-construction-companies-in-india/'
# Despite its name, `soup` holds every <p> element inside the .post container
soup = BeautifulSoup(requests.get(url).text, "lxml").select(".post p")

phone_numbers = extractor("Phone")
addresses = extractor("Address")

print(len(phone_numbers), len(addresses))

Output:

62 70
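
The two counts differ because some companies list no phone number, so the two lists cannot simply be zipped together by position. One possible way to keep each company's fields together is to walk the elements that follow each numbered <h4> until the next <h4>. The sketch below runs on made-up sample markup (SAMPLE_HTML and the company names are hypothetical), and it assumes the paragraphs are siblings of the heading and use a "Label: value" form, which may not hold on the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating the page layout; the real page may differ.
SAMPLE_HTML = """
<div class="post">
  <h4>1. Acme Builders</h4>
  <p>Acme Builders is a made-up company.</p>
  <p>Address: 12 Example Road, Mumbai</p>
  <p>Phone: +91 00000 00000</p>
  <h4>2. Sample Infra</h4>
  <p>Address: 34 Placeholder Street, Delhi</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

records = []
for h4 in soup.find_all("h4"):
    title = h4.get_text(strip=True)
    if not re.match(r"\d+\.", title):
        continue  # skip headings that are not numbered company entries
    record = {"title": title, "address": None, "phone": None}
    # Look only at the tags between this heading and the next <h4>.
    for sibling in h4.find_next_siblings():
        if sibling.name == "h4":
            break
        if sibling.name != "p":
            continue
        text = sibling.get_text(strip=True)
        if text.startswith("Address"):
            # Keep only the part after the first colon (assumed format).
            record["address"] = text.partition(":")[2].strip()
        elif text.startswith("Phone"):
            record["phone"] = text.partition(":")[2].strip()
    records.append(record)

for record in records:
    print(record)
```

A company without a phone paragraph keeps phone set to None, so every record stays aligned with its own title.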

What this does:

def extractor(search_for: str) -> list:
    return [
        p.getText() for p in soup if p.getText(strip=True).startswith(search_for)
    ]

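It basically iterates over all the <p> elements in the .post section. If the stripped text of p starts with the given phrase search_for, it takes that element and extracts its text via p.getText().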

This logic works for paragraphs that start with Phone or Address.
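
To see the filter in isolation, here is a minimal sketch that runs on made-up markup instead of the live page. SAMPLE_HTML and its contents are hypothetical, and the built-in "html.parser" is used so that lxml is not required. Only the paragraphs whose stripped text begins with the label are kept, so the descriptive paragraph is ignored.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the page layout; the real page may differ.
SAMPLE_HTML = """
<div class="post">
  <h4>1. Acme Builders</h4>
  <p>Acme Builders is a made-up company.</p>
  <p>Address: 12 Example Road, Mumbai</p>
  <p>Phone: +91 00000 00000</p>
  <h4>2. Sample Infra</h4>
  <p> Address: 34 Placeholder Street, Delhi</p>
</div>
"""

# Every <p> element inside the .post container.
paragraphs = BeautifulSoup(SAMPLE_HTML, "html.parser").select(".post p")


def extractor(search_for: str) -> list:
    # Keep only paragraphs whose stripped text starts with the label.
    return [
        p.get_text(strip=True)
        for p in paragraphs
        if p.get_text(strip=True).startswith(search_for)
    ]


phone_numbers = extractor("Phone")
addresses = extractor("Address")

print(phone_numbers)
print(addresses)
```

Stripping before the startswith check matters here: the second address paragraph begins with a space and would otherwise be missed.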

【Comments】:

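p.getText() for p in soup if p.getText(strip=True).startswith(search_for) Could you explain this line so that others can benefit from your answer? How exactly does it know where the phone number and address are? How does it avoid grabbing other text before and after the phone number and address?
@boyenec I've updated the answer.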