BeautifulSoup: how to get the tag values from soup text, and how to iterate through a list of URLs? - Answers

【Question title】: BeautifulSoup: how to get the tag values soup text? and how to iterate through list of URLs?
【Posted】: 2020-07-12 06:01:39
【Question description】:

I am new to Beautiful Soup and Selenium in Python, and I am trying to get the contact/email from each page in a list of URLs. The URLs:

listOfURLs=['https://oooo.com/Number=xxxxx', 'https://oooo.com/Number/yyyyyy', 'https://oooo.com/Number/zzzzzz']

The HTML I am parsing:

<div class="row classicdiv" id="renderContacInfo">
  <div class="col-md-2" style="word-break: break-word;">
    <h6>Contact</h6>
    <h5>Israa S</h5>
  </div>
  <div class="col-md-2" style="word-break: break-word;">
    <h6>Email</h6>
    <h5>israa.s@xxxx.com <br/>
    </h5>
  </div>
  <div class="col-md-2" style="word-break: break-word;">
    <h6>Alternate Email</h6>
    <h5></h5>
  </div>
  <div class="col-md-2">
    <h6>Primary Phone</h6>
    <h5>1--1</h5>
  </div>
  <div class="col-md-2">
    <h6>Alternate Phone</h6>
    <h5>
    </h5>
  </div>
</div>

I am trying to loop over the list of URLs, but I can only get the soup for the first URL in the list.

The code I wrote:

driver = webdriver.Chrome(chrome_driver_path)
driver.implicitly_wait(300) 
driver.maximize_window()
driver.get(url)
driver.implicitly_wait(30)
content=driver.page_source
soup=BeautifulSoup(content,'html.parser')
contact_text=soup.findAll("div",{"id":"renderContacInfo"})
output1=''
output2=''
print(contact_text)
time.sleep(100)

for tx in contact_text:
    time.sleep(100)
    output1+=tx.find(text="Email").findNext('h5').text
    output2+=tx.find(text="Contact").findNext('h5').text

My questions:

How do I loop through the list of URLs I have?
How do I filter the email and the contact out of the soup HTML?
Expected output:

URL Contact Email

https://oooo.com/Number=xxxxxxxxxxxxx xxxx@xxx.com

https://oooo.com/Number=yyyyyyyyyyyyy yyyy@yyy.com

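【Comments on the question】:

You need an outer loop: for url in listOfURLs:
@QHarr I like your suggestion about an outer loop over the URLs. Could we do the iteration the way it is done in this question: /60908216/how-to-handle-multiple-urls-in-beautifultsoup-and-convert-the-data-into-datafram/60908470#comment107771591_60908470 That might be another approach. It is the one I was trying to follow in this question: stackoverflow.com/questions/60954426/… Any ideas?

Tags: python selenium web-scraping beautifulsoup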

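【Solution 1】:

This should do it. I removed all the implicit waits (by the way, if you want to go down that route, you should set the wait once at the top of your script, when you instantiate your driver; they were also very long!).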

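from bs4 import BeautifulSoup
from selenium import webdriver

# Create the driver once, before the loop. Setup as in the question;
# recent Selenium releases expect a Service object instead of a path argument.
driver = webdriver.Chrome(chrome_driver_path)

listOfURLs=['https://oooo.com/Number=xxxxx', 'https://oooo.com/Number/yyyyyy', 'https://oooo.com/Number/zzzzzz']
result=[]
for url in listOfURLs:
    driver.get(url)
    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser')
    contact_text = soup.find_all("div", {"id": "renderContacInfo"})

    for tx in contact_text:
        # Locate each label's text node, then read the <h5> that follows it;
        # strip() drops the whitespace left around the value by the markup
        output1=tx.find(text="Contact").find_next('h5').text.strip()
        output2=tx.find(text="Email").find_next('h5').text.strip()
        output=f"{url} {output1} {output2}"
        result.append(output)

driver.quit()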

result is a list that will hold all the collected output, one entry per page, in the form url + contact + email.
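
The lookup used above can be tried offline, without a browser, against the HTML fragment from the question. The following is a minimal sketch and not part of the original answer: the helper name `extract_contact` and the trimmed `SAMPLE_HTML` are assumptions made for illustration. It matches the `<h6>` label exactly and uses `get_text(strip=True)` to drop the whitespace that the trailing `<br/>` leaves around the email.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML shown in the question (assumed representative).
SAMPLE_HTML = """
<div class="row classicdiv" id="renderContacInfo">
  <div class="col-md-2"><h6>Contact</h6><h5>Israa S</h5></div>
  <div class="col-md-2"><h6>Email</h6><h5>israa.s@xxxx.com <br/>
  </h5></div>
  <div class="col-md-2"><h6>Alternate Email</h6><h5></h5></div>
</div>
"""


def extract_contact(html):
    """Hypothetical helper: return (contact, email), or None if the block is missing."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", id="renderContacInfo")
    if block is None:
        return None

    def value_after(label):
        # Find the <h6> whose text is exactly `label`, then read the next <h5>.
        heading = block.find("h6", string=label)
        if heading is None:
            return ""
        return heading.find_next("h5").get_text(strip=True)

    return value_after("Contact"), value_after("Email")


print(extract_contact(SAMPLE_HTML))
```

Returning None when the block is absent lets the calling loop skip pages that have no contact section instead of raising an AttributeError.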

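【Discussion】:

Thanks for your answer. It worked for me. The only thing I noticed is that when I use print(result) I get output from result, but when I use return result I get []. Any idea why this happens, particularly with a list?
Glad it helped. Make sure you use return result inside a function scope, i.e. wrap the code in a function, def <your code here> return result, and pay attention to the indentation.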
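
To illustrate the point about return: the statement only has an effect inside a function, and the caller has to keep the returned value. Below is a minimal sketch in which `collect_contacts`, `fetch_html`, and `extract` are hypothetical names (none of them come from the answer), and the page fetching is replaced by stubs so that the example runs without a browser.

```python
def collect_contacts(urls, fetch_html, extract):
    """Hypothetical wrapper: loop over URLs and return one row per URL."""
    result = []
    for url in urls:
        html = fetch_html(url)          # e.g. driver.get(url); driver.page_source
        contact, email = extract(html)  # e.g. the BeautifulSoup lookups
        result.append(f"{url} {contact} {email}")
    # `return` sits at the function's indentation level, after the loop,
    # so the complete list is handed back once every URL has been processed.
    return result


# Stubs standing in for Selenium and BeautifulSoup (assumptions for the demo).
fake_pages = {"https://oooo.com/Number=xxxxx": ("Israa S", "israa.s@xxxx.com")}
rows = collect_contacts(
    fake_pages,                     # iterating a dict yields its keys (the URLs)
    fetch_html=lambda url: url,     # pretend the "HTML" is just the URL
    extract=lambda html: fake_pages[html],
)
print(rows)
```

If return result were indented inside the for loop, the function would exit after the first URL; and if the caller never assigns the return value (here to rows), the list is simply discarded.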

【Solution 2】:

As @QHarr suggested, use an outer loop over the URLs. Use a regular expression (re) to search the text.

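import re
from bs4 import BeautifulSoup
from selenium import webdriver

listOfURLs=['https://oooo.com/Number=xxxxx', 'https://oooo.com/Number/yyyyyy', 'https://oooo.com/Number/zzzzzz']

# Start one browser for all URLs instead of a new one per URL,
# and set the implicit wait once, before any page is loaded
driver = webdriver.Chrome(chrome_driver_path)
driver.maximize_window()
driver.implicitly_wait(30)

for url in listOfURLs:
    driver.get(url)
    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser')
    print(url)
    # find() returns the first <h6> whose text matches the pattern
    print(soup.find('h6',text=re.compile("Contact")).find_next('h5').text)
    print(soup.find('h6',text=re.compile("Email")).find_next('h5').text)

driver.quit()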
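
One caveat with the unanchored pattern: re.compile("Email") also matches the label "Alternate Email". The code above still prints the primary email, because soup.find returns the first match in document order and "Email" comes before "Alternate Email" in this HTML. The small sketch below uses only the re module and the labels from the question to show the difference an anchored pattern makes; the variable names are illustrative, not from the answer.

```python
import re

# The <h6> labels from the HTML in the question, in document order.
labels = ["Contact", "Email", "Alternate Email", "Primary Phone", "Alternate Phone"]

loose = re.compile("Email")            # substring match, as in the answer
strict = re.compile(r"^\s*Email\s*$")  # the whole label must be "Email"

loose_hits = [label for label in labels if loose.search(label)]
strict_hits = [label for label in labels if strict.search(label)]

print(loose_hits)   # ['Email', 'Alternate Email']
print(strict_hits)  # ['Email']
```

The anchored form matters mainly when collecting every match (for example with find_all) or when the order of the labels on the page is not guaranteed.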

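【Discussion】:

Hello dear KunduK - many thanks for your solution with the loop. This is very interesting. - Mille Grazie - yours, zero
Hello dear KunduK - many thanks for your answer: here you show most of what I need for my own question, which can be seen on this site: questions/60954426/writing-a-loop -beautifulsoup-and-lxml-for-getting-page-content-in-a-page-to-pag - it would be great if you took a look. Things such as collecting several pieces of information from one page, gathering them into a single output, and then iterating over a list of URLs. I am trying to apply these techniques to my problem. I would be glad if you had a look at the question mentioned above and lent me a hand. Thanks in advance! - yours, zero.
Dear KunduK - once again, I like your answer, and I would gladly click the hollow button below the downvote button, but all I can see right now is the so-called timeline. Maybe I will still find what you mean and suggested I do. ... Perhaps you have some ideas about my question; I have just added the goal and the purpose behind it. Many thanks in advance. By the way: I have learned from you these past few days. ;)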