Cannot scrape protected email from website: answers

【Question title】: Cannot scrape protected email from website
【Posted】: 2019-10-10 08:28:51
【Question】:

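I want to scrape email addresses from this website, but they are protected. They are visible on the site, but when I scrape the page the protected emails come back in an obfuscated form instead of as readable text.

I tried scraping it and got this result: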

<a href="/cdn-cgi/l/email-protection#d5a7bba695b9a6b0b2fbb6bab8"><span class="__cf_email__" data-cfemail="c0b2aeb380acb3a5a7eea3afad">[email protected]</span></a>

My code:

from bs4 import BeautifulSoup as bs
import requests
import re

# `header` was not defined in the original snippet; a plain browser-like
# User-Agent is assumed here so that the code runs.
header = {'User-Agent': 'Mozilla/5.0'}

r = requests.get('https://www.accesswire.com/api/newsroom.ashx')
p = re.compile(r" \$\('#newslist'\)\.after\('(.*)\);")
html = p.findall(r.text)[0]
soup = bs(html, 'lxml')
headlines = [item['href'] for item in soup.select('a.headlinelink')]

for head in headlines:
    response2 = requests.get(head, headers=header)
    soup2 = bs(response2.content, 'html.parser')

    print([a for a in soup2.select("a")])

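I want the email address that appears in the body text, for example "Email: theramedhealthcorp@gmail.com", which comes from this page: https://www.accesswire.com/546295/Theramed-Provides-Update-on-New-Sales-Channel-for-Nevada-Facility. The email is protected, though. How can I scrape it as plain text, like a real email address? Thanks.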

【Question comments】:

Tags: python selenium email beautifulsoup data-protection

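【Solution 1】:

I tried your code first and also got [email protected].

Then I realized that the site is probably filling in this data with JavaScript, so it is missing from the raw HTML that requests receives.

You can get the job done with selenium or any lightweight browser.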
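The Selenium route mentioned above could look roughly like the sketch below. It is not the code used in this answer. It assumes that Chrome and a matching driver are available on the machine and that the page's scripts have finished by the time `page_source` is read. The names `fetch_rendered_html`, `link_texts_with_at`, and `LinkTextCollector` are made up for illustration.

```python
import re
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the visible text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.texts[-1] += data


def link_texts_with_at(html):
    """Return link texts that look like a plain email address."""
    parser = LinkTextCollector()
    parser.feed(html)
    # Whitespace is not allowed, so the "[email protected]" placeholder is skipped.
    pattern = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
    return [t.strip() for t in parser.texts if pattern.fullmatch(t.strip())]


def fetch_rendered_html(url):
    """Open the page in headless Chrome and return the HTML after scripts ran."""
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```

Calling `link_texts_with_at(fetch_rendered_html(url))` with one of the article URLs from the question should then return the addresses as ordinary text, provided the browser has already replaced the placeholders.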

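I used the PyQt5 library to open the page the way a JavaScript-enabled browser would, then took the rendered source from it and ran the usual BeautifulSoup code on that.

Prerequisite install commands (if you are a Windows user):

To install PyQt5: pip install pyqt5

The Windows distribution of PyQt5 does not include PyQtWebEngine, so it has to be installed separately:

pip install PyQtWebEngine

To render JavaScript-based pages I followed SentDex's video, which uses PyQt4: https://www.youtube.com/watch?v=FSH77vnOGqU

Since that video targets PyQt4, this StackOverflow answer helped me port the code to PyQt5:

https://stackoverflow.com/a/44432380/8810517

My code: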

import sys

from bs4 import BeautifulSoup as bs
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEnginePage


class Client(QWebEnginePage):
    """Load a URL in a QWebEnginePage and keep the rendered HTML."""

    def __init__(self, url):
        # A QApplication must exist before any QWebEnginePage is created.
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)

        self.html = ""
        self.loadFinished.connect(self.on_page_load)

        self.load(QUrl(url))
        # Block here until Callable() calls self.app.quit().
        self.app.exec_()

    def on_page_load(self):
        # toHtml() is asynchronous and returns nothing; the HTML is
        # delivered to the callback instead.
        self.toHtml(self.Callable)

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()


url = "https://www.accesswire.com/546227/InterRent-Announces-Voting-Results-from-the-2019-Annual-and-Special-Meeting"

client_response = Client(url)

soup = bs(client_response.html, 'html.parser')
# Take the last table on the page and print the text of its links.
tables = soup.find_all('table')
for a in tables[-1].find_all('a'):
    print(a.text)

Output:

mmcgahan@interrentreit.com
bcutsey@interrentreit.com
cmillar@interrentreit.com
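As an alternative to rendering the page, the `data-cfemail` attribute in the question's output can sometimes be decoded directly. The sketch below is not part of this answer's method. It assumes the markup comes from Cloudflare's email obfuscation and that the encoding is the commonly described one: the first byte of the hex string is an XOR key for every following byte. The function names `decode_cfemail` and `find_cfemails` are made up for illustration.

```python
import re


def decode_cfemail(encoded):
    """Decode the hex string from a data-cfemail attribute.

    Assumed scheme: the first byte is an XOR key for all following bytes.
    """
    data = bytes.fromhex(encoded)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def find_cfemails(html):
    """Return the decoded address for every data-cfemail attribute in html."""
    pattern = r'data-cfemail="([0-9a-fA-F]+)"'
    return [decode_cfemail(m) for m in re.findall(pattern, html)]


# The element quoted in the question, used as sample input.
snippet = (
    '<a href="/cdn-cgi/l/email-protection#d5a7bba695b9a6b0b2fbb6bab8">'
    '<span class="__cf_email__" data-cfemail="c0b2aeb380acb3a5a7eea3afad">'
    '[email protected]</span></a>'
)
print(find_cfemails(snippet))  # ['rns@lseg.com']
```

If the assumption holds for the whole site, `find_cfemails(response2.text)` would work with plain requests and no browser. If the site changes its protection scheme, rendering the page as shown above remains the safer option.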

【Comments】: