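【Posted】: 2019-10-10 08:28:51
【Question】:
I want to scrape email addresses from this website, but they are protected. They are visible on the site, but when I scrape them the protected emails come back encoded.
I tried scraping and got this result: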
<a href="/cdn-cgi/l/email-protection#d5a7bba695b9a6b0b2fbb6bab8"><span class="__cf_email__" data-cfemail="c0b2aeb380acb3a5a7eea3afad">[email protected]</span></a>
My code:
from bs4 import BeautifulSoup as bs
import requests
import re

# Assumed request headers; the original snippet used `header` without defining it
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get('https://www.accesswire.com/api/newsroom.ashx')
# Capture the HTML string passed to $('#newslist').after(...)
p = re.compile(r" \$\('#newslist'\)\.after\('(.*)\);")
html = p.findall(r.text)[0]
soup = bs(html, 'lxml')
headlines = [item['href'] for item in soup.select('a.headlinelink')]
for head in headlines:
    response2 = requests.get(head, headers=headers)
    soup2 = bs(response2.content, 'html.parser')
    print([a for a in soup2.select("a")])
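I want the email addresses from the body text, for example "Email: theramedhealthcorp@gmail.com". This email comes from this page: https://www.accesswire.com/546295/Theramed-Provides-Update-on-New-Sales-Channel-for-Nevada-Facility. The email is protected, though. How can I scrape it as plain text, like a real email address? Thanks.
【Comments】: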
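The `data-cfemail` attribute in the scraped output looks like Cloudflare's email obfuscation. That scheme appears to hex-encode the address and XOR every byte with the first byte, which serves as the key. Here is a minimal sketch under that assumption. The names `decode_cfemail` and `extract_emails` are hypothetical helpers, not part of any library, and the regex only handles double-quoted attribute values.

```python
import re

# Matches the hex payload of a data-cfemail="..." attribute (double quotes assumed).
CFEMAIL_RE = re.compile(r'data-cfemail="([0-9a-fA-F]+)"')


def decode_cfemail(encoded):
    """Decode a data-cfemail hex string, assuming the first byte is the XOR key."""
    data = bytes.fromhex(encoded)
    key = data[0]
    # XOR each remaining byte with the key to recover the original characters.
    return bytes(b ^ key for b in data[1:]).decode("utf-8")


def extract_emails(html):
    """Return the decoded addresses for every data-cfemail attribute in the HTML."""
    return [decode_cfemail(value) for value in CFEMAIL_RE.findall(html)]


# The snippet from the question, used as sample input.
sample = (
    '<a href="/cdn-cgi/l/email-protection#d5a7bba695b9a6b0b2fbb6bab8">'
    '<span class="__cf_email__" data-cfemail="c0b2aeb380acb3a5a7eea3afad">'
    '[email protected]</span></a>'
)
print(extract_emails(sample))
```

With BeautifulSoup the same values could be read via `soup2.select('[data-cfemail]')` and passed to `decode_cfemail(tag['data-cfemail'])`. If the site changes its obfuscation, this decoding will likely stop working.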
Tags: python selenium email beautifulsoup data-protection