Python 3 - Email appears as "..." in HTML downloaded page

【Question title】: Python 3 - Email appears as "..." in HTML downloaded page
【Posted】: 2017-01-16 20:33:22
【Question】:

I need to get the emails from a page like this one: http://bari.geometriapulia.net/index.php/albo-lista/userprofile/abbatantuono-giuseppe

To do so, I use the following code:

from bs4 import BeautifulSoup
import urllib.request
import re

url = "http://bari.geometriapulia.net/index.php/albo-lista/userprofile/abbatantuono-giuseppe"

content = urllib.request.urlopen(url).read()
soup = BeautifulSoup(content, "lxml")

for link in soup.find_all("a", href=re.compile(r"^mailto:")):
    if "@" in str(link.string):
        print(link.string)

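This code does not find the emails I want, which are the two you can see under the profile picture. It only finds the one placed at the bottom of the page, which I am not interested in.

To understand why, I downloaded the whole HTML page. Where the emails should be, you can read "..." instead, with a warning on the lines below: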

<td class="fieldCell" id="cbfv_84"><span class="cbMailRepl" id="cbMa92357">...</span><noscript> 
This e-mail address is protected by spam bot, you must activate JavaScript in you browser in order to visualize it
</noscript>
</td>
</tr>
<tr class="sectiontableentry2 cbft_emailaddress" id="cbfr_97">
<td class="titleCell"><label for="cbfv_97" id="cblabcbfv_97">e-mail:</label></td>
<td class="fieldCell" id="cbfv_97"><span class="cbMailRepl" id="cbMa92358">...</span><noscript> 
 This e-mail address is protected by spam bot, you must activate JavaScript in you browser in order to visualize it

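So I checked whether JavaScript is enabled in my browser, and it is, as you can see from this screenshot: http://prntscr.com/dwgl7w

So how can I download the page without the anti-spam-bot system "cutting" the emails out of the HTML code? Is this even possible?

【Comments】: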

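Your browser has nothing to do with this. The page was never meant to be read by a Python script, so the warning text is misleading in your case. Your script requests and reads the site just like a user would, but at that point the page expects the client to run a small piece of JavaScript (a simple anti-bot check, I guess). Your script cannot run JavaScript, so the email is never filled in and the protection works as intended.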
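To confirm this from the script's side, a minimal sketch can list the placeholders that are still unfilled in the downloaded HTML. It uses only the standard library; `PlaceholderFinder` and `find_placeholders` are hypothetical helper names, and the `cbMailRepl` class is taken from the HTML excerpt in the question. It does not handle spans nested inside the placeholder.

```python
from html.parser import HTMLParser


class PlaceholderFinder(HTMLParser):
    """Collect (id, text) for every <span class="cbMailRepl"> in a page."""

    def __init__(self):
        super().__init__()
        self.placeholders = []   # list of (span id, text inside the span)
        self._inside = False     # True while reading a placeholder span
        self._current_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "cbMailRepl" in classes:
            self._inside = True
            self._current_id = attrs.get("id")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._inside:
            text = "".join(self._text).strip()
            self.placeholders.append((self._current_id, text))
            self._inside = False


def find_placeholders(html):
    """Return the placeholder spans found in the static HTML."""
    finder = PlaceholderFinder()
    finder.feed(html)
    finder.close()
    return finder.placeholders
```

If every entry returned for the page fetched with urllib still has "..." as its text, the addresses are simply not present in the static HTML, and no amount of parsing will recover them.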

Tags: python python-3.x web-scraping beautifulsoup httprequest

【Solution 1】:

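The email addresses are generated by JavaScript.

requests and urllib only download the HTML; they cannot execute JavaScript. Use Selenium, which drives a real browser.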
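A minimal sketch of that approach, under several assumptions: Selenium 4 and Firefox are installed, the page's script fills or replaces the `span.cbMailRepl` placeholders shown in the question, and the profile emails live in `td.fieldCell` cells. `extract_emails` and `fetch_profile_emails` are hypothetical helper names, and the email regex is a rough illustration, not a full address validator.

```python
import re

# Rough pattern for email-like strings; an illustrative assumption.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the email-like strings in text, de-duplicated, in order."""
    found = []
    for address in EMAIL_RE.findall(text):
        if address not in found:
            found.append(address)
    return found


def fetch_profile_emails(url, timeout=10):
    """Render url in a real browser and return emails from the profile cells."""
    # Imported here so extract_emails() stays usable without Selenium.
    from selenium import webdriver
    from selenium.common.exceptions import StaleElementReferenceException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    def placeholders_filled(drv):
        # True once no placeholder span still shows the literal "...".
        spans = drv.find_elements(By.CSS_SELECTOR, "span.cbMailRepl")
        return all(span.text.strip() != "..." for span in spans)

    driver = webdriver.Firefox()  # assumption: Firefox is available
    try:
        driver.get(url)
        WebDriverWait(
            driver, timeout, ignored_exceptions=(StaleElementReferenceException,)
        ).until(placeholders_filled)
        # Read only the profile cells, so the footer email is skipped.
        cells = driver.find_elements(By.CSS_SELECTOR, "td.fieldCell")
        return extract_emails("\n".join(cell.text for cell in cells))
    finally:
        driver.quit()
```

Calling `fetch_profile_emails(url)` with the URL from the question should then return the addresses shown under the profile picture, provided the assumptions above hold for that site. Restricting the search to `td.fieldCell` is what keeps the address at the bottom of the page out of the result.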

【Comments】:

I get the email at the bottom of the page, but I want the ones above it, as I said in the original post: prntscr.com/dwrjs7