Trying to get the format that page_source from Selenium provides, but with requests

【Question title】: Trying to get the format that page_source from Selenium provides, but with requests
【Posted】: 2020-08-19 16:18:44
【Question】:

I am trying to extract the email address from the git repository patch found here:

https://github.com/kyleschiess/Apex/commit/a32f5d426c8c51e41b891b0d35aa860f23c5b11b.patch

A solution using Selenium works perfectly:

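import re
from bs4 import BeautifulSoup

# driver is a Selenium WebDriver that has already loaded the .patch URL above
soup = BeautifulSoup(driver.page_source, 'lxml')
y = soup.find('pre')
text = y.text
email = re.findall(r'<(.+?)>', text)
email[0]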

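gives me '38440047+kyleschiess@users.noreply.github.com'

This works because y.text does not strip the email, which sits between '<' and '>'.

Selenium keeps giving me timeout problems, so I would rather use requests.

Now, with requests, when I do this: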

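import re
import requests
from bs4 import BeautifulSoup

# patchURL is the .patch URL shown above
r = requests.get(patchURL)
soup = BeautifulSoup(r.text, 'lxml')
y = soup.find('p')  # different format for some reason
text = y.text
email = re.findall(r'<(.+?)>', text)
email[0]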

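I get '2!'.

I found that, with requests, converting the soup to text removes everything between '<' and '>'.

With Selenium, anything between '<' and '>' that is not an HTML tag is escaped as '&lt;' and '&gt;', so .text does not strip the email.

How can I get the email using requests, urllib, or some other approach?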
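The escaping difference described above can be illustrated without a browser. This is only a sketch using the standard library's `html` module. It assumes, as the question observes, that the browser-rendered source carries the angle brackets as `&lt;` and `&gt;`. The `raw_line` value is a hypothetical header line built around the address from the question, not text copied from the patch.

```python
import html
import re

# Hypothetical sample line; only the address comes from the question.
raw_line = "Author Name <38440047+kyleschiess@users.noreply.github.com>"

# Roughly what an escaped page source would contain for that text (assumption).
escaped_line = html.escape(raw_line, quote=False)

# Decoding the entities restores the literal angle brackets,
# which is why the regex can still find the address afterwards.
decoded_line = html.unescape(escaped_line)

found_in_escaped = re.findall(r"<(.+?)>", escaped_line)
found_in_decoded = re.findall(r"<(.+?)>", decoded_line)

print(escaped_line)
print(found_in_escaped)
print(found_in_decoded)
```

In the escaped form there is no literal '<' for an HTML parser to mistake for a tag, so the address survives parsing and only reappears in angle brackets once the entities are decoded.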

【Comments】:

Tags: html selenium-webdriver beautifulsoup python-requests

【Solution 1】:

I think this is what you are looking for:

import requests
import re
url = "https://github.com/kyleschiess/Apex/commit/a32f5d426c8c51e41b891b0d35aa860f23c5b11b.patch"

text = requests.get(url).text
email = re.findall(r'<(.+?)>',text)[0]

print(email)

Output:

38440047+kyleschiess@users.noreply.github.com
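Since the question also mentions urllib, here is a minimal standard-library sketch of the same idea with an explicit timeout. The helper names `fetch_text` and `first_bracketed` and the 10-second default are assumptions made for illustration; they are not part of the original answer.

```python
import re
import urllib.request


def fetch_text(url, timeout=10.0):
    """Download url and return the response body decoded as text."""
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses
    # and a timeout-related error if the server is too slow.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def first_bracketed(text):
    """Return the first substring enclosed in <...>, or None if there is none."""
    match = re.search(r"<(.+?)>", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    url = "https://github.com/kyleschiess/Apex/commit/a32f5d426c8c51e41b891b0d35aa860f23c5b11b.patch"
    print(first_bracketed(fetch_text(url)))
```

Returning None instead of indexing `[0]` avoids an IndexError when the text contains no bracketed value.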

I also timed it. The time will vary with your internet connection speed, but the email is retrieved in about a third of a second:

import requests
import re
import time

start = time.time()
url = "https://github.com/kyleschiess/Apex/commit/a32f5d426c8c51e41b891b0d35aa860f23c5b11b.patch"

text = requests.get(url).text
email = re.findall(r'<(.+?)>',text)[0]

print(email)
print(time.time() - start)

Output:

38440047+kyleschiess@users.noreply.github.com
0.28287720680236816

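Since the page is not HTML (as far as I can tell, it is really just plain text), there is no point in using BeautifulSoup. To get the page's text, just run requests.get(url).text; to get the email, filter that text with a regular expression.

【Comments】: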
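Taking the first `<...>` match works when the author header is the first bracketed text in the patch. As a slightly stricter sketch, the regular expression can be anchored to the `From:` header line that git's patch format places near the top. The `extract_author` helper and the `SAMPLE_PATCH` text are hypothetical and exist only for illustration; the sample is not the content of the real patch.

```python
import re

# Matches a header line such as: From: Some Name <address@example.com>
FROM_HEADER = re.compile(
    r"^From:[ \t]*(?P<name>[^<\n]*?)[ \t]*<(?P<email>[^<>\s]+)>[ \t]*$",
    re.MULTILINE,
)


def extract_author(patch_text):
    """Return (name, email) from the first From: header, or None if absent."""
    match = FROM_HEADER.search(patch_text)
    if match is None:
        return None
    return match.group("name"), match.group("email")


# Hypothetical sample in the general shape of a git patch header.
SAMPLE_PATCH = """\
From 0000000000000000000000000000000000000000 Mon Sep 17 00:00:00 2001
From: Example Author <author@example.com>
Subject: [PATCH] Example change with <angle brackets> in the subject
"""

print(extract_author(SAMPLE_PATCH))
```

Anchoring on the header keeps bracketed text elsewhere in the patch, for example in the subject or the diff body, from being mistaken for the address.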