Downloading an Excel file from a URL in pandas (post authentication): answers

【Question title】: Downloading excel file from url in pandas (post authentication)
【Posted】: 2020-05-30 07:49:16
【Question】:

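I am facing a strange problem that I do not fully understand because of my limited HTML knowledge.

I want to download an Excel file from a website after logging in. The file_url (link 1) is: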

file_url="https://xyz.xyz.com/portal/workspace/IN AWP ABRL/Reports & Analysis Library/CDI Reports/CDI_SM_Mar'20.xlsx"

The file has a share button that provides a second link (link 2) to the same file:

file_url2='http://xyz.xyz.com/portal/traffic/4a8367bfd0fae3046d45cd83085072a0'

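When I read link 2 with requests.get (after logging in to the session), I can read the Excel file into pandas. However, link 2 does not serve my purpose, because I cannot schedule my report periodically (by changing Mar'20 to Apr'20, and so on). Link 1 suits my purpose, but with r = requests.get(...), r.content gives the following: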

b'\n\n\n\n\n\n\n\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n\t\n\t<body bgcolor="#FFFFFF">\n\t\n\n\t<script language="javascript">\n\t\t<!-- \n\t\t\ttop.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar\'20.xlsx";\t\n\t\t-->\n\t</script>\n\t</body>\n</html>'

I have tried every kind of URL encoding and decoding, but I cannot make sense of the alphanumeric URL (link 2).
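
Since the goal is to change Mar'20 to Apr'20 on a schedule, link 1 can be generated from a date instead of being typed by hand. The following is only a minimal sketch: build_report_url is a hypothetical helper, and it assumes that every monthly report lives in the same folder and follows the CDI_SM_<Mon>'<yy>.xlsx naming seen in the March file.

```python
from datetime import date
from urllib.parse import quote

# Assumption: base URL and folder path are copied from file_url in the question.
BASE = "https://xyz.xyz.com/portal/workspace/"
FOLDER = "IN AWP ABRL/Reports & Analysis Library/CDI Reports/"
# Fixed English abbreviations, so the result does not depend on the system locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def build_report_url(month: date) -> str:
    """Return the percent-encoded link 1 URL for the given month (hypothetical helper)."""
    # Assumption: every report is named CDI_SM_<Mon>'<yy>.xlsx, like the March example.
    filename = f"CDI_SM_{MONTHS[month.month - 1]}'{month.year % 100:02d}.xlsx"
    # Encode spaces as %20 but leave "/", "&" and "'" untouched, which matches
    # the form of the URL inside the top.location.href redirect.
    return BASE + quote(FOLDER + filename, safe="/&'")


print(build_report_url(date(2020, 3, 1)))
print(build_report_url(date(2020, 4, 1)))
```

For March 2020 this reproduces the percent-encoded URL that appears in the response body; whether later months actually exist under the same name is something only the portal can confirm.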

My Python code (working) is:

from io import BytesIO

import pandas as pd
import requests

url = 'http://xyz.xyz.com/portal/site'
username = ''
password = ''
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
# Authenticate against the portal; the session keeps any cookies for later requests
r = s.get(url, auth=(username, password), verify=False, headers=headers)
# Request link 1 (file_url as defined above) with the same session
r2 = s.get(file_url, verify=False, allow_redirects=True)
r2.content
# df = pd.read_excel(BytesIO(r2.content))
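
Before handing r2.content to pd.read_excel, it can help to check whether the body is a workbook at all or an HTML page like the one shown above. The sketch below relies on the fact that .xlsx files are ZIP archives and therefore start with the bytes PK\x03\x04. The helper names are hypothetical, and the sample bytes are stand-ins rather than real portal responses.

```python
def looks_like_xlsx(body: bytes) -> bool:
    """Return True if the bytes start with the ZIP signature used by .xlsx files."""
    # .xlsx workbooks are ZIP archives, and ZIP archives begin with b"PK\x03\x04".
    return body.startswith(b"PK\x03\x04")


def looks_like_html(body: bytes) -> bool:
    """Return True if the bytes look like an HTML page rather than a binary file."""
    # Skip leading whitespace (the portal response starts with many newlines).
    head = body.lstrip()[:9].lower()
    return head.startswith((b"<html", b"<!doctype"))


# Offline demonstration with stand-in bytes; in the question this would be r2.content.
html_body = b'\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n</html>'
xlsx_body = b"PK\x03\x04" + b"\x00" * 8  # not a real workbook, only the signature

for name, body in [("html", html_body), ("xlsx", xlsx_body)]:
    print(name, "xlsx?", looks_like_xlsx(body), "html?", looks_like_html(body))
```

A check like this only inspects the first few bytes, so it distinguishes "workbook" from "redirect page" but does not prove that the workbook is valid.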

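【Comments】:

It seems to be giving you HTML with JavaScript that redirects the page to the new URL you can see in "top.location.href=...", but requests cannot run JavaScript. This is a simple way to block some scripts/bots. You have to extract the URL from the "top.location.href=..." string and use it in the next requests.get() call.

Tags: python python-requests urllib python-requests-html python-responses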

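【Solution 1】:

You are getting HTML with JavaScript that redirects the browser to a new URL. But requests cannot run JavaScript. This is a simple way to block some simple scripts/bots.

But HTML is just a string, so you can use string functions to extract the URL from it and then use that URL with requests to get the file.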

# Response body returned for link 1 (r2.content in the question)
content = b'\n\n\n\n\n\n\n\n\n\n<html>\n\t<head>\n\t\t<title></title>\n\t</head>\n\t\n\t<body bgcolor="#FFFFFF">\n\t\n\n\t<script language="javascript">\n\t\t<!-- \n\t\t\ttop.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar\'20.xlsx";\t\n\t\t-->\n\t</script>\n\t</body>\n</html>'

text = content.decode()
print(text)
print('\n---\n')

# Take everything between href=" and the closing ";
start = text.find('href="') + len('href="')
end   = text.find('";', start)

url = text[start:end]
print('url:', url)

# s is the authenticated requests.Session from the question
response = s.get(url)

Result:

<html>
    <head>
        <title></title>
    </head>

    <body bgcolor="#FFFFFF">


    <script language="javascript">
        <!-- 
            top.location.href="https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx"; 
        -->
    </script>
    </body>
</html>

---

url: https://xyz.xyz.com/portal/workspace/IN%20AWP%20ABRL/Reports%20&%20Analysis%20Library/CDI%20Reports/CDI_SM_Mar'20.xlsx
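
The same extraction can be wrapped in a small reusable helper. This is only a sketch under stated assumptions: extract_js_redirect and fetch_following_js_redirect are hypothetical names, the regular expression assumes the portal always writes the redirect as top.location.href="..." with double quotes, and session is expected to be an already authenticated requests.Session (any object with a get(url) method that returns something with a .content attribute works).

```python
import re
from typing import Optional

# Matches top.location.href="..." as seen in the response above.
# Assumption: the portal always uses this exact double-quoted form.
_JS_REDIRECT = re.compile(r'top\.location\.href\s*=\s*"([^"]+)"')


def extract_js_redirect(html: str) -> Optional[str]:
    """Return the URL assigned to top.location.href, or None if there is none."""
    match = _JS_REDIRECT.search(html)
    return match.group(1) if match else None


def fetch_following_js_redirect(session, url, max_hops=3):
    """Fetch url and follow up to max_hops JavaScript redirects (hypothetical helper)."""
    response = session.get(url)
    for _ in range(max_hops):
        body = response.content
        # Only inspect bodies that look like HTML; a real workbook is binary.
        if not body.lstrip()[:5].lower().startswith(b"<html"):
            break
        target = extract_js_redirect(body.decode(errors="replace"))
        if target is None:
            break
        # The target may equal the URL just requested; it is requested again
        # anyway, mirroring the second s.get(url) in the answer.
        response = session.get(target)
    return response


# Offline demonstration of the extraction step on a shortened sample page.
sample = '<html><script>\n<!--\n\ttop.location.href="https://xyz.xyz.com/portal/a%20b.xlsx";\n-->\n</script></html>'
print(extract_js_redirect(sample))
```

With the session from the question, usage would look like response = fetch_following_js_redirect(s, file_url), followed by pd.read_excel(BytesIO(response.content)). The max_hops limit is there so that a page which keeps redirecting cannot cause an endless loop.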

【Comments】:

Works like a charm! Thanks a lot!