How to get the URL of a download button and read the CSV file in Python? Answers

【Question title】: How to get the url of download button and read the CSV file in Python?
【Posted】: 2021-06-02 22:35:01
【Question description】:

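I am using Python in Google Colab and trying to read a CSV file from this link: https://www.macrotrends.net/stocks/charts/AAPL/apple/stock-price-history

If you scroll down a little, you will see a download button. I want to get its link using selenium or bs and then read the CSV file. I am trying something like this: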

# install packages
!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

# load packages
import pandas as pd
from selenium import webdriver
import sys

# run selenium and read the csv file
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
driver.get('https://www.macrotrends.net/stocks/charts/AAPL/apple/stock-price-history')  # put the address of your page here
btn = driver.find_element_by_tag_name('button')
btn.click()
df = pd.read_csv('##.csv')

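It seems to work up to the btn.click() part, but after that I get an error, because nothing tells me the download button's link or the file name. Could you help? It would be much appreciated.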

【Question discussion】:

What error are you getting? Please add the stack trace.
@PatrickKlein btn.click() did not do anything. I just checked, and chitown88's method works perfectly.

Tags: python selenium csv selenium-webdriver web-scraping

【Solution 1】:

There is no need for Selenium. The data is embedded in a <script> tag.

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

t = 'AAPL'
url = 'https://www.macrotrends.net/assets/php/stock_price_history.php?t={}'.format(t)

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

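# The price history is a JSON array assigned to the JavaScript variable dataDaily
scripts = soup.find_all('script',{'type':'text/javascript'})
for script in scripts:
    if 'var dataDaily' in str(script):
        # Keep the text between the first '[' and the first '];' after it
        jsonStr = '[' + str(script).split('[',1)[-1].split('];')[0] + ']'
        jsonData = json.loads(jsonStr)
        break
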
df = pd.DataFrame(jsonData)
df = df.rename(columns={'o':'open','h':'high','l':'low','c':'close','d':'date','v':'volume'})
df.to_csv('MacroTrends_Data_Download_{}.csv'.format(t), index=False)

Output:

print(df)
             date      open      high  ...   volume     ma50    ma200
0      1980-12-12    0.1012    0.1016  ...  469.034      NaN      NaN
1      1980-12-15    0.0964    0.0964  ...  175.885      NaN      NaN
2      1980-12-16    0.0893    0.0893  ...  105.728      NaN      NaN
3      1980-12-17    0.0910    0.0915  ...   86.442      NaN      NaN
4      1980-12-18    0.0937    0.0941  ...   73.450      NaN      NaN
          ...       ...       ...  ...      ...      ...      ...
10135  2021-02-25  124.6800  126.4585  ...  148.200  131.845  112.241
10136  2021-02-26  122.5900  124.8500  ...  164.560  131.838  112.460
10137  2021-03-01  123.7500  127.9300  ...  116.308  131.840  112.716
10138  2021-03-02  128.4100  128.7200  ...  102.261  131.790  112.957
10139  2021-03-03  124.8100  125.7100  ...  111.514  131.661  113.184

[10140 rows x 8 columns]
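
Since the original goal was to read the data as a CSV file, the file written by to_csv can be loaded back with pd.read_csv. The following is a minimal sketch, assuming records with the short keys used in the answer. The two sample rows, their values, and the file name sample_prices.csv are placeholders made up for this example, not real price data.

```python
import os
import tempfile

import pandas as pd

# Placeholder records using the short keys from the answer (values are made up).
sample = [
    {'d': '2021-03-02', 'o': 1.0, 'h': 2.0, 'l': 0.5, 'c': 1.5, 'v': 10.0},
    {'d': '2021-03-03', 'o': 1.5, 'h': 2.5, 'l': 1.0, 'c': 2.0, 'v': 12.0},
]

# Same renaming step as in the answer.
df = pd.DataFrame(sample).rename(
    columns={'o': 'open', 'h': 'high', 'l': 'low', 'c': 'close', 'd': 'date', 'v': 'volume'}
)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'sample_prices.csv')
    df.to_csv(csv_path, index=False)
    # parse_dates converts the date column from strings to datetime values.
    loaded = pd.read_csv(csv_path, parse_dates=['date'])

print(loaded.dtypes)
```

With the real data, the same read_csv call would point at the MacroTrends_Data_Download_AAPL.csv file produced by the answer's code.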

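【Discussion】:

Is this a typo in your answer? for script in scripts: script=scripts[5] ...
Ah, shoot. Not necessarily a typo, but it was there while I was testing/debugging. I forgot to take it out. I'll fix it. Thanks for catching that.
Thank you very much for your answer. I am trying to apply this to other websites. Could you explain the lines under the 'for script in scripts' loop? I tried searching for "var dataDaily" in the source HTML of the example URL but could not find it, so I cannot figure out what the if statement is doing. Thanks!
This data is found specifically in a <script> tag, in JSON format. The JavaScript variable it is assigned to is var dataDaily, so I use that name to pull out the string. This is specific to that particular site. Other websites may not have the data in a <script> tag, and if they do, they may store it under a different variable name.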
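
The extraction step discussed here can be tried offline on a small HTML string. The sketch below is illustrative only: the HTML fragment, its values, and the helper name extract_js_array are made up for this example, and it uses a regular expression rather than the split() calls from the answer. Note also that the answer requests the stock_price_history.php URL, not the page URL from the question, which may be why searching the original page source for var dataDaily finds nothing.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical page fragment with placeholder values; the real page may differ.
SAMPLE_HTML = """
<html><body>
<script type="text/javascript">var other = 1;</script>
<script type="text/javascript">
var dataDaily = [{"d": "2021-03-02", "o": 1.0, "c": 1.5},
                 {"d": "2021-03-03", "o": 1.5, "c": 2.0}];
var chart = null;
</script>
</body></html>
"""


def extract_js_array(html, var_name):
    """Return the JSON array assigned to `var <var_name>` in a <script> tag, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    # Non-greedy match from the opening '[' to the first '];' that follows it.
    pattern = re.compile(
        r'var\s+' + re.escape(var_name) + r'\s*=\s*(\[.*?\]);', re.DOTALL
    )
    for script in soup.find_all('script'):
        match = pattern.search(str(script))
        if match:
            return json.loads(match.group(1))
    return None


records = extract_js_array(SAMPLE_HTML, 'dataDaily')
missing = extract_js_array(SAMPLE_HTML, 'dataWeekly')  # name not present in the sample
print(records)
print(missing)
```

Returning None when the variable is absent makes it easier to notice that a different site stores its data under another name, or not in a <script> tag at all.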