【Question Title】: Python: Downloading a file from a webpage by clicking on a link
【Posted】: 2017-03-25 18:35:31
【Question】:

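I have searched the internet for a solution, but none of the ones I found really applies here. I am writing a Python program that predicts the next day's stock price from historical data. I don't need all the historical data since inception that Yahoo Finance provides, only the last 60 days or so. The NASDAQ website offers just the right amount of historical data, and I would like to use that site.

What I want to do is go to a particular stock's profile on NASDAQ, for example (www.nasdaq.com/symbol/amd/historical), and click the "Download this file in Excel Format" link at the very bottom. I inspected the page's HTML to see whether there is an actual link I could fetch with urllib, but all I got was: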

<a id="lnkDownLoad" href="javascript:getQuotes(true);">
                Download this file in Excel Format
            </a>

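There is no link. So my question is: how can I write a Python script that goes to the NASDAQ page for a given stock, clicks the "Download this file in Excel Format" link, and actually downloads the file? Most solutions online require you to know the URL where the file is stored, but in this case I don't have access to it. How do I go about this?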

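【Comments】:

  • Google "python selenium"
  • From what I have read, it helps me interact with a web browser, since it actually opens one. I would like something that works more behind the scenes, where the file is simply downloaded to a specified destination.
  • Do you only need the 3 months of data that are provided by default?
  • If you use the PhantomJS webdriver, it does not need to open a browser.
  • @BillBell Yes, only the last 3 months. I am doing some machine learning, so data from 20 years ago is not that useful. I need more current data.

Tags: python html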


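【Answer 1】:
  1. Using Chrome, go to View > Developer > Developer Tools.
  2. In the Developer Tools UI that opens, switch to the Network tab.
  3. Navigate to the page with the link you need to click, then click the Clear button (the circle-with-slash icon) to wipe all recent activity.
  4. Click the link and see whether any request is made to the server.
  5. If there is one, click on it and see whether you can reverse engineer the API of its endpoint.

Note that this may violate the website's terms of service!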
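
Once step 5 turns up a request, it can often be replayed from Python. The following is a minimal sketch using only the standard library. The endpoint URL, the form field names, and the referer are placeholders invented for illustration, not NASDAQ's actual API; replace them with whatever the Network tab shows.

```python
import shutil
import urllib.parse
import urllib.request


def build_request(endpoint, form_fields, referer):
    """Rebuild a form POST the way the browser's Network tab displayed it."""
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    headers = {
        "Referer": referer,
        "User-Agent": "Mozilla/5.0",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def save_response(request, path):
    """Send the request and stream the response body to a local file."""
    with urllib.request.urlopen(request, timeout=30) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


# Hypothetical values: substitute the URL and fields seen in the Network tab.
request = build_request(
    "https://example.com/hypothetical/export",
    {"symbol": "amd", "range": "3m"},
    "https://example.com/hypothetical/page",
)
print(request.get_method(), request.full_url)
print(request.data)
```

Calling `save_response(request, "amd.csv")` would then perform the download. The sketch only builds and prints the request, so nothing is sent until that call is made.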

【Comments】:

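    【Answer 2】:

    BeautifulSoup seems to be the simplest approach. I checked roughly that the following script produces the same results as those shown on the page. You would just write the results to a file instead of printing them. Note, however, that the column order is different.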

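    import requests
    from bs4 import BeautifulSoup

    URL = 'http://www.nasdaq.com/symbol/amd/historical'

    # Fetch the page and parse it with the lxml parser.
    page = requests.get(URL).text
    soup = BeautifulSoup(page, 'lxml')

    # The historical quotes table sits inside the div with this id.
    table_div = soup.find('div', id='historicalContainer')
    table_rows = table_div.find_all('tr')

    # Skip the first two rows; every later row splits into the six fields printed below.
    for table_row in table_rows[2:]:
        row = tuple(table_row.get_text().split())
        print('"%s",%s,%s,%s,%s,"%s"' % row)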
    

    Output:

    "03/24/2017",14.16,14.18,13.54,13.7,"50,022,400"
    "03/23/2017",13.96,14.115,13.77,13.79,"44,402,540"
    "03/22/2017",13.7,14.145,13.55,14.1,"61,120,500"
    "03/21/2017",14.4,14.49,13.78,13.82,"72,373,080"
    "03/20/2017",13.68,14.5,13.54,14.4,"91,009,110"
    "03/17/2017",13.62,13.74,13.36,13.49,"224,761,700"
    "03/16/2017",13.79,13.88,13.65,13.65,"44,356,700"
    "03/15/2017",14.03,14.06,13.62,13.98,"55,070,770"
    "03/14/2017",14,14.15,13.6401,14.1,"52,355,490"
    "03/13/2017",14.475,14.68,14.18,14.28,"72,917,550"
    "03/10/2017",13.5,13.93,13.45,13.91,"62,426,240"
    "03/09/2017",13.45,13.45,13.11,13.33,"45,122,590"
    "03/08/2017",13.25,13.55,13.1,13.22,"71,231,410"
    "03/07/2017",13.07,13.37,12.79,13.05,"76,518,390"
    "03/06/2017",13,13.34,12.38,13.04,"117,044,000"
    "03/03/2017",13.55,13.58,12.79,13.03,"163,489,100"
    "03/02/2017",14.59,14.78,13.87,13.9,"103,970,100"
    "03/01/2017",15.08,15.09,14.52,14.96,"73,311,380"
    "02/28/2017",15.45,15.55,14.35,14.46,"141,638,700"
    "02/27/2017",14.27,15.35,14.27,15.2,"95,126,330"
    "02/24/2017",14,14.32,13.86,14.12,"46,130,900"
    "02/23/2017",14.2,14.45,13.82,14.32,"79,900,450"
    "02/22/2017",14.3,14.5,14.04,14.28,"71,394,390"
    "02/21/2017",13.41,14.1,13.4,14,"66,250,920"
    "02/17/2017",12.79,13.14,12.6,13.13,"40,831,730"
    "02/16/2017",13.25,13.35,12.84,12.97,"52,403,840"
    "02/15/2017",13.2,13.44,13.15,13.3,"33,655,580"
    "02/14/2017",13.43,13.49,13.19,13.26,"40,436,710"
    "02/13/2017",13.7,13.95,13.38,13.49,"57,231,080"
    "02/10/2017",13.86,13.86,13.25,13.58,"54,522,240"
    "02/09/2017",13.78,13.89,13.4,13.42,"72,826,820"
    "02/08/2017",13.21,13.75,13.08,13.56,"75,894,880"
    "02/07/2017",14.05,14.27,13.06,13.29,"158,507,200"
    "02/06/2017",12.46,13.7,12.38,13.63,"139,921,700"
    "02/03/2017",12.37,12.5,12.04,12.24,"59,981,710"
    "02/02/2017",11.98,12.66,11.95,12.28,"116,246,800"
    "02/01/2017",10.9,12.14,10.81,12.06,"165,784,500"
    "01/31/2017",10.6,10.67,10.22,10.37,"51,993,490"
    "01/30/2017",10.62,10.68,10.3,10.61,"37,648,430"
    "01/27/2017",10.6,10.73,10.52,10.67,"32,563,480"
    "01/26/2017",10.35,10.66,10.3,10.52,"35,779,140"
    "01/25/2017",10.74,10.975,10.15,10.35,"61,800,440"
    "01/24/2017",9.95,10.49,9.95,10.44,"43,858,900"
    "01/23/2017",9.68,10.06,9.68,9.91,"27,848,180"
    "01/20/2017",9.88,9.96,9.67,9.75,"27,936,610"
    "01/19/2017",9.92,10.25,9.75,9.77,"46,087,250"
    "01/18/2017",9.54,10.1,9.42,9.88,"51,705,580"
    "01/17/2017",10.17,10.23,9.78,9.82,"70,388,000"
    "01/13/2017",10.79,10.87,10.56,10.58,"38,344,340"
    "01/12/2017",10.98,11.0376,10.33,10.76,"75,178,900"
    "01/11/2017",11.39,11.41,11.15,11.2,"39,337,330"
    "01/10/2017",11.55,11.63,11.33,11.44,"29,122,540"
    "01/09/2017",11.37,11.64,11.31,11.49,"37,215,840"
    "01/06/2017",11.29,11.49,11.11,11.32,"34,437,560"
    "01/05/2017",11.43,11.69,11.23,11.24,"38,777,380"
    "01/04/2017",11.45,11.5204,11.235,11.43,"40,742,680"
    "01/03/2017",11.42,11.65,11.02,11.43,"55,114,820"
    "12/30/2016",11.7,11.78,11.25,11.34,"44,033,460"
    "12/29/2016",11.24,11.62,11.01,11.59,"50,180,310"
    "12/28/2016",12.28,12.42,11.46,11.55,"71,072,640"
    "12/27/2016",11.65,12.08,11.6,12.07,"44,168,130"
    

    The script wraps the dates and the thousands-separated numbers in double quotes.
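
Writing the rows to a file instead of printing them can be sketched with the standard `csv` module. The helper name `write_rows` and the sample row are assumptions made for illustration; in practice each row text would come from the table rows collected by the script above. Unlike the hand-written format string, `csv.writer` adds quotes only where a field needs them, so the comma-separated number is quoted but the date is not.

```python
import csv
import io


def write_rows(row_texts, out_file):
    """Write whitespace-separated row texts as CSV lines; return the row count."""
    writer = csv.writer(out_file, lineterminator="\n")
    count = 0
    for text in row_texts:
        fields = text.split()
        if len(fields) != 6:  # skip empty or malformed rows
            continue
        writer.writerow(fields)
        count += 1
    return count


# Sample input: one row taken from the output above, plus an empty row to skip.
sample = ["03/24/2017 14.16 14.18 13.54 13.7 50,022,400", ""]
buffer = io.StringIO()
written = write_rows(sample, buffer)
print(buffer.getvalue(), end="")
```

To write to disk, pass a real file instead of the in-memory buffer, for example `open("amd.csv", "w", newline="")`.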

    【Comments】:

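    • Perfect! Exactly what I was looking for. I don't know why I didn't think of taking the information from the table itself instead of focusing on downloading the file. Thanks.
    【Answer 3】: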

    Dig a little deeper and find out what the JS function getQuotes() does. That should give you a good clue.
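
One way to start digging is to pull the `javascript:` call out of the link and then look for the matching function definition in the page source. The sketch below uses only the standard library. The body of `getQuotes` in `SAMPLE_PAGE` is made up for illustration (the real function may also live in an external .js file), and the regular expression is a rough heuristic that stops at the first line beginning with `}`.

```python
import re
from html.parser import HTMLParser


class JsLinkFinder(HTMLParser):
    """Collect the id and call text of every <a> whose href is a javascript: URL."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        href = attributes.get("href") or ""
        if tag == "a" and href.startswith("javascript:"):
            self.links[attributes.get("id")] = href[len("javascript:"):]


def called_function(js_call):
    """Return the name of the function invoked by a snippet like 'f(true);'."""
    match = re.match(r"\s*([A-Za-z_$][\w$]*)\s*\(", js_call)
    return match.group(1) if match else None


def find_definition(page_source, name):
    """Heuristic: return text from 'function <name>(' to the first line starting with '}'."""
    pattern = r"function\s+" + re.escape(name) + r"\s*\(.*?\n\}"
    match = re.search(pattern, page_source, re.DOTALL)
    return match.group(0) if match else None


# Hypothetical page: only the <a> element is taken from the question.
SAMPLE_PAGE = """
<script>
function getQuotes(download) {
    // made-up body, for illustration only
    submitForm(download);
}
</script>
<a id="lnkDownLoad" href="javascript:getQuotes(true);">
    Download this file in Excel Format
</a>
"""

finder = JsLinkFinder()
finder.feed(SAMPLE_PAGE)
name = called_function(finder.links["lnkDownLoad"])
definition = find_definition(SAMPLE_PAGE, name)
print(name)
print(definition)
```

Reading the located function body usually shows which form fields or URL the click submits, and that is the request to reproduce in Python.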

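    If all of this seems too complicated, you can always use selenium. It is used to simulate a browser. However, it is much slower than making native network calls. See the official Selenium documentation for details.

    【Comments】: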
