【Question title】: Cannot find the desired link for download (Python BeautifulSoup)
【Posted】: 2017-06-28 17:33:14
【Question】:

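I am new to Python's Beautiful Soup, and I don't know much about HTML or JS. I tried to use bs4 to download all the xls files on this page, but bs4 does not seem to find the links under the "attachment" section. Can someone help me?

My current code is: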

"""
Scraping of all county-level raw data from
http://www.countyhealthrankings.org for all years. Data stored in RawData
folder.
Code modified from
https://null-byte.wonderhowto.com/how-to/download-all-pdfs-webpage-with-python-script-0163031/
"""

from bs4 import BeautifulSoup 
import urlparse
import urllib2
import os
import sys

"""
Get all links
"""
def getAllLinks(url):
    page=urllib2.urlopen(url)
    soup = BeautifulSoup(page.read(),"html.parser")
    links = soup.find_all('a', href=True)
    return links

def download(links):
    for link in links:
        #raw_input("Press Enter to continue...")
        #print link
        #print "------------------------------------"
        #print os.path.splitext(os.path.basename(link['href']))
        #print "------------------------------------"
        #print os.path.splitext(os.path.basename(link['href']))[1]
        suffix = os.path.splitext(os.path.basename(link['href']))[1]
        if suffix == '.xls':
            print link #cannot find anything
            # urlopen needs the URL string, not the Tag object
            currentLink = urllib2.urlopen(link['href'])

links = getAllLinks("http://www.countyhealthrankings.org/app/iowa/2017/downloads")
download(links)

(By the way, the links I want look like this.)

Thanks!

【Question discussion】:

    Tags: web-scraping beautifulsoup data-science


    【Solution 1】:

    This looks like one of those tasks that BeautifulSoup cannot do, at least on its own. However, you can do it with selenium.

    >>> from selenium import webdriver
    >>> driver = webdriver.Chrome()
    >>> driver.get('http://www.countyhealthrankings.org/app/iowa/2017/downloads')
    >>> links = driver.find_elements_by_xpath('.//span[@class="file"]/a')
    >>> len(links)
    30
    >>> for link in links:
    ...     link.get_attribute('href')
    ...     
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/CHR2017_IA.pdf'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2017%20County%20Health%20Rankings%20Iowa%20Data%20-%20v1.xls'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2017%20Health%20Outcomes%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2017%20Health%20Factors%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/CHR2016_IA.pdf'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2016%20County%20Health%20Rankings%20Iowa%20Data%20-%20v3.xls'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2016%20Health%20Outcomes%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2016%20Health%20Factors%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/CHR2015_IA.pdf'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2015%20County%20Health%20Rankings%20Iowa%20Data%20-%20v3.xls'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2015%20Health%20Outcomes%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2015%20Health%20Factors%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/CHR2014_IA_v2.pdf'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2014%20County%20Health%20Rankings%20Iowa%20Data%20-%20v6.xls'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2014%20Health%20Outcomes%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2014%20Health%20Factors%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/states/CHR2013_IA.pdf'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2013%20County%20Health%20Ranking%20Iowa%20Data%20-%20v1_0.xls'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2013%20Health%20Outcomes%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2013%20Health%20Factors%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/states/CHR2012_IA.pdf'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2012%20County%20Health%20Ranking%20Iowa%20Data%20-%20v2.xls'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2012%20Health%20Outcomes%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2012%20Health%20Factors%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/states/CHR2011_IA.pdf'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2011%20County%20Health%20Ranking%20Iowa%20Data%20-%20v2.xls'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2011%20Health%20Outcomes%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2011%20Health%20Factors%20-%20Iowa.png'
    'http://www.countyhealthrankings.org/sites/default/files/states/CHR2010_IA_0.pdf'
    'http://www.countyhealthrankings.org/sites/default/files/state/downloads/2010%20County%20Health%20Ranking%20Iowa%20Data%20-%20v2.xls'
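
The session above was recorded in 2017. The `find_elements_by_xpath` helper it uses was deprecated and later removed in Selenium 4, where the equivalent call is `find_elements(By.XPATH, ...)`. The following is a minimal sketch of the same lookup in Python 3, assuming Selenium 4 and a working Chrome driver. The function names `is_xls`, `filename_from_url`, and `collect_hrefs` are hypothetical and were not part of the original answer.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def is_xls(url):
    """Return True if the URL's path ends in .xls (any query string is ignored)."""
    path = urlsplit(url).path
    return posixpath.splitext(path)[1].lower() == ".xls"


def filename_from_url(url):
    """Decode %20-style escapes and return the last component of the URL path."""
    return posixpath.basename(unquote(urlsplit(url).path))


def collect_hrefs(page_url):
    """Render the page in Chrome and return the href of every attachment link."""
    # Imported lazily so the two helpers above can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(page_url)
        # Same XPath as in the answer, written with the Selenium 4 locator API.
        links = driver.find_elements(By.XPATH, './/span[@class="file"]/a')
        hrefs = [link.get_attribute("href") for link in links]
        # Drop anchors that have no href attribute.
        return [href for href in hrefs if href]
    finally:
        driver.quit()
```

Calling `collect_hrefs` on the downloads page and keeping only the results for which `is_xls` returns true should leave the spreadsheet URLs from the listing above; `filename_from_url` then gives a readable local file name for each one.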
    

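    【Discussion】:

    • Thanks, Bill. This seems to work! Just curious, do you know why BeautifulSoup does not work well in this case?
    • That should have been in my answer. I was suspicious because your code looked fine. I tried using BeautifulSoup to find all the links on the page and print their hrefs. None of them were the ones we wanted, which suggests that the page most likely uses Ajax to load its own content. That is actually the norm these days. You can still use BeautifulSoup, but you usually have to load the page's DOM first with something like selenium. BeautifulSoup cannot process content that is not in the HTML it is given.
    • Hmm, OK. Thanks :))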
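
As the second comment notes, BeautifulSoup can still do the parsing once a browser has built the DOM. The sketch below illustrates that split under stated assumptions: the HTML string is a hand-written stand-in for what Selenium's `driver.page_source` might return after the scripts have run, the `span class="file"` structure is taken from the XPath in the answer, and the function name `xls_links` is hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

from bs4 import BeautifulSoup


def xls_links(html, base_url):
    """Return absolute URLs of every <a href> in `html` whose path ends in .xls."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for a in soup.find_all("a", href=True):
        # Resolve relative hrefs against the page URL before checking the suffix.
        url = urljoin(base_url, a["href"])
        if posixpath.splitext(urlsplit(url).path)[1].lower() == ".xls":
            found.append(url)
    return found


# Assumption: stand-in for driver.page_source; the link text is invented.
rendered = """
<span class="file"><a href="/sites/default/files/state/downloads/CHR2017_IA.pdf">Report</a></span>
<span class="file"><a href="/sites/default/files/state/downloads/2017%20County%20Health%20Rankings%20Iowa%20Data%20-%20v1.xls">Data</a></span>
"""
base = "http://www.countyhealthrankings.org/app/iowa/2017/downloads"
print(xls_links(rendered, base))
```

Running the same function on the HTML fetched without a browser would be expected to return an empty list for this page, which matches what the question observed.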