
【Question title】: Issue iterating through a list of possible URLs to download files
【Posted】: 2014-01-14 11:01:44
【Question description】:

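I've written a script using BeautifulSoup and urllib that iterates through a list of URLs and downloads items of certain file types.

I iterate through a list of URLs, creating a soup object from each one and parsing it for links.

The problem is that the links in the page source are sometimes written differently, even though every link I'm working with is on the same website. For example, a link might be '/dir/pdfs/file.pdf' or 'pdf/file.pdf' or '/pdfs/file.pdf'.

So if there's a full URL, urlretrieve() knows how to handle it, but if it's just a sub-directory like the ones listed above, it returns an error. I can of course follow the link manually from the page source, but urlretrieve() doesn't know what to do with it, so I have to add a base URL (such as www.example.com/ or www.example.com/dir/) to the urlretrieve() call.

I'm having trouble setting this up so that I can catch the failure and handle it myself.

Can anyone point me in the right direction?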

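from re import compile
from urllib import urlopen, urlretrieve  # Python 2 urllib
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4

URLs = []
BASEURL = []
FILETYPE = [r'\.pdf$', r'\.ppt$', r'\.pptx$', r'\.doc$',
            r'\.docx$', r'\.xls$', r'\.xlsx$', r'\.wmv$']

def main():
    # soup and types are module-level names set by the loop below
    for link in soup.findAll(href = compile(types)):
        file = link.get('href')
        filename = file.split('/')[-1]

        # This is the call that fails when the href is not a full URL
        urlretrieve(filename)
        print(file)

if __name__ == "__main__":
    for url in URLs:
        html_data = urlopen(url)
        soup = BeautifulSoup(html_data)

        for types in FILETYPE:
            main()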

【Comments on the question】:

Tags: python beautifulsoup urllib

【Solution 1】:

A better option is to build the correct absolute URL to begin with:

import urlparse
from re import compile
from urllib import urlopen, urlretrieve  # Python 2 urllib
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4

def main(soup, domain, path, types):
    for link in soup.findAll(href = compile(types)):
        file = link.get('href')

        # Make file URL absolute here
        if '://' not in file and not file.startswith('//'):
            if not file.startswith('/'):
                # The trailing slash stops urljoin from dropping the
                # last segment of path
                file = urlparse.urljoin(path + '/', file)
            file = urlparse.urljoin(domain, file)

        try:
            urlretrieve(file)
        except IOError:
            print('Error retrieving %s using URL %s' % (
                link.get('href'), file))

# URLs and FILETYPE are the lists defined in the question
for url in URLs:
    html_data = urlopen(url)
    soup = BeautifulSoup(html_data)

    urlinfo = urlparse.urlparse(url)
    domain = urlparse.urlunparse((urlinfo.scheme, urlinfo.netloc, '', '', '', ''))
    path = urlinfo.path.rsplit('/', 1)[0]

    for types in FILETYPE:
        main(soup, domain, path, types)

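The urlparse functions are used to split the source URL into two pieces: domain contains the URI scheme and the domain name, and path contains the "directory" of the target file on the server. For example: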

>>> url = "http://www.example.com/some/web/page.html"
>>> urlinfo = urlparse.urlparse(url)
>>> urlinfo
ParseResult(scheme='http', netloc='www.example.com',
            path='/some/web/page.html', params='', query='', fragment='')
>>> domain = urlparse.urlunparse((urlinfo.scheme, urlinfo.netloc, '', '', '', ''))
>>> domain
'http://www.example.com'
>>> path = urlinfo.path.rsplit('/', 1)[0]
>>> path
'/some/web'

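domain and path are then used as the base for each href encountered:

If the href contains "://" or starts with "//", it is assumed to be absolute: no change is needed.
Otherwise, if the href starts with "/", it is relative to the domain: prepend the domain.
Otherwise, the href is relative to the path: prepend the domain and the base path.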
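The three rules above can be sketched as a small pure function. This is a minimal sketch, not the answer's own code: it is written for Python 3 (where the urlparse module lives in urllib.parse), the names split_base and make_absolute are hypothetical, and the page URL is an assumed example. It uses plain string concatenation, so it does not resolve "../" segments.

```python
from urllib.parse import urlsplit, urlunsplit


def split_base(page_url):
    """Return (domain, path), e.g. ('http://host', '/some/web')."""
    parts = urlsplit(page_url)
    domain = urlunsplit((parts.scheme, parts.netloc, '', '', ''))
    path = parts.path.rsplit('/', 1)[0]
    return domain, path


def make_absolute(href, domain, path):
    """Apply the three rules to a single href (hypothetical helper)."""
    if '://' in href or href.startswith('//'):
        return href                      # rule 1: already absolute
    if href.startswith('/'):
        return domain + href             # rule 2: relative to the domain
    return domain + path + '/' + href    # rule 3: relative to the page's directory


# Assumed example page; the hrefs are the three forms from the question.
domain, path = split_base('http://www.example.com/dir/index.html')
for href in ('/dir/pdfs/file.pdf', 'pdf/file.pdf', '/pdfs/file.pdf'):
    print(make_absolute(href, domain, path))
```

Keeping the rules in a function that does no network access makes them easy to check against the href forms a site actually uses before any download is attempted.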

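【Comments】:

This works well, thanks for the tip, except that I needed to make sure to add the filename variable to the urlretrieve arguments so that the file is saved to my directory. Could you explain what is actually happening in the domain variable? Thanks again.
@asdoylejr I've added some details about the domain/path construction; does that help?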
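The first comment mentions passing a filename to urlretrieve so that the file lands in a chosen location. A minimal sketch of deriving that filename from the absolute URL follows, assuming Python 3; local_filename is a hypothetical helper, and the download call is left commented out because it would hit the network.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def local_filename(file_url):
    """Return the last path segment of a URL, ignoring query and fragment."""
    return posixpath.basename(unquote(urlsplit(file_url).path))


print(local_filename('http://www.example.com/dir/pdfs/file.pdf'))

# Hypothetical usage; this would perform a real download:
# from urllib.request import urlretrieve
# url = 'http://www.example.com/dir/pdfs/file.pdf'
# urlretrieve(url, local_filename(url))
```

Taking the name from the parsed path rather than from the raw string keeps a query string such as "?v=2" out of the saved filename.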

【Solution 2】:

Assuming the download method downloads the file and returns True on success and False on failure, this loops through all the possible file paths given by urls and files.

def download(url, file):
    print(url + file)
    # Assume the download failed and return False, so that this demo
    # loops through every candidate.
    return False

def main():
    urls = ["example.com/", "example.com/docs/", "example.com/dir/docs/", "example.com/dir/doocs/files/"]

    files = ["file1.pdf", "file2.pdf", "file3.pdf"]

    for file in files:
        for url in urls:
            success = download(url, file)
            if success:
                break


main()
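To see the same loop stop at the first base that works, here is a minimal sketch for Python 3 with the downloader passed in as a parameter. The names first_working_url and fake_fetch are hypothetical, and fake_fetch is only a stand-in that pretends a single URL exists; a real run could pass urllib.request.urlretrieve instead, whose URLError and HTTPError are subclasses of OSError.

```python
from urllib.parse import urljoin


def first_working_url(href, bases, fetch):
    """Return the first candidate URL that fetch() accepts, or None."""
    for base in bases:
        candidate = urljoin(base, href)
        try:
            fetch(candidate)
        except OSError:
            continue        # this base did not work, try the next one
        return candidate
    return None


def fake_fetch(url):
    # Hypothetical stand-in for a downloader: only one URL "exists".
    if url != 'http://www.example.com/dir/pdf/file.pdf':
        raise OSError('not found: ' + url)


bases = ['http://www.example.com/', 'http://www.example.com/dir/']
print(first_working_url('pdf/file.pdf', bases, fake_fetch))
```

Each wrong guess costs one request, which is why building the correct absolute URL up front is usually the cheaper option.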

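【Comments】:

【Solution 3】:

You need to catch the exception and try the next base URL. That said, you can also try to make the links absolute before making the request. I believe this is the best approach, since it avoids a lot of unnecessary requests. lxml has a handy make_links_absolute() function for this purpose.

Also take a look at urlparse.urljoin. Continuing with the approach you are already using...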

import urllib
import urlparse
from urllib import urlopen

# url, types, compile and BeautifulSoup are as in the question
html_data = urlopen(url)
soup = BeautifulSoup(html_data)
for link in soup.findAll(href = compile(types)):
    file = link.get('href')
    for domain in (url, 'http://www.one.com', 'http://www.two.com'):
        path = urlparse.urljoin(domain, file)
        try:
            # fetch the joined URL, not the page URL
            req = urllib.urlretrieve(path)
            break  # stop trying new domains
        except IOError:
            print('Error downloading {0}'.format(path))
            # will go to the next domain
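Joining against the page URL itself is often enough to resolve every href form from the question. The following is a short sketch for Python 3, where urljoin lives in urllib.parse; the page URL is an assumed example.

```python
from urllib.parse import urljoin

# Assumed example: the page on which the links were found.
page = 'http://www.example.com/dir/index.html'

for href in ('/dir/pdfs/file.pdf', 'pdf/file.pdf', '/pdfs/file.pdf',
             'http://other.example.org/a.pdf'):
    print(href, '->', urljoin(page, href))

# A base without a trailing slash loses its last path segment:
print(urljoin('http://www.example.com/dir', 'pdf/file.pdf'))
print(urljoin('http://www.example.com/dir/', 'pdf/file.pdf'))
```

The last two lines show the one detail that tends to surprise: urljoin treats the final segment of the base as a file name unless the base ends with "/".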

If I were doing this with lxml, it would look something like this:

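import lxml.html
from urllib import urlopen, urlretrieve  # Python 2 urllib

req = urlopen(url)
html = req.read()
root = lxml.html.fromstring(html)
root.make_links_absolute(url)  # resolve every link against the page URL
for a in root.iterlinks():
    # iterlinks() yields (element, attribute, link, pos)
    if a[2].endswith('pdf'):
        # download link ending with pdf
        req = urlretrieve(a[2])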
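Where lxml is not available, a similar result can be approximated with the standard library alone. This is a minimal sketch for Python 3 that swaps in html.parser and urljoin for lxml; the class name LinkCollector, the sample HTML, and the page URL are all assumptions, and it only looks at the href of <a> tags and ignores any <base> element.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect every <a href> in a page as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(urljoin(self.base_url, href))


# Assumed sample markup and page URL, for illustration only.
html = ('<a href="/pdfs/a.pdf">A</a> '
        '<a href="docs/b.doc">B</a> '
        '<a href="c.html">C</a>')
collector = LinkCollector('http://www.example.com/dir/index.html')
collector.feed(html)

# Keep only the file types of interest before downloading anything.
wanted = [u for u in collector.links if u.lower().endswith(('.pdf', '.doc'))]
print(wanted)
```

Each URL in wanted is already absolute, so it could be handed to a downloader without trying alternative base URLs.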

【Comments】: