【Question title】: How to check if a URL is downloadable in requests
【Posted】: 2021-01-19 18:19:17
【Question】:

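I am building a downloader app with tkinter and requests, and I recently found a bug in my program. Basically, I want the program to check whether a given URL is downloadable before it starts downloading the URL's content. I used to do this by fetching the URL's headers and checking whether 'Content-Length' exists. That works for some URLs (such as https://www.google.com), but for others (such as a link to a YouTube video) it does not, and it crashes my program. I saw someone on Stack Overflow say that I could check for 'attachment' in the 'Content-Disposition' header, but that did not work for me: it returned the same result for downloadable and non-downloadable URLs. What is the best way to do this?

The code mentioned in another Stack Overflow question, which I tried and which did not work: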

import requests
url = 'https://www.google.com'
headers=requests.head(url).headers
downloadable = 'attachment' in headers.get('Content-Disposition', '')

My previous code:

headers = requests.head(url, headers={'accept-encoding': ''}).headers
try:
    print(type(headers['Content-Length']))
    file_size = int(headers['Content-Length'])
except KeyError:
    # Just a class that I defined to raise an exception if the URL was not downloadable
    raise NotDownloadable()

Update: the URL https://aspb1.cdn.asset.aparat.com/aparat-video/a5e07b7f62ffaad0c104763c23d7393215613675-360p.mp4?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjUzMGU0Mzc3ZjRlZjVlYWU0OTFkMzdiOTZkODgwNGQ2IiwiZXhwIjoxNjExMzMzMDQxLCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.FjMi_dkdLCUkt25dfGqPLcehpaC32dBBUNDC9cLNiu0 is the one I use for testing. If you open the URL, it takes you straight to a video that you can download, but checking 'Content-Disposition' returns None, as it does for most of the downloadable and non-downloadable URLs I have tried.

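【Question comments】:

  • Can you provide some URLs to test with? Then we can debug the actual scenario.
  • There is only one way to be sure that you can download a file: try to download it. If that crashes your program, you have to fix the crash.
  • I looked at your previous question about downloading. Which types of files are you interested in downloading?
  • @Lifeiscomplex Whatever the URL contains. It could be docx, csv, html, or anything else.
  • @OmidKetabollahi Thanks. If the end user can download anything, why do you need to check the page headers?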

Tags: python python-3.x url download python-requests


【Solution 1】:

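Request for Comments (RFC) 6266 defines the Content-Disposition header field for HTTP. The earlier HTTP/1.1 specification, RFC 2616, described the header this way:

Content-Disposition is not part of the HTTP standard, but since it is widely implemented, we are documenting its use and risks for implementors.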

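Because the Content-Disposition header is not always available, you can use a solution that looks not only for that specific header but also at the individual file types in the Content-Type header.

Here is a list of Content-Types.

The code below checks for the Content-Disposition header, and it also checks the Content-Type header for some types that are usually downloadable.

I also added a check for Content-Length, because it may be useful for chunking the file being downloaded.

Have you considered creating download subfolders?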

  • download_folder/text_files
  • download_folder/pdf_files

  • download_folder/01242021/text_files
  • download_folder/01242021/pdf_files
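
The folder layout above could be derived from the Content-Type header. The following is only a sketch under assumptions: the `SUBFOLDERS` mapping, the `other_files` fallback, and the function name `target_folder` are hypothetical, and the date format (month, day, year) is inferred from the `01242021` example.

```python
from datetime import date
from pathlib import Path

# Hypothetical mapping from media type to subfolder name.
SUBFOLDERS = {
    'text/plain': 'text_files',
    'application/rtf': 'text_files',
    'application/pdf': 'pdf_files',
}


def target_folder(content_type, root='download_folder', day=None):
    """Return the folder a download with this Content-Type would go into."""
    # Keep only the media type, dropping parameters such as "; charset=utf-8".
    media_type = (content_type or '').split(';')[0].strip().lower()
    folder = Path(root)
    if day is not None:
        # Dated variant, formatted as MMDDYYYY like the example above.
        folder = folder / day.strftime('%m%d%Y')
    # Unknown or missing types fall back to a catch-all folder (an assumption).
    return folder / SUBFOLDERS.get(media_type, 'other_files')
```

Calling `target_folder(headers.get('Content-Type'), day=date.today())` and then `mkdir(parents=True, exist_ok=True)` on the result would create the dated folder before the file is written.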

import requests

urls = ['https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2019-financial'
        '-year-provisional/Download-data/annual-enterprise-survey-2019-financial-year-provisional-csv.csv',
        'http://www.pdf995.com/samples/pdf.pdf', 'https://jeroen.github.io/files/sample.rtf',
        'https://www.cnn.com/2021/01/23/opinions/biden-climate-change-gillette-wyoming-coal-sutter/index.html',
        'https://www.google.com',
        'https://thumbs-prod.si-cdn.com/d4e3zqOM5KUq8m0m-AFVxuqa5ZM=/800x600/filters:no_upscale():focal(554x699:555x700)/https://public-media.si-cdn.com/filer/a4/04/a404c799-7118-459a-8de4-89e4a44b124f/img_1317.jpg',
        'https://www.blank.org']

compression_formats = ['application/gzip', 'application/vnd.rar', 'application/x-7z-compressed',
                       'application/zip', 'application/x-tar']
image_formats = ['image/bmp', 'image/gif', 'image/jpeg', 'image/png', 'image/svg+xml', 'image/tiff',
                 'image/webp']
text_formats = ['application/rtf', 'text/plain']

for url in urls:
    # requests.head() does not follow redirects by default, so ask for it explicitly
    headers = requests.head(url, allow_redirects=True).headers
    # response headers are a case-insensitive mapping, so .get() replaces the manual key scan
    content_size = headers.get('Content-Length', 'The content size was not available.')

    # the header name uses a hyphen; 'Content_Disposition' would never match
    if 'Content-Disposition' in headers:
        # do something with the file
        pass
    else:
        # keep only the media type, dropping parameters such as "; charset=utf-8"
        content_type = headers.get('Content-Type', '').split(';')[0].strip().lower()

        if content_type in compression_formats:
            print('Compressed file')
            print(content_size)
        elif content_type in image_formats:
            print('Image file')
            print(content_size)
        elif content_type in text_formats:
            print('Text file')
            print(content_size)
        elif content_type == 'application/pdf':
            print('PDF file')
            print(content_size)
        elif content_type == 'text/csv':
            print('CSV File')
            print(content_size)
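
On the Content-Length check: one way the size could help with chunking is to split the download into HTTP Range requests. The sketch below is only an illustration, assuming the server honors Range headers (it would typically advertise `Accept-Ranges: bytes` and answer with status 206). `byte_ranges` and `range_header` are hypothetical helper names, not part of requests.

```python
def byte_ranges(total_size, chunk_size):
    """Yield inclusive (start, end) byte offsets that cover total_size bytes."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError('total_size must be >= 0 and chunk_size must be > 0')
    for start in range(0, total_size, chunk_size):
        # HTTP byte ranges are inclusive on both ends.
        end = min(start + chunk_size, total_size) - 1
        yield start, end


def range_header(start, end):
    """Build the request header for one inclusive byte range."""
    return {'Range': f'bytes={start}-{end}'}
```

Each dictionary returned by `range_header` could be passed as `headers=` to `requests.get`, with `total_size` taken from `int(headers['Content-Length'])` when that header is present.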

Here is another version that uses functions:

import requests

urls = ['https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2019-financial'
        '-year-provisional/Download-data/annual-enterprise-survey-2019-financial-year-provisional-csv.csv',
        'http://www.pdf995.com/samples/pdf.pdf', 'https://jeroen.github.io/files/sample.rtf',
        'https://www.cnn.com/2021/01/23/opinions/biden-climate-change-gillette-wyoming-coal-sutter/index.html',
        'https://www.google.com',
        'https://thumbs-prod.si-cdn.com/d4e3zqOM5KUq8m0m-AFVxuqa5ZM=/800x600/filters:no_upscale():focal(554x699:555x700)/https://public-media.si-cdn.com/filer/a4/04/a404c799-7118-459a-8de4-89e4a44b124f/img_1317.jpg',
        'https://www.blank.org']


def query_headers(webpage):
    response = requests.get(webpage, stream=True)
    headers = response.headers
    file_name = webpage.rsplit('/', 1)[-1]

    # the header name uses a hyphen; 'Content_Disposition' would never match
    if 'Content-Disposition' in headers:
        # do something with the file
        pass
    else:
        # keep only the media type, dropping parameters such as "; charset=utf-8"
        content_type = headers.get('Content-Type', '').split(';')[0].strip().lower()

        compression_formats = ['application/gzip', 'application/vnd.rar', 'application/x-7z-compressed',
                               'application/zip', 'application/x-tar']
        image_formats = ['image/bmp', 'image/gif', 'image/jpeg', 'image/png', 'image/svg+xml', 'image/tiff',
                         'image/webp']
        text_formats = ['application/rtf', 'text/plain']

        if content_type in compression_formats:
            file_type = 'Compressed file'
        elif content_type in image_formats:
            file_type = 'Image file'
        elif content_type in text_formats:
            file_type = 'Text file'
        elif content_type == 'application/pdf':
            file_type = 'PDF file'
        elif content_type == 'text/csv':
            file_type = 'CSV file'
        elif content_type == 'text/html':
            file_type = 'HTML file'
        else:
            file_type = None

        content_size = get_content_size(headers)
        if file_type is None:
            # unknown type: report it, but do not download anything
            return f'File Information: file_type:  no file type found, File size: {content_size}, File name: {file_name}'

        download_file(file_name, response)
        return f'File Information: file_type: {file_type}, File size: {content_size}, File name: {file_name}'


def get_content_size(headers):
    # response headers are case-insensitive; report 0 when the size is unknown
    return int(headers.get('Content-Length', 0))


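def download_file(filename, file_stream):
    # write the streamed response in chunks instead of loading it all into memory
    with open(filename, 'wb') as f:
        for chunk in file_stream.iter_content(chunk_size=8192):
            f.write(chunk)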


for url in urls:
    download_info = query_headers(url)
    print(download_info)

# output
# File Information: file_type: CSV file, File size: 253178, File name: annual-enterprise-survey-2019-financial-year-provisional-csv.csv
# File Information: file_type: PDF file, File size: 433994, File name: pdf.pdf
# File Information: file_type: Text file, File size: 9636, File name: sample.rtf
# File Information: file_type: HTML file, File size: 185243, File name: index.html
# File Information: file_type: HTML file, File size: 0, File name: www.google.com
# File Information: file_type: Image file, File size: 78868, File name: img_1317.jpg
# File Information: file_type: HTML file, File size: 170, File name: www.blank.org


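【Solution 2】:

If the URL does not contain the file name, Content-Disposition provides it. But this information is not always present, as is the case with your URL. One solution is to filter by content type; see the example below. If you want to download only specific content types, such as video/mp4, you can add a filter for them.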

import requests

url = 'https://aspb1.cdn.asset.aparat.com/aparat-video/a5e07b7f62ffaad0c104763c23d7393215613675-360p.mp4?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjUzMGU0Mzc3ZjRlZjVlYWU0OTFkMzdiOTZkODgwNGQ2IiwiZXhwIjoxNjExMzMzMDQxLCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.FjMi_dkdLCUkt25dfGqPLcehpaC32dBBUNDC9cLNiu0'
headers = requests.head(url, allow_redirects=True).headers
# fall back to an empty string so a missing header does not raise AttributeError
content_type = headers.get('content-type', '').lower()

if 'text' in content_type:
    downloadable = False
elif 'html' in content_type:
    downloadable = False
else:
    downloadable = True

print(downloadable)
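
On the file name point made in this answer: a possible way to pick a name is to read the `filename` parameter of Content-Disposition when it is present and otherwise fall back to the last path segment of the URL. This is a sketch, not the answerer's code; `filename_from_headers` is a hypothetical helper, and it leans on the standard library's `email.message.Message` to parse the header.

```python
import posixpath
from email.message import Message
from urllib.parse import unquote, urlsplit


def filename_from_headers(url, headers):
    """Pick a file name from Content-Disposition, else from the URL path."""
    disposition = headers.get('Content-Disposition')
    if disposition:
        # Message parses header parameters, including quoted values.
        msg = Message()
        msg['Content-Disposition'] = disposition
        name = msg.get_filename()
        if name:
            # Keep only the final component so a header cannot inject a path.
            return posixpath.basename(name)
    # Fall back to the URL path; the query string is ignored by urlsplit().path.
    path = unquote(urlsplit(url).path)
    return posixpath.basename(path) or None
```

With a requests response, `filename_from_headers(url, response.headers)` would work the same way, since `response.headers` also supports `.get()`. A `None` result means the caller has to choose a name itself.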
    


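【Solution 3】:

I think your previous code can work with a small modification. As written, it tries to download the complete file, so it hangs every time it runs.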

import requests


class NotDownloadable(Exception):
    """Raised when the URL does not look downloadable."""


url = 'https://aspb1.cdn.asset.aparat.com/aparat-video/a5e07b7f62ffaad0c104763c23d7393215613675-360p.mp4?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjUzMGU0Mzc3ZjRlZjVlYWU0OTFkMzdiOTZkODgwNGQ2IiwiZXhwIjoxNjExMzMzMDQxLCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.FjMi_dkdLCUkt25dfGqPLcehpaC32dBBUNDC9cLNiu0'
# stream=True postpones downloading the body until it is actually read
r = requests.get(url, stream=True)

try:
    print(r.headers)
    file_size = int(r.headers["Content-Length"])
except KeyError:
    # Content-Length is missing, so treat the URL as not downloadable
    raise NotDownloadable()
      

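Use stream=True:

r = requests.get(url, stream=True)

With stream=True, requests fetches only the response headers and keeps the connection open; the body is not downloaded until you read Response.content or iterate over the response, for example with iter_content. The Requests documentation describes this under "Body Content Workflow". The server may or may not use chunked transfer encoding for the body; when it does, there is usually no Content-Length header at all.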
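
Once the headers look acceptable, the body of a streamed response can be written to disk piece by piece. The helper below is a sketch with a hypothetical name, `save_chunks`; it accepts any iterable of byte strings, on the assumption that the caller passes something like `r.iter_content(chunk_size=8192)`.

```python
def save_chunks(chunks, destination):
    """Write an iterable of byte chunks to destination; return bytes written."""
    written = 0
    with open(destination, 'wb') as f:
        for chunk in chunks:
            # Skip empty chunks, which a stream may occasionally produce.
            if chunk:
                f.write(chunk)
                written += len(chunk)
    return written
```

For example, `save_chunks(r.iter_content(chunk_size=8192), 'video.mp4')` would store the response from the snippet above without holding the whole file in memory. Comparing the return value with Content-Length, when that header exists, is one way to detect a truncated download.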


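【Solution 4】:

You can check the content-type response header. This header defines the media type of the requested resource. The most common types are listed here.

The content-type header has the form type "/" subtype. Some values also carry a parameter, in the form type "/" subtype ";" parameter, where the parameter is written as attribute "=" value. Parameters are optional, but the type and subtype are mandatory.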
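
The structure described above can be taken apart with a few string operations. This is a simplified sketch: `parse_content_type` is a hypothetical helper, and it does not handle quoted parameter values that themselves contain a semicolon.

```python
def parse_content_type(value):
    """Split a Content-Type value into (type, subtype, parameters)."""
    if not value:
        return None
    media_type, *raw_params = value.split(';')
    main, sep, sub = media_type.strip().lower().partition('/')
    if not sep or not main or not sub:
        # Type and subtype are both mandatory.
        return None
    params = {}
    for item in raw_params:
        key, sep, val = item.partition('=')
        if sep:
            # Attribute names are case-insensitive; values keep their case.
            params[key.strip().lower()] = val.strip().strip('"')
    return main, sub, params
```

For instance, `parse_content_type(response.headers.get('content-type'))` would give a tuple whose first element can be compared with `'image'` or `'video'`.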

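RFC 1341 defined seven top-level types:

text, multipart, application, message, image, audio, video

The header value you look for depends on the resource you expect, but here are some examples you can use.

Examples

Downloading an image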

import requests

url = 'https://www.example.com/picture.png'  # placeholder; replace with the URL you want to check

response = requests.head(url)
response_headers = response.headers
# default to "" so a missing header does not raise AttributeError,
# and drop parameters such as "; charset=utf-8" before comparing
response_content_type = response_headers.get("content-type", "").split(";")[0].strip().lower()

# you could use this code to search for all images using just the type

is_image = response_content_type.split("/")[0] == "image"

# alternatively you could specify your expected content-types including the subtype

CONTENT_TYPES = ["image/gif", "image/jpeg", "image/png", "image/tiff", "image/svg+xml"]  # extend as needed

is_image = response_content_type in CONTENT_TYPES

if is_image:
    # code to download image
    pass
        

This code can easily be adapted to different types and subtypes.
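
One way to adapt the check to other types is to accept wildcard patterns such as `image/*` next to exact media types. The sketch below uses `fnmatch` from the standard library; `matches_any` is a hypothetical helper and the patterns are only examples.

```python
from fnmatch import fnmatchcase


def matches_any(content_type, patterns):
    """Return True if the media type matches one of the patterns."""
    # Compare only the media type, ignoring parameters and letter case.
    media_type = (content_type or '').split(';')[0].strip().lower()
    if not media_type:
        return False
    return any(fnmatchcase(media_type, pattern.lower()) for pattern in patterns)
```

For example, `matches_any(response_headers.get("content-type"), ["image/*", "video/mp4"])` would accept any image as well as MP4 video.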

Note

Keep in mind that the types are fixed: you cannot define a new type, but you can define a new subtype.
