File (.tar.gz) download and processing using urllib and the requests package (Python): answers

【Question title】: File (.tar.gz) download and processing using urllib and the requests package (Python)
【Posted】: 2019-12-06 00:18:43
【Question description】:

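Scope: which library to use, urllib or requests? I am trying to download a log file that is available at a URL. The URL is hosted on AWS and contains the file name. When the URL is accessed, it serves a .tar.gz file for download. I need to download this file into a directory of my choice, decompress and untar it to reach the JSON file inside, and finally parse that JSON file. Searching the internet, I found only scattered pieces of information. With this question I try to consolidate them in one place.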

【Question discussion】:

Tags: django python-3.x python-requests urllib contextmanager

【Solution 1】:

Using the requests library: a PyPI package that is generally considered the superior choice for handling HTTP requests.

Code:

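import json
import os
import re
import shutil
import tarfile
import tempfile
import urllib.request

import requests

# respurl is the URL the file is downloaded from (see the example URL below).
with requests.get(respurl, stream=True) as response:
    # stream=True is required by iter_content() below.
    # Fail early on an HTTP error instead of saving an error page as the archive.
    response.raise_for_status()
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        # Write straight to the temporary file; reopening it by name is unnecessary.
        for chunk in response.iter_content(chunk_size=128):
            tmp_file.write(chunk)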

with tarfile.open(tmp_file.name, "r:gz") as tf:
    # Save the extracted files in the directory of choice, named after the downloaded file.
    tf.extractall(path)
    # Loop over the archive members and parse the JSON file(s) inside the tar.gz.
    for tarinfo_member in tf:
        print("tarfilename", tarinfo_member.name, "is", tarinfo_member.size, "bytes in size and is", end="")
        if tarinfo_member.isreg():
            print(" a regular file.")
        elif tarinfo_member.isdir():
            print(" a directory.")
        else:
            print(" something else.")
        if os.path.splitext(tarinfo_member.name)[1] == ".json":
            print("json file name:", os.path.splitext(tarinfo_member.name)[0])
            json_file = tf.extractfile(tarinfo_member)
            # Read the JSON file's contents for further processing.
            content = json_file.read()
            json_file_data = json.loads(content)
            print("Status Code", json_file_data[0]['status_code'])
            print("Response Body", json_file_data[0]['response'])
            # The response body is itself a JSON string (double encoded), so decode it again.
            print("Errors:", json.loads(json_file_data[0]['response'])['errors'])
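The JSON-parsing loop above can also run without unpacking the archive to disk. The following is a minimal sketch, assuming (as in the answer) that each JSON file holds a list of records whose `response` field is itself a JSON-encoded string. The helper names `iter_json_members` and `build_sample_archive`, and the sample data, are hypothetical and exist only for illustration.

```python
import io
import json
import os
import tarfile
import tempfile


def iter_json_members(archive_path):
    """Yield (member name, parsed JSON) for every .json file in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tf:
        for member in tf:
            if member.isreg() and os.path.splitext(member.name)[1] == ".json":
                # extractfile() returns a binary file object for regular files.
                with tf.extractfile(member) as json_file:
                    yield member.name, json.loads(json_file.read())


def build_sample_archive(archive_path):
    """Create a tiny archive shaped like the one in the answer (made-up data)."""
    record = [{"status_code": 400, "response": json.dumps({"errors": ["bad input"]})}]
    payload = json.dumps(record).encode("utf-8")
    info = tarfile.TarInfo(name="sample-response.json")
    info.size = len(payload)
    with tarfile.open(archive_path, "w:gz") as tf:
        tf.addfile(info, io.BytesIO(payload))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        archive = os.path.join(tmp_dir, "sample-response.tar.gz")
        build_sample_archive(archive)
        for name, data in iter_json_members(archive):
            # The 'response' field is a JSON string, so it is decoded a second time.
            errors = json.loads(data[0]["response"])["errors"]
            print(name, data[0]["status_code"], errors)
```

Reading members through `extractfile()` is enough when only the JSON content is needed; `extractall()` is only required if the files must also exist on disk.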

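To save the extracted files in a directory of choice named after the downloaded file, the variable 'path' is built as follows.

The example URL below contains the file name "44301621eb-response.tar.gz":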

https://yoursite.com/44301621eb-response.tar.gz?AccessKeyId=your_id&Expires=1575526260&Signature=you_signature

BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
PROJECT_NAME = 'your_project_name'
PROJECT_ROOT = os.path.join(BASE_DIR, PROJECT_NAME)
LOG_ROOT = os.path.join(PROJECT_ROOT, 'log')
# respurl is the URL from which the file will be downloaded.
# A raw string keeps the regex escape "\?" from being treated as a string escape.
filename = re.split(r"([^?]+)(?:.+/)([^#?]+)(\?.*)?", respurl)
# filename[2] is the second capture group: the file name without the query string.
path = os.path.join(LOG_ROOT, filename[2])
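The regular expression above can be hard to read. As an alternative, and not the approach used in the answer, the file name can be taken from the URL path with the standard library's `urllib.parse`. This is a minimal sketch; the helper name `filename_from_url` is hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    url_path = urlsplit(url).path
    return posixpath.basename(unquote(url_path))


if __name__ == "__main__":
    respurl = (
        "https://yoursite.com/44301621eb-response.tar.gz"
        "?AccessKeyId=your_id&Expires=1575526260&Signature=you_signature"
    )
    print(filename_from_url(respurl))  # 44301621eb-response.tar.gz
```

Because the query string is split off before the path is examined, a `/` inside the signature does not affect the result.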

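Regex match output from regex101.com

Comparison with urllib

To understand the nuances, I also implemented the same code using urllib.

Note that the tempfile library is used slightly differently here, which is what worked for me. With urllib I had to use the shutil library, whereas with requests I could not use shutil's copyfileobj method, because the response objects returned by urllib and requests differ.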

with urllib.request.urlopen(respurl) as response:
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        # The urllib response is file-like, so it can be copied to the temp file directly.
        shutil.copyfileobj(response, tmp_file)

with tarfile.open(tmp_file.name, "r:gz") as tf:
    print("Temp tf File:", tf.name)
    tf.extractall(path)
    for tarinfo in tf:
        print("tarfilename", tarinfo.name, "is", tarinfo.size, "bytes in size and is", end="")
        if tarinfo.isreg():
            print(" a regular file.")
        elif tarinfo.isdir():
            print(" a directory.")
        else:
            print(" something else.")
        # Use the loop variable 'tarinfo' here; 'tarinfo_member' is not defined in this version.
        if os.path.splitext(tarinfo.name)[1] == ".json":
            print("json file name:", os.path.splitext(tarinfo.name)[0])
            json_file = tf.extractfile(tarinfo)
            # Read the JSON file's contents for further processing.
            content = json_file.read()
            json_file_data = json.loads(content)
            print("Status Code", json_file_data[0]['status_code'])
            print("Response Body", json_file_data[0]['response'])
            # The response body is itself a JSON string (double encoded), so decode it again.
            print("Errors:", json.loads(json_file_data[0]['response'])['errors'])
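Since `NamedTemporaryFile(delete=False)` leaves the downloaded archive on disk, it may be worth wrapping the urllib download in a function and removing the file once it has been processed. The following is a minimal sketch under that assumption. The helper name `download_to_tempfile` is hypothetical, and the demo uses a local `file://` URL as a stand-in for the real download link.

```python
import os
import pathlib
import shutil
import tempfile
import urllib.request


def download_to_tempfile(url, suffix=".tar.gz"):
    """Stream the body at `url` into a named temporary file and return its path."""
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp_file:
            # copyfileobj reads from the response in chunks, so the whole
            # download is never held in memory at once.
            shutil.copyfileobj(response, tmp_file)
    return tmp_file.name


if __name__ == "__main__":
    # Stand-in for the real download link: a local file reached via a file:// URL.
    with tempfile.TemporaryDirectory() as tmp_dir:
        source = pathlib.Path(tmp_dir, "sample.bin")
        source.write_bytes(b"example payload")
        archive_path = download_to_tempfile(source.as_uri(), suffix=".bin")
        try:
            with open(archive_path, "rb") as fh:
                print(fh.read())
        finally:
            # delete=False leaves the file on disk, so remove it explicitly.
            os.remove(archive_path)
```

With a real URL, the returned path can be passed to `tarfile.open(archive_path, "r:gz")` inside the `try` block, in the same way as `tmp_file.name` above.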

【Discussion】: