How to improve Dask performance for parsing JSON from S3

【Question title】: How to improve Dask performance for parsing json from s3
【Posted】: 2017-10-11 18:08:04
【Question description】:

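I am benchmarking a simple script that loads data from S3 and parses the JSON content. I thought Dask might be faster for this kind of task. However, the Dask script I am using appears to be much slower than the baseline Ruby script.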

Here is the script:

import time

import dask.bag as db
import ujson
from dask.distributed import Client
from s3fs import S3FileSystem

fs = S3FileSystem(anon=False)
client = Client()  # with no arguments this starts a local cluster

target_id = 2
target_path = "s3://bucket/log/2014/07/%d/"
# List the sub-directories for days 10 through 20, then build one glob per directory.
target_path_dirs = [fs.ls(target_path % x) for x in range(10, 21)]
target_paths = ['s3://' + x + "/*.json" for x in sum(target_path_dirs, [])]

t0 = time.time()
# Read every file line by line and parse each line as JSON.
records = db.read_text(target_paths).map(ujson.loads)
filtered_records = records.filter(lambda x: x['id'] == target_id)

r_c = filtered_records.compute()
t1 = time.time()
total = t1 - t0
print(total)

The Ruby script uses aws s3 cp --recursive to download the files and then parses the JSON files. It takes only 3 minutes, and the final file is 1.5 MB. What could be the problem?

I am running this script on a single machine, but it has 8 cores and 32 GiB of RAM, and all cores appear to be busy while the Dask script runs.

【Comments】:

Tags: amazon-s3 dask

【Solution 1】:

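It seems the network is the bottleneck. I downloaded the data first, and parsing it then took 8 seconds, which is much faster than the 2 minutes the script takes to parse.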
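
One way to check this kind of claim is to time the parse stage on its own, once the files are already on local disk (for example after the aws s3 cp --recursive step mentioned in the question). The following is only a minimal sketch under stated assumptions: the function names parse_local_records and timed are hypothetical, the files are assumed to hold one JSON object per line, and the standard-library json module stands in for ujson. It is not the answerer's code.

```python
import glob
import json
import os
import time


def parse_local_records(local_dir, target_id):
    """Parse each line of every *.json file under local_dir; keep records whose id matches."""
    matches = []
    # "**" with recursive=True also matches files directly inside local_dir.
    pattern = os.path.join(local_dir, "**", "*.json")
    for path in glob.glob(pattern, recursive=True):
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                if record.get("id") == target_id:
                    matches.append(record)
    return matches


def timed(func, *args):
    """Call func(*args) once and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - t0
```

Calling timed(parse_local_records, "/path/to/downloaded/log", 2), where the directory is a placeholder for wherever the files were copied, gives the parse-only time. If that number is small compared with the end-to-end time of the S3-backed script, the difference is most plausibly spent on network transfer rather than on JSON parsing.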

【Discussion】: