【发布时间】:2021-01-03 16:12:14
【问题描述】:
我正在尝试使用 len(dataframe[column]) 查找 dask 数据帧的长度,但每次我尝试执行此操作时都会出错:
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
Traceback (most recent call last):
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\multiprocessing\queues.py", line 238, in _feed
send_bytes(obj)
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\multiprocessing\connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\multiprocessing\connection.py", line 280, in _send_bytes
ov, err = _winapi.WriteFile(self._handle, buf, overlapped=True)
BrokenPipeError: [WinError 232] The pipe is being closed
distributed.nanny - ERROR - Nanny failed to start process
Traceback (most recent call last):
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\nanny.py", line 575, in start
await self.process.start()
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\process.py", line 34, in _call_and_set_future
res = func(*args, **kwargs)
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\site-packages\distributed\process.py", line 202, in _start
process.start()
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\multiprocessing\process.py", line 112, in start
self._popen = self._Popen(self)
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\multiprocessing\context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\multiprocessing\context.py", line 322, in _Popen
return Popen(process_obj)
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\multiprocessing\connection.py", line 948, in reduce_pipe_connection
dh = reduction.DupHandle(conn.fileno(), access)
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\multiprocessing\connection.py", line 170, in fileno
self._check_closed()
File "C:\Users\thakneh\AppData\Local\Continuum\anaconda3\lib\multiprocessing\connection.py", line 136, in _check_closed
raise OSError("handle is closed")
OSError: handle is closed
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
我的 dask 数据框有 1000 万行。有什么办法可以解决这个错误。
【问题讨论】:
-
请提供一个最小的、可重现的例子。考虑到集群的内存限制,您可能需要为数据帧使用更多分区。
标签: python dask dask-dataframe