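[Posted]: 2019-01-25 11:06:27
[Question]:
I am trying to use Python's multiprocessing to apply a function in parallel to 5 cross-validation sets, and to repeat this for different parameter values, as follows: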
import pandas as pd
import numpy as np
import multiprocessing as mp
from sklearn.model_selection import StratifiedKFold
# simulated datasets
X = pd.DataFrame(np.random.randint(2, size=(3348,868), dtype='int8'))
y = pd.Series(np.random.randint(2, size=3348, dtype='int64'))
# dummy function to apply
def _work(args):
    del(args)
for C in np.arange(0.0,2.0e-3,1.0e-6):
    splitter = StratifiedKFold(n_splits=5)
    with mp.Pool(processes=5) as pool:
        pool_results = \
            pool.map(
                func=_work,
                iterable=((C,X.iloc[train_index],X.iloc[test_index]) for train_index, test_index in splitter.split(X, y))
            )
But partway through execution I get the following error:
Traceback (most recent call last):
  File "mre.py", line 19, in <module>
    with mp.Pool(processes=5) as pool:
  File "/usr/lib/python3.5/multiprocessing/context.py", line 118, in Pool
    context=self.get_context())
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 168, in __init__
    self._repopulate_pool()
  File "/usr/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
    w.start()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.5/multiprocessing/context.py", line 267, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
I'm running this on Ubuntu 16.04 with 32 GB of RAM, and when I check htop during execution the usage never goes above 18.5 GB, so I don't think I'm running out of memory.
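htop shows a system-wide snapshot at the moment it is read, so a short spike just before the failing fork could be missed. As a rough cross-check, the peak resident memory of the script itself could be logged from inside the loop. This is only a sketch: the helper name `peak_rss_mib` is hypothetical, and it assumes Linux, where `ru_maxrss` is reported in kilobytes.

```python
import resource


def peak_rss_mib():
    """Return this process's peak resident set size in MiB.

    Assumption: Linux, where ru_maxrss is expressed in kilobytes
    (on macOS the same field is in bytes).
    """
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_maxrss / 1024.0


# Example: print the peak once per outer-loop iteration.
print("peak RSS so far: %.1f MiB" % peak_rss_mib())
```

Calling `peak_rss_mib()` at the top of each iteration of the loop over C would show whether the parent process keeps growing before the fork fails.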
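It is definitely caused by splitting my dataframe with the indices from splitter.split(X, y), because when I pass my dataframe directly to the Pool object, no error is raised.
I saw this answer saying that it could be due to too many file dependencies being created, but I don't know how to fix that. Shouldn't the context manager help avoid this kind of problem?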
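One variant that might be worth trying: the loop above builds and tears down a new Pool for every value of C, which means roughly 2,000 pools and about 10,000 forked workers, and each task pickles two DataFrame slices. The sketch below creates the pool once and sends only the index arrays to the workers. It is an untested idea rather than a confirmed fix. The function name `run_sweep` and the body of `_work` are hypothetical, and the explicit "fork" start method is an assumption chosen to match the `popen_fork` frames in the traceback (Unix only).

```python
import multiprocessing as mp


def _work(args):
    """Hypothetical stand-in for the real per-fold work."""
    C, train_index, test_index = args
    # A real worker would slice the data here using the indices.
    return C, len(train_index), len(test_index)


def run_sweep(param_values, folds, processes=5):
    """Apply _work to every (parameter, fold) pair with one shared pool."""
    # Assumption: the "fork" start method, as in the traceback (Unix only).
    ctx = mp.get_context("fork")
    results = []
    with ctx.Pool(processes=processes) as pool:
        for C in param_values:
            # Only small index sequences are pickled, not DataFrame slices.
            tasks = [(C, train, test) for train, test in folds]
            results.append(pool.map(_work, tasks))
    return results
```

With the data from the question, `folds` could be `list(StratifiedKFold(n_splits=5).split(X, y))`, computed once before the loop. With the fork start method, workers inherit a module-level `X` from the parent, so they can slice it themselves.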
[Discussion]:
Tags: python scikit-learn multiprocessing