【Question title】: Error while initializing Ray on an EC2 master node
【Posted】: 2019-04-30 01:16:54
【Question】:

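I am using Ray to run a parallel loop on an Ubuntu 14.04 cluster on AWS EC2. The following Python 3 script runs fine on my local machine with only 4 workers (imports and local initialization omitted):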

ray.init()           #initialize Ray

@ray.remote
def test_loop(n):
    c=tests[n,0]                            
    tout=100                
    rc=-1   

    with tmp.TemporaryDirectory() as path: #Create a temporary directory        
        for fname in filelist:        #then copy in all of the
            sh.copy(fname,path)       #files (one file per call, not the whole list)
        txtfile=path+'/inputf.txt'    #create the external
        fileId=open(txtfile,'w')      #data input text file,
        s='Number = '+str(c)+"\n"     #write test number,           
        fileId.write(s)
        fileId.close()                #close external parameter file,
        os.chdir(path)                #and change working directory

        try:                                    #Try running simulation:
            rc=sp.call('./simulation.run',timeout=tout,stdout=sp.DEVNULL,
                       stderr=sp.DEVNULL,shell=True) #(must use .call for timeout)
            outdat=sio.loadmat('outputf.dat')   #get the output data struct
            rt_Data=outdat.get('rt_Data')       #extract simulation output
            err=float(rt_Data[-1])              #use final value of error
        except:                                 #If system fails to execute,
            err=deferr                          #use failure default 
        #end try

        if (err<=0) or (err>deferr) or (rc!=0): 
            err=deferr                          #Catch other types of failure
    return err 

if __name__=='__main__':
    result=ray.get([test_loop.remote(n) for n in range(0,ntest)])
    print(result)

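What is unusual here is that simulation.run must read a different test number from an external text file at run time. The file name is the same for every iteration of the loop, but the test number differs.

I launched an EC2 cluster using Ray, with the number of available CPUs equal to n (I trust that Ray will not default to multithreading). I then had to use rsync to copy the file list (including the Python script) from my local machine to the master node, because I could not do this from the config (see the recent question "Ray not launching workers on EC2"). I then ssh'd into that node and ran the script. The result is a file-finding error: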

~$ python3 test_small.py
2019-04-29 23:39:27,065 WARNING worker.py:1337 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
2019-04-29 23:39:27,065 INFO node.py:469 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-29_23-39-27_3897/logs.
2019-04-29 23:39:27,172 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:42930 to respond...
2019-04-29 23:39:27,281 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:47779 to respond...
2019-04-29 23:39:27,282 INFO services.py:804 -- Starting Redis shard with 0.21 GB max memory.
2019-04-29 23:39:27,296 INFO node.py:483 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-29_23-39-27_3897/logs.
2019-04-29 23:39:27,296 INFO services.py:1427 -- Starting the Plasma object store with 0.31 GB memory using /dev/shm.
(pid=3917) sh: 0: getcwd() failed: No such file or directory
2019-04-29 23:39:44,960 ERROR worker.py:1672 -- Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 909, in _process_task
self._store_outputs_in_object_store(return_object_ids, outputs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 820, in _store_outputs_in_object_store
self.put_object(object_ids[i], outputs[i])
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 375, in put_object
self.store_and_register(object_id, value)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 309, in store_and_register
self.task_driver_id))
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 238, in get_serialization_context
_initialize_serialization(driver_id)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 1148, in _initialize_serialization
serialization_context = pyarrow.default_serialization_context()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/serialization.py", line 326, in default_serialization_context
register_default_serialization_handlers(context)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/serialization.py", line 321, in register_default_serialization_handlers
_register_custom_pandas_handlers(serialization_context)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/pyarrow_files/pyarrow/serialization.py", line 129, in _register_custom_pandas_handlers
import pandas as pd
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/__init__.py", line 42, in <module>
from pandas.core.api import *
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/api.py", line 10, in <module>
from pandas.core.groupby import Grouper
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/groupby.py", line 49, in <module>
from pandas.core.frame import DataFrame
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 74, in <module>
from pandas.core.series import Series
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 3042, in <module>
import pandas.plotting._core as _gfx  # noqa
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/plotting/__init__.py", line 8, in <module>
from pandas.plotting import _converter
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/plotting/_converter.py", line 7, in <module>
import matplotlib.units as units
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 1060, in <module>
rcParams = rc_params()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 892, in rc_params
fname = matplotlib_fname()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 736, in matplotlib_fname
for fname in gen_candidates():
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/matplotlib/__init__.py", line 725, in gen_candidates
yield os.path.join(six.moves.getcwd(), 'matplotlibrc')
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

The problem then seems to repeat for all the other workers, and eventually it gives up:

AttributeError: module 'pandas' has no attribute 'core'

  This error is unexpected and should not have happened. Somehow a worker
  crashed in an unanticipated way causing the main_loop to throw an exception,
  which is being caught in "python/ray/workers/default_worker.py".

2019-04-29 23:44:08,489 ERROR worker.py:1672 -- A worker died or was killed while executing task 000000002d95245f833cdbf259672412d8455d89.
Traceback (most recent call last):
  File "test_small.py", line 82, in <module>
result=ray.get([test_loop.remote(n) for n in range(0,ntest)])
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2184, in get
raise value
ray.exceptions.RayWorkerError: The worker died unexpectedly while executing this task.

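I suspect that I am not initializing Ray correctly. I tried ray.init(redis_address="172.31.50.149:6379"), which is the redis address given when the cluster was formed, but the error was more or less the same. I also tried starting Ray on the master (in case it needed to be started):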

~$ ray start --redis-address 172.31.50.149:6379 #Start Ray
2019-04-29 23:46:20,774 INFO services.py:407 -- Waiting for redis server at 172.31.50.149:6379 to respond...
2019-04-29 23:48:29,076 INFO services.py:412 -- Failed to connect to the redis server, retrying.

...and so on.

【Comments】:

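  • It looks like there may be several errors here. For the pandas error, you could try pip install -U pandas or pip uninstall pandas (which version of pandas are you using?). You could try the same with matplotlib. As for the sh: 0: getcwd() failed: No such file or directory error, which may be the real problem: are you running this script from a directory that has been deleted? In fact, is the directory structure on your local machine different from that on the other machines?
  • I am importing numpy, scipy.io, subprocess, tempfile, shutil, os, boto3 and ray. Apart from the last two, which I installed explicitly in the cluster config, I assume the rest are in the Anaconda3 installation. I am not aware of using pandas or matplotlib, but perhaps ray expects them to be there. I will try installing them on the master in the config.
  • The script runs in the root directory on the master, whereas on the local machine it runs in the home directory of a virtual environment. However, the script sets up a temporary subdirectory for each worker when test_loop starts. The main difference between the local machine and the cluster is that most of the cluster's workers are on another node, so I was hoping the cluster architecture would let them access the working directory on the master. If that is not possible, I need a way to distribute the data files to each worker's default working directory.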
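
One possible reading of the getcwd() failure raised in the first comment: test_loop calls os.chdir(path) inside the TemporaryDirectory block, so once the block exits and the directory is deleted, the worker process may be left in a working directory that no longer exists. The following is only a sketch of how that could be avoided, assuming the simulation can be started with the cwd argument of subprocess instead of os.chdir. The name run_in_tempdir and the stand-in command are hypothetical and not part of the original script.

```python
import subprocess as sp
import sys
import tempfile as tmp
from pathlib import Path


def run_in_tempdir(cmd, test_number, timeout=100):
    """Run cmd inside a fresh temporary directory without calling os.chdir.

    run_in_tempdir and cmd are hypothetical stand-ins for the body of
    test_loop and for './simulation.run' in the question.
    """
    with tmp.TemporaryDirectory() as path:
        # Write the per-test input file, as the question does.
        Path(path, "inputf.txt").write_text("Number = " + str(test_number) + "\n")
        # cwd= sets the directory for the child process only, so the calling
        # process never sits in a directory that is deleted when the
        # with-block exits.
        done = sp.run(cmd, cwd=path, timeout=timeout,
                      stdout=sp.PIPE, stderr=sp.DEVNULL)
    return done.returncode, done.stdout.decode()


if __name__ == "__main__":
    # Stand-in command: a child Python process that echoes the input file.
    echo = [sys.executable, "-c", "print(open('inputf.txt').read().strip())"]
    print(run_in_tempdir(echo, 7))
```

With this shape the parent's working directory is unchanged after the call, so later imports that call os.getcwd() in the same worker would not hit a missing directory. Whether this is what actually went wrong on the cluster is not confirmed by the thread.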

Tags: python-3.x amazon-ec2 ray


【Answer 1】:

Installing pandas and matplotlib on the master node seems to have solved the problem. Ray now initializes successfully.

【Comments】:
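
A quick way to see whether those two packages can be located by the interpreter that runs the script is sketched below. The helper name missing_modules is made up for illustration, and importlib.util.find_spec only confirms that a module can be found, not that importing it will succeed.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # pandas and matplotlib are the two packages named in the answer.
    print(missing_modules(["pandas", "matplotlib"]))
```

An empty list means both were found. Running the same check on each node, with the same python3 that Ray uses, would show whether the environments differ.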

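  • I had thought the cause might be an incorrect redis address. When one connects to the master node over ssh, one lands on the redis IP, so that appears to be the correct address to use for ray.init.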
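
To rule the address in or out before calling ray.init, a plain TCP check against the redis host and port can help. This is a minimal sketch using only the standard library; can_connect is a hypothetical helper, and a successful connection only shows that something is listening on that port, not that it is the Ray redis server.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


if __name__ == "__main__":
    # Address taken from the question; substitute your own head-node address.
    print(can_connect("172.31.50.149", 6379))
```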