【Question title】:Why does mpirun freeze in a loop?
【Posted】:2018-04-29 11:18:42
【Question】:

Here are my shell script and my Python code.

$ cat go

while true
do
echo "------->"
python3 -m mpi4py ./go.py
echo "<------"
done

This script runs go.py with python3 in an endless loop.

$ cat go.py

import mpi4py.MPI as MPI

print( "######", MPI.Is_initialized())

comm = MPI.COMM_WORLD
comm_rank = comm.Get_rank()
comm_size = comm.Get_size()

# point to point communication
data_send = [comm_rank]*5
comm.send(data_send,dest=(comm_rank+1)%comm_size)
data_recv =comm.recv(source=(comm_rank-1)%comm_size)
print("my rank is %d, and Ireceived:" % comm_rank)
print( data_recv )

MPI.Finalize()

print( "######", MPI.Is_finalized())

The Python code only passes a small list between the ranks and prints what each rank received.

When I run the go script, go.py executes and exits once; when go.py is executed a second time, it hangs.

$ mpirun --mca orte_base_help_aggregate 0 -np 2 sh ./go

------->
------->
--------------------------------------------------------------------------
[[27909,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: myvm20

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[[27909,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: myvm20

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
###### True
###### True
my rank is 0, and Ireceived:
[1, 1, 1, 1, 1]
my rank is 1, and Ireceived:
[0, 0, 0, 0, 0]
###### True
###### True
<------
------->
<------
------->
--------------------------------------------------------------------------
[[27909,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: myvm20

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
[[27909,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: myvm20

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------

At this point it freezes forever.

Why does it hang, and how can I get this script to keep going?

By the way: I have two kinds of jobs to run, A and B. Job A persists, while job B finishes and exits. So I cannot run them like this:

while true
do
  echo "------->"
  mpirun -np 2 A : -np 2 B
  echo "<------"
done

Is there another way?

【Question discussion】:

    Tags: python linux mpi


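    【Solution 1】:

    Long story short, you cannot run the loop inside mpirun like that.

    Put the loop outside and start a fresh mpirun for every iteration instead: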

    while true
    do
      echo "------->"
      mpirun --mca orte_base_help_aggregate 0 -np 2 python3 -m mpi4py ./go.py
      echo "<------"
    done
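
The same idea (loop outside, one fresh mpirun per iteration) can also be driven from Python, which makes it easy to stop when a launch fails. This is only a minimal sketch: the helper names `build_mpirun_command` and `run_repeatedly` and the `runner` parameter are made up for illustration, and the mpirun options are simply the ones copied from the shell loop above.

```python
import subprocess
import sys


def build_mpirun_command(script="./go.py", nprocs=2):
    """Return the argv list for one mpirun launch of the MPI script."""
    return [
        "mpirun",
        "--mca", "orte_base_help_aggregate", "0",
        "-np", str(nprocs),
        "python3", "-m", "mpi4py", script,
    ]


def run_repeatedly(command, max_runs=None, runner=subprocess.run):
    """Launch `command` again and again, each time as a fresh MPI job.

    Stops after `max_runs` successful launches (None means loop forever)
    or as soon as one launch exits with a nonzero status.
    Returns the number of launches that exited with status 0.
    """
    completed = 0
    while max_runs is None or completed < max_runs:
        print("------->", flush=True)
        result = runner(command)  # blocks until this mpirun job has exited
        print("<------", flush=True)
        if result.returncode != 0:
            print("mpirun exited with status", result.returncode, file=sys.stderr)
            break
        completed += 1
    return completed
```

Calling `run_repeatedly(build_mpirun_command())` mirrors the `while true` loop; passing `max_runs=3` would stop after three successful runs. The `runner` argument exists only so the loop can be exercised without a real MPI installation.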
    

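    【Discussion】:

    • I have two kinds of jobs to run, A and B: job A persists, while job B finishes and exits. So I cannot run them as mpirun -np 2 A : -np 2 B. Is there another way?
    • If I understand correctly, you can run mpirun -np 2 A, where A enters a loop, calls MPI_Comm_spawn() to start B on two tasks, waits for B to finish, calls MPI_Comm_disconnect(), and then iterates again. Since this is a different question, you should open a new question and write a minimal example of A and B that clarifies what you mean by "persist".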
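
The spawn-based approach from the last comment might look roughly like this in mpi4py. It is a sketch under assumptions, not a tested recipe: a single file plays both roles (A when it has no parent communicator, B when it was spawned), a barrier on the intercommunicator stands in for "wait for B to finish", and the function names and the `--round` argument are hypothetical. Whether spawning works, and whether mpirun needs extra slots for the spawned ranks, depends on the MPI installation.

```python
import sys


def spawn_arguments(script_path, iteration):
    """argv handed to each spawned B process (hypothetical --round tag)."""
    return [script_path, "--round", str(iteration)]


def run_job_a(script_path, rounds=3, workers=2):
    """Persistent job A: spawn B, wait for it, disconnect, then repeat."""
    from mpi4py import MPI  # imported lazily so the helper above needs no MPI

    world = MPI.COMM_WORLD
    for iteration in range(rounds):
        # Collective over all ranks of A; rank 0 supplies the command line.
        children = world.Spawn(
            sys.executable,
            args=spawn_arguments(script_path, iteration),
            maxprocs=workers,
            root=0,
        )
        children.Barrier()     # returns once every B rank has reached its barrier
        children.Disconnect()  # drop the link so the B processes can exit


def run_job_b():
    """Short-lived job B: do the work, signal the parent, disconnect."""
    from mpi4py import MPI

    parent = MPI.Comm.Get_parent()
    print("B rank %d, round %s" % (parent.Get_rank(), sys.argv[-1]), flush=True)
    parent.Barrier()
    parent.Disconnect()


def main(argv):
    """Act as A when started by mpirun, as B when started by Spawn."""
    from mpi4py import MPI

    if MPI.Comm.Get_parent() == MPI.COMM_NULL:
        run_job_a(argv[0])
    else:
        run_job_b()
```

To try it, add a call to `main(sys.argv)` at the bottom of the file and start it once with something like `mpirun -np 2 python3 <this file>`; A then stays alive across rounds while each pair of B processes comes and goes.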