Run bash commands in parallel, track results and count

【Question title】: Run bash commands in parallel, track results and count
【Posted】: 2011-06-17 09:45:44
【Question】:

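I'd like to know how, if possible, I can create a simple job manager in BASH to process several commands in parallel. That is, I have a big list of commands to run, and I'd like to have two of them running at any given time.

I know quite a bit about bash, so here are the requirements that make it tricky:

The commands have variable running times, so I can't just spawn 2, wait, and then continue with the next two. As soon as one command is done, the next command must be run.
The controlling process needs to know the exit code of each command, so that it can keep a total of how many failed.

I'm thinking I can somehow use trap, but I don't see an easy way to get the exit value of a child inside the handler.

So, any ideas on how this can be done?

Well, here is some proof-of-concept code that should probably work, but it breaks bash: it generates invalid command lines, hangs, and sometimes dumps core.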

# need monitor mode for trap CHLD to work
set -m
# store the PIDs of the children being watched
declare -a child_pids

function child_done
{
    echo "Child $1 result = $2"
}

function check_pid
{
    # check if running
    kill -s 0 $1
    if [ $? == 0 ]; then
        child_pids=("${child_pids[@]}" "$1")
    else
        wait $1
        ret=$?
        child_done $1 $ret
    fi
}

# check by copying pids, clearing list and then checking each, check_pid
# will add back to the list if it is still running
function check_done
{
    to_check=("${child_pids[@]}")
    child_pids=()

    for ((i=0;$i<${#to_check};i++)); do
        check_pid ${to_check[$i]}
    done
}

function run_command
{
    "$@" &
    pid=$!
    # check this pid now (this will add to the child_pids list if still running)
    check_pid $pid
}

# run check on all pids anytime some child exits
trap 'check_done' CHLD

# test
for ((tl=0;tl<10;tl++)); do
    run_command bash -c "echo FAIL; sleep 1; exit 1;"
    run_command bash -c "echo OKAY;"
done

# wait for all children to be done
wait

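Note that this isn't what I ultimately want, but it would be the basis for getting there.

Follow-up: I've implemented a system to do this in Python, so anybody using Python for scripting can have the above functionality. Refer to shelljob.

【Question comments】:

You can use the shell built-in 'wait' command to reap each child and get its exit status, but you need to wait on a specific pid, otherwise it won't return until all children have exited. You don't want to wait in a signal handler, though. This is tricky in bash, and honestly easier to do in C.
Well, if I could get the PID in the signal handler I think I'd be okay, but I don't see the PID anywhere. I know it can be done easily in other languages, but I'm trying to make an extension to a bash script.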
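
The first comment's point about waiting on a specific pid can be sketched roughly as follows. This is only an illustration, assuming bash 4 or later for the associative array; the sample commands and the status_of name are made up for the example. It collects every exit status, but it does not start a new command as soon as one finishes.

```bash
#!/usr/bin/env bash
# Sketch: start a few jobs, remember their PIDs, then wait on each PID
# individually so that every exit status can be recorded.
declare -A status_of    # pid -> exit status (hypothetical name, needs bash 4+)
pids=()

for cmd in "exit 0" "exit 2" "sleep 0.1; exit 5"; do
    bash -c "$cmd" &
    pids+=("$!")
done

failed=0
for pid in "${pids[@]}"; do
    # `wait PID` returns the exit status of that particular child,
    # even if the child finished before wait was called.
    if wait "$pid"; then
        status_of[$pid]=0
    else
        status_of[$pid]=$?
        failed=$((failed + 1))
    fi
    echo "pid $pid exited with ${status_of[$pid]}"
done

echo "failed=$failed of ${#pids[@]}"
```

With the three sample commands, two of them exit non-zero, so the final line should report failed=2 of 3.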

Tags: bash

【Solution 1】:

GNU Parallel is awesome:

$ parallel -j2 < commands.txt
$ echo $?

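It sets the exit status to the number of commands that failed. If you have more than 253 commands, check out --joblog. If you don't know all the commands up front, check out --bg.

【Comments】:

Thanks a lot for the reference. This command looks great. I'll see if I can adapt my script.
FWIW, something like xargs -P2 -n1 -d '\n' sh -c < commands.txt can be used as a poor man's replacement for parallel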
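
The xargs suggestion does not count failures by itself. One way it might be extended, as a rough sketch: wrap each command so that a failing line is appended to a log file, then count the lines. This assumes GNU xargs (-d and -P are not POSIX); FAILED_LOG, workdir and the sample command list are names invented for the example.

```bash
#!/usr/bin/env bash
# Sketch: "poor man's parallel" with GNU xargs that also counts failures.
workdir=$(mktemp -d)
export FAILED_LOG="$workdir/failed"   # hypothetical log of failed commands
: > "$FAILED_LOG"

# Sample command list, one shell command per line.
cat > "$workdir/commands.txt" <<'EOF'
true
false
sh -c 'exit 3'
echo hello
EOF

# -P2 keeps two commands running at a time; -d '\n' with -n1 passes exactly
# one input line to each invocation. The wrapper runs that line and appends
# it to FAILED_LOG when it exits non-zero.
xargs -P2 -n1 -d '\n' \
    sh -c 'sh -c "$1" || printf "%s\n" "$1" >> "$FAILED_LOG"' _ \
    < "$workdir/commands.txt"

total=$(wc -l < "$workdir/commands.txt")
failed=$(wc -l < "$FAILED_LOG")
echo "total=$((total)) failed=$((failed))"
rm -rf "$workdir"
```

Because the wrapper always succeeds, xargs keeps going after a failure, and the log doubles as a list of which commands failed.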

【Solution 2】:

Can I persuade you to use make? This has the advantage that you can tell it how many commands to run in parallel (modify the -j number)

echo -e ".PHONY: c1 c2 c3 c4\nall: c1 c2 c3 c4\nc1:\n\tsleep 2; echo c1\nc2:\n\tsleep 2; echo c2\nc3:\n\tsleep 2; echo c3\nc4:\n\tsleep 2; echo c4" | make -f - -j2

Stick it in a Makefile and it will be much more readable

.PHONY: c1 c2 c3 c4
all: c1 c2 c3 c4
c1:
        sleep 2; echo c1
c2:
        sleep 2; echo c2
c3:
        sleep 2; echo c3
c4:
        sleep 2; echo c4

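Note that those are not spaces at the start of the lines, they are a TAB, so cutting and pasting from here won't work.

Put an "@" in front of each command if you don't want the command itself echoed. For example: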

        @sleep 2; echo c1

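This will stop at the first command that fails. If you need a count of the failures, you'd need to engineer that into the makefile somehow. Perhaps something like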

command || echo F >> failed

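Then check the length of the failed file.

【Comments】:

No, this doesn't meet my requirements. The command lines are all generated, and I need to keep totals of failures and successes. Plus, I don't want to stop running if one of the children fails.
"command || echo F >> failed" will make them continue on failure. What do you mean by generated commands? What has that got to do with it?
I suppose I could generate the makefile from the bash script. I don't have much control over the output, though. Also, I still don't have an easy way to count the results (totals and failures). I'm not saying it can't work, it's just not a simple solution.

【Solution 3】:

The problem you have is that you cannot wait for one of several background processes to complete. If you observe the job status (using jobs), finished background jobs are removed from the job list. You need another mechanism to determine whether a background job has finished.

The following example starts two background processes (sleep). It then loops, using ps to see whether they are still running. If one is not, it uses wait to gather the exit code and starts a new background process.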

#!/bin/bash

sleep 3 &
pid1=$!
sleep 6 &
pid2=$!

while ( true ) do
    running1=`ps -p $pid1 --no-headers | wc -l`
    if [ $running1 == 0 ]
    then
        wait $pid1
        echo process 1 finished with exit code $?
        sleep 3 &
        pid1=$!
    else
        echo process 1 running
    fi

    running2=`ps -p $pid2 --no-headers | wc -l`
    if [ $running2 == 0 ]
    then
        wait $pid2
        echo process 2 finished with exit code $?
        sleep 6 &
        pid2=$!
    else
        echo process 2 running
    fi
    sleep 1
done

Edit: using SIGCHLD (no polling):

#!/bin/bash

set -bm
trap 'ChildFinished' SIGCHLD

function ChildFinished() {
    running1=`ps -p $pid1 --no-headers | wc -l`
    if [ $running1 == 0 ]
    then
        wait $pid1
        echo process 1 finished with exit code $?
        sleep 3 &
        pid1=$!
    else
        echo process 1 running
    fi

    running2=`ps -p $pid2 --no-headers | wc -l`
    if [ $running2 == 0 ]
    then
        wait $pid2
        echo process 2 finished with exit code $?
        sleep 6 &
        pid2=$!
    else
        echo process 2 running
    fi
    sleep 1
}

sleep 3 &
pid1=$!
sleep 6 &
pid2=$!

sleep 1000d
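
For readers on a newer bash: since bash 4.3 the wait builtin accepts -n, which returns as soon as any one background job exits. That may remove the need for both the ps polling and the SIGCHLD trap shown above. A minimal sketch, assuming bash 4.3 or later; run_limited and the sample commands are hypothetical, and wait -n here reports only the exit status, not which PID finished.

```bash
#!/usr/bin/env bash
# Sketch only: assumes bash 4.3 or newer, where `wait -n` returns as soon as
# any one background job exits and yields that job's exit status.

total=0
failed=0

# run_limited MAX CMD... : run each CMD string with at most MAX in parallel.
# (run_limited is a hypothetical helper name, not part of bash.)
run_limited() {
    local max_jobs=$1 running=0 cmd
    shift
    for cmd in "$@"; do
        if (( running >= max_jobs )); then
            # Pool is full: block until any job exits, then record its status.
            wait -n || failed=$((failed + 1))
            running=$((running - 1))
        fi
        bash -c "$cmd" &
        running=$((running + 1))
        total=$((total + 1))
    done
    # Drain the jobs that are still running.
    while (( running > 0 )); do
        wait -n || failed=$((failed + 1))
        running=$((running - 1))
    done
}

run_limited 2 \
    "sleep 0.2; exit 1" \
    "true" \
    "sleep 0.1; exit 3" \
    "echo OKAY" \
    "false"

echo "total=$total failed=$failed"
```

Three of the five sample commands exit non-zero, so this should end with total=5 failed=3. Keeping the whole loop inside one function is a deliberate choice in this sketch: the shell then does not go back to reading the script, and tidying its job table, between starting jobs and waiting for them.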

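【Comments】:

Can this be done somehow without polling? Part of the value of running in parallel is lost if I consume a whole processor with BASH alone.
The problem here is that ChildFinished may get called before you manage to set pid1. Obviously not with sleep 3, but some random process might exit quickly (especially if it hits an error at startup)
How about using (sleep 1 && realcommand) &? That will always take at least one second before ChildFinished is called. There is still a race when the second command finishes, so perhaps set pid1 to 0 (invalid) before starting the next command and check for that in ChildFinished.
I don't like the sleep, but setting it to 0 seems okay. I'll skip 0 in the checks, and every time I start a process I'll run the check again after assigning the variable (in case it has already finished). I'll wrap this up in a few arrays and see if I can get it working the way I want.
This also assumes bash has cleaned up properly, otherwise ps might return 1 with a zombie process. It usually does, so I'll probably be okay.

【Solution 4】:

I think the following example answers some of your questions; I am looking into the rest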

(cat list1 list2 list3 | sort | uniq > list123) &
(cat list4 list5 list6 | sort | uniq > list456) &

From:

Running parallel processes in subshells

【Comments】:

【Solution 5】:

There is another package for Debian systems, named xjobs.

You might want to check it out:

http://packages.debian.org/wheezy/xjobs

【Comments】:

【Solution 6】:

If for some reason you cannot install parallel, this will work in plain shell or bash

# String to detect failure in subprocess
FAIL_STR=failed_cmd

result=$(
    (false || echo ${FAIL_STR}1) &
    (true  || echo ${FAIL_STR}2) &
    (false || echo ${FAIL_STR}3)
)
wait

if [[ ${result} == *"$FAIL_STR"* ]]; then
    failure=`echo ${result} | grep -E -o "$FAIL_STR[^[:space:]]+"`
    echo The following commands failed:
    echo "${failure}"
    echo See above output of these commands for details.
    exit 1
fi

Here true and false are placeholders for your commands. You can also echo $? along with FAIL_STR to get the command's exit status.
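
Echoing $? together with FAIL_STR might look like the sketch below. run_tagged and the job names are hypothetical helpers added for illustration, and the sketch assumes the real commands either print nothing to stdout or have their stdout redirected elsewhere, since everything on stdout ends up in result.

```bash
#!/usr/bin/env bash
# Sketch: emit "<FAIL_STR>:<job name>:<exit status>" for each failing command.
FAIL_STR=failed_cmd

# run_tagged NAME CMD... : run CMD; on failure print a marker line that
# carries the job name and the exit status. (Hypothetical helper.)
run_tagged() {
    local name=$1
    shift
    "$@" || echo "${FAIL_STR}:${name}:$?"
}

# The jobs run in parallel inside the command substitution; `wait` keeps the
# substitution open until all of them are done.
result=$(
    run_tagged job1 false &
    run_tagged job2 true &
    run_tagged job3 sh -c 'exit 7' &
    wait
)

# Count marker lines; the order of the lines depends on which job ends first.
fail_count=$(grep -c "^${FAIL_STR}:" <<< "$result")
echo "failures: $fail_count"
echo "$result"
```

Each marker is written with a single short echo, so lines from different jobs should not interleave in the pipe.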

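【Comments】:

【Solution 7】:

Yet another bash-only example for your interest. Of course, prefer GNU parallel, which offers far more features out of the box.

This solution involves creating tmp output files for collecting job status.

We use /tmp/${$}_ as the temporary file prefix; $$ is the actual parent process number, and it is the same for the whole script execution.

First, the loop that starts parallel jobs in batches. The batch size is set with max_parrallel_connection. try_connect_DB() is a slow bash function in the same file. Here we collect stdout + stderr (2>&1) for failure diagnosis.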

nb_project=$(echo "$projects" | wc -w)
i=0
parrallel_connection=0
max_parrallel_connection=10
for p in $projects
do
  i=$((i+1))
  parrallel_connection=$((parrallel_connection+1))
  try_connect_DB $p "$USERNAME" "$pass" > /tmp/${$}_${p}.out 2>&1 &

  if [[ $parrallel_connection -ge $max_parrallel_connection ]]
  then
    echo -n " ... ($i/$nb_project)"
    wait
    parrallel_connection=0
  fi
done
if [[ $nb_project -gt $max_parrallel_connection ]]
then
  # final new line
  echo
fi

# wait for all remaining jobs
wait

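Once all jobs have finished, review all the results:

SQL_connection_failed is our error convention, output by try_connect_DB(); you can filter job success or failure in whatever way best suits your needs.

Here we decided to output only the failed results, to reduce the amount of output on large jobs, especially if most or all of them pass.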

# displaying result that failed
file_with_failure=$(grep -l SQL_connection_failed /tmp/${$}_*.out)
if [[ -n $file_with_failure ]]
then
  nb_failed=$(wc -l <<< "$file_with_failure")
  # we will collect DB name from our output file naming convention, for post treatment
  db_names=""
  echo "=========== failed connections : $nb_failed/$nb_project"
  for failure in $file_with_failure
  do
    echo "============ $failure"
    cat $failure
    db_names+=" $(basename $failure | sed -e 's/^[0-9]\+_\([^.]\+\)\.out/\1/')"
  done
  echo "$db_names"
  ret=1
else
  echo "all tests passed"
  ret=0
fi

# temporary files cleanup; the files could be kept in case of error, adapt to suit your needs.
rm /tmp/${$}_*.out
exit $ret
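
The question asks for exit codes rather than a marker string in the output. One possible variation on the approach above, as a sketch: each job writes its own exit status to a .rc file next to its .out file. slow_job, the project names and the .rc suffix are hypothetical stand-ins (slow_job plays the role of try_connect_DB), and mktemp -d is used here in place of the /tmp/$$ prefix.

```bash
#!/usr/bin/env bash
# Sketch: batch-parallel jobs that each record their exit status in a file.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

# Hypothetical stand-in for try_connect_DB: fails for names containing "bad".
slow_job() {
    sleep 0.1
    echo "checking $1"
    [[ $1 != *bad* ]]
}

projects="alpha bad1 beta bad2 gamma"
max_parallel=2
running=0
for p in $projects; do
    # The subshell captures the job's output and then stores its exit status.
    ( slow_job "$p" > "$tmpdir/$p.out" 2>&1; echo $? > "$tmpdir/$p.rc" ) &
    running=$((running + 1))
    if (( running >= max_parallel )); then
        wait            # batch is full: wait for the whole batch to finish
        running=0
    fi
done
wait                    # wait for the last, possibly partial, batch

# Read the recorded statuses back and tally the failures.
nb_failed=0
failed_names=""
for rc in "$tmpdir"/*.rc; do
    if [[ $(<"$rc") -ne 0 ]]; then
        nb_failed=$((nb_failed + 1))
        failed_names+=" $(basename "$rc" .rc)"
    fi
done
echo "failed: $nb_failed ->$failed_names"
```

As in the original, a slow job holds up its whole batch, so this does not start the next command the moment a slot frees up.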

【Comments】: