Choosing random lines from a file takes too long in BASH: answers

【Question title】: Choosing random lines from file takes too long in BASH
【Posted】: 2015-01-05 23:34:21
【Question】:

So I have a script with this syntax:

./script number file

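where number is the number of lines I want to get from the file file. The lines are chosen at random and then printed twice. For a very large file (~1,000,000 lines), this algorithm runs far too slowly. I don't know why, since the printing only involves array accesses.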

#!/bin/bash

max=`wc -l $2 | cut -d " " -f1`

users=(`shuf -i 0-$max -n $1`)
pages=(`shuf -i 0-$max -n $1`)

readarray lines < $2

for (( i = 0; i < $1; i++ )); do
    echo L ${lines[${users[i]}]} ${lines[${pages[i]}]} 
done

for (( i = 0; i < $1; i++ )); do
    echo U ${lines[${users[i]}]} ${lines[${pages[i]}]} 
done

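【Comments】:

Arrays are notoriously inefficient in Bash. You should be able to do this with a for loop, using the $RANDOM Bash variable taken modulo the line count to get line numbers in range. Then you can build a string and print the lines with sed, like this: sed -n '4p;500p;245p;6773334p;34322p'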
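
The sed idea from the comment above can be sketched as follows. This is only an illustration, not the commenter's code: the helper name print_lines and the generated demo file are assumptions made here. Note that sed emits the selected lines in file order, not in the order the numbers were listed.

```shell
#!/bin/bash
# Hypothetical helper: print the given line numbers of a file in one sed pass.
# Usage: print_lines FILE N1 [N2 ...]
print_lines() {
    local file=$1; shift
    local script="" n
    for n in "$@"; do
        script+="${n}p;"      # builds e.g. "7p;2p;5p;"
    done
    sed -n "$script" "$file"
}

# Demo on a small generated file (lines "1" to "10")
demo=$(mktemp)
seq 1 10 > "$demo"
print_lines "$demo" 7 2 5     # matches come out in file order: 2, 5, 7
rm -f "$demo"
```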

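Tags: performance algorithm bash random

【Solution 1】:

The following should do what you want fairly quickly. Bash arrays are slow and are built using temporary files, so using them should not perform any better. They would be a nice feature if the Bash maintainers implemented them properly, but they haven't yet:

File (make sure to give it exactly this name, because the script calls itself recursively): ranlines.bsh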

#!/bin/bash
# Usage: ranlines.bsh number file
declare -i max=$(wc -l < "$2")+1
declare STR=""
declare -i random_line=0
declare tmp_file
tmp_file=$(mktemp)
declare -r usr_file="/tmp/_user_3434"
declare -r pgs_file="/tmp/_pgs_4343"

## seed tmp_file with 0 so that line number 0 is never chosen
echo "0" >> "$tmp_file"

for (( i = 0; i < $1; i++ )); do
 while :; do
   ## $RANDOM only has 15 bits, so combine two draws to cover large files
   random_line=$(( (RANDOM * 32768 + RANDOM) % max ))
   ## if the number is already in tmp_file (whole-line match), draw again
   grep -qx "$random_line" "$tmp_file" && continue
   echo "$random_line" >> "$tmp_file"
   break
 done
 ## build the sed print string
 STR="$STR${random_line}p;"
done
rm "$tmp_file"

if [[ $# -eq 2 ]]; then
 #usr_file
 sed -n "$STR" "$2" > "$usr_file"
 ## call us again, this time for the U (ranlines.bsh must be on PATH)
 ranlines.bsh "$1" "$2" "U"
else
 ## we know already we are processing the U because args is not 2
 declare -i random_slct=$1+1
 sed -n "$STR" "$2" > "$pgs_file"
 paste <(sed -n "${random_slct}q; a L" "$2") "$usr_file" "$pgs_file"
 paste <(sed -n "${random_slct}q; a U" "$2") "$pgs_file" "$usr_file"
 rm "$pgs_file" "$usr_file"
fi
exit 0

【Comments】:

【Solution 2】:

Just use shuf to select the lines; that's what it was designed for. For example (see the notes):

readarray users < <(shuf -n $1 "$2")
readarray pages < <(shuf -n $1 "$2")
for (( i = 0; i < $1; i++ )); do
    echo L ${users[i]} ${pages[i]} 
done
for (( i = 0; i < $1; i++ )); do
    echo U ${users[i]} ${pages[i]} 
done

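This will still be slow, because shuf needs to read the entire file to find the line endings, and you are calling it twice. It is probably still faster than reading the whole file into a bash array in memory, particularly if you don't have a lot of memory available. (It also won't work if the script's second argument is not a regular file; a pipe, for example, cannot be read twice.)

You could speed this up by selecting both sets of lines at once and then dividing them into users and pages, but you would need to do some work to get an unbiased distribution, assuming you care about that.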
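
One way the single-pass idea could look is sketched below. The function name pick_pairs is made up for this illustration, and the sketch assumes the file has at least 2*N lines. Unlike two independent shuf calls, no line can land in both the users and the pages column here, which is exactly the kind of distribution difference mentioned above; whether that is acceptable depends on the use case.

```shell
#!/bin/bash
# Hypothetical helper: draw 2*N lines in one shuf pass, then split them in half.
# Usage: pick_pairs N FILE
# Assumption: FILE has at least 2*N lines, so the two halves do not overlap.
pick_pairs() {
    local n=$1 file=$2 sample lines
    sample=$(shuf -n "$((2 * n))" "$file")
    # first N sampled lines become "users", last N become "pages"
    lines=$(paste -d' ' <(head -n "$n" <<<"$sample") <(tail -n "$n" <<<"$sample"))
    sed 's/^/L /' <<<"$lines"
    sed 's/^/U /' <<<"$lines"
}
```

It would be called as pick_pairs "$1" "$2" from the script.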

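Note 1:

As @gniourf_gniourf points out in a comment, you get a more accurate rendering of the lines by passing the -t option to readarray and then quoting the arguments to echo. Also, mapfile is the preferred name for readarray: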

mapfile -t users < <(shuf -n $1 "$2")
mapfile -t pages < <(shuf -n $1 "$2")
for (( i = 0; i < $1; i++ )); do
    echo L "${users[i]}" "${pages[i]}" 
done
for (( i = 0; i < $1; i++ )); do
    echo U "${users[i]}" "${pages[i]}"
done
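
To see what -t and the quoting change, here is a small self-contained comparison. The sample input and variable names are invented for the demonstration.

```shell
#!/bin/bash
# Compare mapfile with and without -t on the same two-line input.
mapfile raw < <(printf 'alice\nbob\n')
mapfile -t trimmed < <(printf 'alice\nbob\n')

# Without -t each element keeps its trailing newline, so it is one character longer.
echo "raw length:     ${#raw[0]}"
echo "trimmed length: ${#trimmed[0]}"

# Quoting preserves inner whitespace; the unquoted expansion is word-split.
line='a   b'
echo $line
echo "$line"
```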

Note 2:

If $1 is large, it is better not to use arrays at all. Here is one possible solution:

lines="$(paste -d' ' <(shuf -n $1 "$2") <(shuf -n $1 "$2"))"
sed 's/^/L /' <<<"$lines"
sed 's/^/U /' <<<"$lines"

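【Comments】:

Is there a speed difference between readarray users < <(shuf -n $1 lines) and readarray users <<< "$(shuf -n $1 lines)"?
@EdouardThiel: If $1 is large, the first will be faster, since the second requires an intermediate step: building a string in memory and then creating a pipe to read that string back. (At least, I think it will be faster; I haven't actually benchmarked it.)
@rici You're probably right, but it's worth a try. Horkyze, could you check with your data and tell us which way is fastest?
@gniourf_gniourf: OK, added the suggestions. IMHO, it's orthogonal to the question.
Thanks, so <<< wins in this case.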
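
For anyone who wants to measure the two forms discussed in these comments, a rough timing sketch follows. The input size and sample size are arbitrary assumptions, and results will vary with the machine and the bash version, so no particular outcome is claimed here.

```shell
#!/bin/bash
# Rough timing sketch; the generated file and the sizes are arbitrary choices.
data=$(mktemp)
seq 1 200000 > "$data"
n=50000

echo "process substitution:"
time { readarray -t a < <(shuf -n "$n" "$data"); }

echo "here-string:"
time { readarray -t b <<< "$(shuf -n "$n" "$data")"; }

# Both forms should have read the same number of lines.
echo "elements read: ${#a[@]} and ${#b[@]}"
rm -f "$data"
```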

【Solution 3】:

Maybe you can do without arrays altogether, using only file utilities and temporary files:

# Put the shuf outputs in two separate files:

shuf -n "$1" "$2" > shuf_users
shuf -n "$1" "$2" > shuf_pages

# paste the two:
paste -d ' ' shuf_users shuf_pages | sed 's/^/L /'
paste -d ' ' shuf_pages shuf_users | sed 's/^/U /'

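In @rici's solution, the culprit may also be the two loops that output the lines (for loops, for example, are very slow).

You should use mktemp to create the temporary files shuf_users and shuf_pages. This is left as an exercise for the reader.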
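
One possible take on that exercise is sketched below. The function name paired_lines is invented here; the rest follows the commands shown earlier in this answer, including the swapped column order in the second paste.

```shell
#!/bin/bash
# Sketch of the same approach using mktemp; paired_lines is a made-up name.
# Usage: paired_lines N FILE
paired_lines() {
    local n=$1 file=$2 users pages
    users=$(mktemp) || return 1
    pages=$(mktemp) || { rm -f "$users"; return 1; }

    # Put the two shuf outputs in separate temporary files
    shuf -n "$n" "$file" > "$users"
    shuf -n "$n" "$file" > "$pages"

    # Paste the two, prefixing each row with L or U
    paste -d ' ' "$users" "$pages" | sed 's/^/L /'
    paste -d ' ' "$pages" "$users" | sed 's/^/U /'

    rm -f "$users" "$pages"
}
```

Inside the script this would be invoked as paired_lines "$1" "$2".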

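【Comments】:

@Horkyze Careful, I swapped shuf_users and shuf_pages in the second paste statement.
@Horkyze: Take a look at @rici's solution in his Note 2, it's quite nice! That's how I would do it, I think, if I haven't misread your question :).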