Performance: While Loop with AWK (answers)

【Question title】: Performance: While Loop with AWK
【Posted】: 2017-10-23 18:45:49
【Question】:

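I have a file that will be imported into a database table, but I want each row to hold a piece of the file. For the import, I need to specify the offset (first byte) and the length (number of bytes) for each row.

I have the following files: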

*line_numbers.txt* -> Each row contains the number of 
                      the last row of a record in *plans.txt*.

*plans.txt* ->  All the information required for all the rows.

I have the following code:

#Starting line number of the record
sLine=0

#Starting byte value of the record
offSet=0

while read line
do
    endByte=`awk -v fline=${sLine} -v lline=${line} \
                 '{if (NR > fline && NR < lline) \
                      sum += length($0); } \
                 END {print sum}' plans.txt`
    echo "\"plans.txt.${offSet}.${endByte}/\"" >> lobs.in
    sLine=$((line+1))
    offSet=$((endByte+offSet))
done < line_numbers.txt

This code writes something like the following to the file lobs.in:

"plans.txt.0.504/"
"plans.txt.505.480/"
"plans.txt.984.480/"
"plans.txt.1464.1159/"
"plans.txt.2623.515/"

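This means, for example, that the first record starts at byte 0 and runs for the next 504 bytes. The next one starts at byte 505 and runs for the next 480 bytes.

I still need to run more tests, but it seems to be working. My problem is that it is very slow for the volume I need to process.

Do you have any performance tips?

I looked for a way to put the loop inside awk, but I need 2 input files and I don't know how to handle that without the while.

Thanks!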

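【Comments on the question】:

See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons not to use a shell loop to manipulate text. Just use awk. Please edit your question to include concise, testable sample input and expected output so we can help you.

Tags: shell awk sh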

【Solution 1】:

Doing it all in awk will be much faster.

Suppose you have:

$ cat lines.txt
100
200
300
360
10000
50000

And:

$ awk -v maxl=50000 'BEGIN{for (i=1;i<=maxl;i++) printf "Line %d\n", i}' >data.txt

(so the file data.txt contains Line 1\nLine 2\n...Line maxl)

You would do something like this:

awk 'FNR==NR{lines[FNR]=$1; next}
            {data[FNR]=length($0); next}
     END{ sl=1
          for (i=1; i in lines; i++) {
               bc=0
               for (j=sl; j<=lines[i]; j++){
                   bc+=data[j]
               }
               printf "line %d to %d is %d bytes\n", sl, j-1, bc
               sl=lines[i]+1
          }    
}' lines.txt data.txt
line 1 to 100 is 692 bytes
line 101 to 200 is 800 bytes
line 201 to 300 is 800 bytes
line 301 to 360 is 480 bytes
line 361 to 10000 is 86122 bytes
line 10001 to 50000 is 400000 bytes
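
If the goal is the lobs.in format from the question, the same two-file idea can probably be done in a single streaming pass, without storing every line length in memory. What follows is only a sketch under a few assumptions: the two sample files are made up for illustration, every line ends in a single newline byte that should count toward the record length, and LC_ALL=C is set so that length() counts bytes rather than characters.

```shell
#!/bin/sh
# Hypothetical sample inputs so the sketch runs on its own.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '2\n3\n5\n' > line_numbers.txt        # last line number of each record
printf 'aa\nbbb\nc\ndddd\nee\n' > plans.txt   # five lines of data

LC_ALL=C awk '
    BEGIN { rec = 1 }
    # First file: remember the last line number of each record.
    FNR == NR { last[FNR] = $1 + 0; next }
    # Second file: add up bytes as lines stream by.
    {
        len += length($0) + 1          # +1 for the newline (assumption)
        if (FNR == last[rec]) {        # record complete: emit offset and length
            printf "\"plans.txt.%d.%d/\"\n", off, len
            off += len                 # next record starts where this one ended
            len = 0
            rec++
        }
    }
' line_numbers.txt plans.txt > lobs.in

cat lobs.in
```

With real data, drop the two printf lines that create the sample files. plans.txt is read once here, whereas the shell loop in the question reads it again for every record.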

【Comments】:

【Solution 2】:

A simple improvement: never redirect with >> inside a loop when you can redirect with >> outside the loop. Worse:

while read line
do
    # .... stuff omitted ... 
    echo "\"plans.txt.${offSet}.${endByte}/\"" >> lobs.in
    # ....
done < line_numbers.txt

Note that the only line in the loop that outputs anything is the echo. Better:

while read line
do
    # .... stuff omitted ... 
    echo "\"plans.txt.${offSet}.${endByte}/\""
    # ....
done < line_numbers.txt >> lobs.in
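
To check that the two forms really append the same content, a small self-contained comparison along these lines may help. It is only a sketch: the input file and the names inside.in and outside.in are hypothetical, and the body of the loop is reduced to the echo.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Hypothetical input: 1000 line numbers.
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$tmp/line_numbers.txt"
# Both targets start with one existing line, to show that >> keeps it.
echo 'existing' > "$tmp/inside.in"
echo 'existing' > "$tmp/outside.in"

# Redirect inside the loop: the file is opened and closed on every iteration.
while read -r line; do
    echo "\"plans.txt.${line}/\"" >> "$tmp/inside.in"
done < "$tmp/line_numbers.txt"

# Redirect outside the loop: the file is opened once for the whole loop.
while read -r line; do
    echo "\"plans.txt.${line}/\""
done < "$tmp/line_numbers.txt" >> "$tmp/outside.in"

cmp "$tmp/inside.in" "$tmp/outside.in" && echo "identical output"
```

Wrapping each loop in time should show how much the repeated open and close costs on a given system.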

【Comments】:

Do not assume that lobs.in is empty before the loop; use >> outside the loop as well.