【Question title】: How to quickly find and replace many items on a list without replacing previously replaced items in BASH?
【Posted】: 2011-12-22 13:42:44
【Question】:

I want to perform many find-and-replace operations on some text. I have a UTF-8 CSV file containing what to find (in the first column) and what to replace it with (in the second column), arranged from longest to shortest.

For example:

orange,fruit2
carrot,vegetable1
apple,fruit3
pear,fruit4
ink,item1
table,item2

Original file:

"I like to eat apples and carrots"

Resulting output file:

"I like to eat fruit3s and vegetable1s."

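However, I want to make sure that once a part of the text has been replaced, it is not mangled by a later replacement. In other words, I don't want it to look like this (where "table" was matched inside vegetable1):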

"I like to eat fruit3s and vegeitem21s."

Currently I use the following method, which is slow because I have to do the find and replace twice:

(1) Convert the CSV into three files, e.g.:

a.csv     b.csv   c.csv
orange    0001    fruit2
carrot    0002    vegetable1
apple     0003    fruit3
pear      0004    fruit4
ink       0005    item1
table     0006    item2

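(2) Then replace every item from a.csv found in file.txt with the matching entry from b.csv, wrapping the number in ZZZ so that the numbers cannot be mismatched later: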

a=1
b=`wc -l < ./a.csv`
while [ $a -le $b ]
do
    # $i is the search term (line $a of a.csv); $j is its numeric token (line $a of b.csv)
    for i in `sed -n "$a"p ./a.csv`; do
        for j in `sed -n "$a"p ./b.csv`; do
            sed -i "s/$i/ZZZ${j}ZZZ/g" ./file.txt
            echo "Instances of '$i' replaced with 'ZZZ${j}ZZZ' ($a/$b)."
            a=`expr $a + 1`
        done
    done
done

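(3) Then run the same script again, this time replacing ZZZ0001ZZZ with fruit2 from c.csv.

Running the first replacement takes about 2 hours, and since I have to run this code twice to avoid editing already-replaced items, the whole job takes twice as long. Is there a more efficient way to run a find and replace that does not perform replacements on text that has already been replaced?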
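
To make the failure mode concrete, here is a minimal Python sketch (Python and the names `pairs` and `naive_replace` are illustrative assumptions, not part of the original scripts). It applies each pair to the whole text in turn and reproduces the unwanted result described above:

```python
# Replacement pairs from the example CSV, in file order.
pairs = [
    ("orange", "fruit2"),
    ("carrot", "vegetable1"),
    ("apple", "fruit3"),
    ("pear", "fruit4"),
    ("ink", "item1"),
    ("table", "item2"),
]


def naive_replace(text, pairs):
    """Apply every pair to the whole text, one after another."""
    for old, new in pairs:
        # Later pairs also see text produced by earlier pairs.
        text = text.replace(old, new)
    return text


print(naive_replace("I like to eat apples and carrots", pairs))
```

The "table" rule runs after "carrot" has already become "vegetable1", so it matches inside the replacement text.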

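【Comments】:

What language or technology do you want to do this in?
On Linux. I don't have a particular language in mind, but I need to be sure it supports UTF-8.
How many lines are in each file?
The file to edit and the list each have 100,000 lines.

Tags: perl bash optimization replace sed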

【Solution 1】:

Here is a perl solution that does the replacement in "one phase".

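#!/usr/bin/perl
use strict;
use warnings;
my %map = (
       orange => "fruit2",
       carrot => "vegetable1",
       apple  => "fruit3",
       pear   => "fruit4",
       ink    => "item1",
       table  => "item2",
);
# Longest keys first, so a longer key wins over any shorter key that is its prefix
my $repl_rx = '(' . join("|", map { quotemeta } sort { length($b) <=> length($a) } keys %map) . ')';
my $str = "I like to eat apples and carrots";
# Each match is replaced once; replacement text is never rescanned
$str =~ s{$repl_rx}{$map{$1}}g;
print $str, "\n";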
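
For readers who prefer Python, the same single-pass idea can be sketched as follows. This is an illustrative translation, not the answer's code; the function name `replace_once` is an assumption. All keys are combined into one alternation, and the regex engine never rescans text it has already substituted:

```python
import re


def replace_once(text, mapping):
    """Replace every key of `mapping` in a single left-to-right pass."""
    if not mapping:
        return text
    # Longest keys first, so a longer key beats a shorter key that is its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # The callback looks up the replacement for whichever key matched.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


mapping = {
    "orange": "fruit2",
    "carrot": "vegetable1",
    "apple": "fruit3",
    "pear": "fruit4",
    "ink": "item1",
    "table": "item2",
}
print(replace_once("I like to eat apples and carrots", mapping))
```

Because each match is consumed exactly once, even a mapping that swaps two words (a to b and b to a) behaves as expected.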

【Discussion】:

【Solution 2】:

Tcl has a command that does exactly this: string map

tclsh <<'END'
set map {
    "orange" "fruit2"
    "carrot" "vegetable1"
    "apple" "fruit3"
    "pear" "fruit4"
    "ink" "item1"
    "table" "item2"
}
set str "I like to eat apples and carrots"
puts [string map $map $str]
END

I like to eat fruit3s and vegetable1s

Here is how to implement it in bash (associative arrays require bash v4):

declare -A map=(
    [orange]=fruit2
    [carrot]=vegetable1
    [apple]=fruit3
    [pear]=fruit4
    [ink]=item1
    [table]=item2
)
str="I like to eat apples and carrots"
echo "$str"
i=0
while (( i < ${#str} )); do
    matched=false
    for key in "${!map[@]}"; do
        if [[ ${str:$i:${#key}} = "$key" ]]; then   # quote $key so it is compared literally, not as a glob pattern
            str=${str:0:$i}${map[$key]}${str:$((i+${#key}))}
            ((i+=${#map[$key]}))
            matched=true
            break
        fi
    done
    $matched || ((i++))
done
echo "$str"

I like to eat apples and carrots
I like to eat fruit3s and vegetable1s

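This will not be fast.

Obviously, if you order the map differently, you may get different results. In fact, I believe the order of "${!map[@]}" is unspecified, so you may want to specify the order of the keys explicitly: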

keys=(orange carrot apple pear ink table)
# ...
    for key in "${keys[@]}"; do
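
The scanning loop above, with an explicit key order, can also be sketched in Python. This is an illustrative port only; `string_map` is a hypothetical name borrowed from the Tcl command, and the ordered list of pairs stands in for the `keys` array:

```python
def string_map(text, pairs):
    """Scan `text` once; at each position try the pairs in the given order."""
    out = []
    i = 0
    while i < len(text):
        for old, new in pairs:
            # Skip empty keys, which would otherwise match forever.
            if old and text.startswith(old, i):
                out.append(new)
                i += len(old)  # jump past the matched key; output is not rescanned
                break
        else:
            # No key matched at this position: copy one character through.
            out.append(text[i])
            i += 1
    return "".join(out)


pairs = [
    ("orange", "fruit2"), ("carrot", "vegetable1"), ("apple", "fruit3"),
    ("pear", "fruit4"), ("ink", "item1"), ("table", "item2"),
]
print(string_map("I like to eat apples and carrots", pairs))
```

As in the bash version, the order of `pairs` decides which key wins when two keys match at the same position.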

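【Discussion】:

【Solution 3】:

One way is to do a two-phase replacement:

phase 1:
s/orange/@@1##/
s/carrot/@@2##/
...

phase 2:
s/@@1##/fruit2/
s/@@2##/vegetable1/
...

The @@1## markers should be chosen so that they do not occur in the original text or, of course, in the replacements.

Here is a proof-of-concept implementation in perl: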

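#!/usr/bin/perl -w
#

my $repls = $ARGV[0];
die ("first parameter must be the replacement list file") unless defined ($repls);
# Marker format; single quotes keep it a plain literal string
my $tmpFmt = '@@@%d###';

open(my $replsFile, "<", $repls) || die("$!: $repls");
shift;

my @replsList;

my $i = 0;
while (<$replsFile>) {
    chomp;
    # Expects each line in the form "from","to" (both fields double-quoted)
    my ($from, $to) = /\"([^\"]*)\",\"([^\"]*)\"/;
    if (defined($from) && defined($to)) {
        push(@replsList, [$from, sprintf($tmpFmt, ++$i), $to]);
    }
}

while (<>) {
    # Phase 1: replace each search string (taken literally) with its marker
    foreach my $r (@replsList) {
        s/\Q$r->[0]\E/$r->[1]/g;
    }
    # Phase 2: replace each marker with the final replacement text
    foreach my $r (@replsList) {
        s/\Q$r->[1]\E/$r->[2]/g;
    }
    print;
}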
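
A compact Python sketch of the same two-phase idea follows. It is illustrative only: `two_phase_replace` and the `fmt` parameter are assumed names, and the up-front check is an addition that makes the marker requirement stated above explicit:

```python
def two_phase_replace(text, pairs, fmt="@@@{}###"):
    """Phase 1: key -> numbered marker. Phase 2: marker -> final replacement."""
    markers = [fmt.format(n) for n in range(1, len(pairs) + 1)]
    # Markers must not occur in the input or in any replacement string.
    for marker in markers:
        if marker in text or any(marker in new for _, new in pairs):
            raise ValueError("marker collides with data: " + marker)
    for (old, _), marker in zip(pairs, markers):
        text = text.replace(old, marker)
    for (_, new), marker in zip(pairs, markers):
        text = text.replace(marker, new)
    return text


pairs = [
    ("orange", "fruit2"), ("carrot", "vegetable1"), ("apple", "fruit3"),
    ("pear", "fruit4"), ("ink", "item1"), ("table", "item2"),
]
print(two_phase_replace("I like to eat apples and carrots", pairs))
```

One limitation of this sketch: a search key made only of marker characters (for example "1" or "#") could still match inside a marker during phase 1, so the marker format has to be chosen with the key list in mind.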

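【Discussion】:

【Solution 4】:

I would guess that most of your slowness comes from spawning so many sed commands, each of which has to process the entire file on its own. A few small adjustments to your current process, so that it runs one sed per file per step, should speed this up considerably.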

a=1
b=`wc -l < ./a.csv`
cmd=""
while [ $a -le $b ]
do
    for i in `sed -n "$a"p ./a.csv`; do
        for j in `sed -n "$a"p ./b.csv`; do
            # Append one substitution per row instead of running sed each time
            cmd="$cmd ; s/$i/ZZZ${j}ZZZ/g"
            echo "Instances of '$i' replaced with 'ZZZ${j}ZZZ' ($a/$b)."
            a=`expr $a + 1`
        done
    done
done

# Run sed once over the file with all the accumulated substitutions
sed -i "$cmd" ./file.txt

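【Discussion】:

【Solution 5】:

Doing it twice is probably not your real problem. If you managed to do it only once with your basic strategy, it would still take you an hour, right? You probably need a different technique or tool. As mentioned above, switching to Perl might make your code faster (give it a try).

But continuing down the path of the other posters, the next step could be pipelining. Write a small program that replaces one column with another, then run that program twice at the same time. The first run replaces the strings in column 1 with the strings in column 2, and the second replaces the strings in column 2 with the strings in column 3.

Your command line would look like this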

cat input_file.txt | perl replace.pl replace_file.txt 1 2 | perl replace.pl replace_file.txt 2 3 > completely_replaced.txt

And replace.pl would look like this (similar to the other solutions)

#!/usr/bin/perl -w

my $replace_file = $ARGV[0];
my $before_replace_colnum = $ARGV[1] - 1;
my $after_replace_colnum = $ARGV[2] - 1;

open(REPLACEFILE, $replace_file) || die("couldn't open $replace_file: $!");

my @replace_pairs;

# read in the list of things to replace
while(<REPLACEFILE>) {
    chomp();

    my @cols = split /\t/, $_;
    my $to_replace = $cols[$before_replace_colnum];
    my $replace_with = $cols[$after_replace_colnum];

    push @replace_pairs, [$to_replace, $replace_with];
}

# read input from stdin, do swapping
while(<STDIN>) {
    # loop over all replacement strings
    foreach my $replace_pair (@replace_pairs) {
        my($to_replace,$replace_with) = @{$replace_pair};
        $_ =~ s/${to_replace}/${replace_with}/g;
    }
    print STDOUT $_;
}

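【Discussion】:

The cat is really useless, and a single perl is enough.
Two perls enable pipelining.
You could implement the replacement as a subroutine and call it twice within one perl. You don't need the pipe at all.
That would be much slower, really. Your approach would use only a single processor/core.
You may be right. I'm going to run some performance tests... but the cat is still unnecessary. ;)

【Solution 6】:

A bash+sed approach: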

count=0
bigfrom=""
bigto=""

while IFS=, read from to; do
   read countmd5sum x < <(md5sum <<< $count)
   count=$(( $count + 1 ))
   bigfrom="$bigfrom;s/$from/$countmd5sum/g"
   bigto="$bigto;s/$countmd5sum/$to/g"
done < replace-list.csv

sed "${bigfrom:1}$bigto" input_file.txt

I chose md5sum to get some unique tokens, but other mechanisms can be used to generate such tokens too, such as reading /dev/urandom or shuf -n1 -i 10000000-20000000
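
As a rough illustration of token generation, the following Python sketch derives tokens from a counter via MD5 and skips any candidate that already occurs in the text. The function name `make_tokens` is hypothetical, and the collision check is an addition that the shell script above does not perform:

```python
import hashlib


def make_tokens(count, text):
    """Return `count` distinct hex tokens, none of which occurs in `text`."""
    tokens = []
    seen = set()
    candidate = 0
    while len(tokens) < count:
        token = hashlib.md5(str(candidate).encode("utf-8")).hexdigest()
        candidate += 1
        if token in text or token in seen:
            continue  # try the next counter value instead
        seen.add(token)
        tokens.append(token)
    return tokens


tokens = make_tokens(3, "I like to eat apples and carrots")
print(tokens)
```

Hex tokens consist only of the characters 0-9 and a-f, so a very short search key such as "ad" or "1" could still match inside a token. If the key list contains such keys, a token alphabet that cannot overlap with them would be a safer choice.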

【Discussion】:

【Solution 7】:

An awk+sed approach:

awk -F, '{a[NR-1]="s/####"NR"####/"$2"/";print "s/"$1"/####"NR"####/"}; END{for (i=0;i<NR;i++)print a[i];}' replace-list.csv > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt

A cat+sed+sed approach:

cat -n replace-list.csv | sed -rn 'H;g;s|(.*)\n *([0-9]+) *[^,]*,(.*)|\1\ns/####\2####/\3/|;x;s|.*\n *([0-9]+)[ \t]*([^,]+).*|s/\2/####\1####/|p;${g;s/^\n//;p}' > /tmp/sed_script.sed
sed -f /tmp/sed_script.sed input.txt

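Mechanism:

First, a sed script is generated, using the csv as the input file.
Then another sed instance uses that script to operate on input.txt.

Notes:

The generated intermediate file, sed_script.sed, can be reused as long as the input csv file does not change.
####<number>#### was chosen as a pattern that does not occur in the input file. Change this pattern if needed.
cat -n | is not a UUOC :)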
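
The script-generation step can be sketched in Python as well. This is an illustration under assumptions: `build_sed_script` is a hypothetical name, the `g` flag is added here so that every occurrence on a line is replaced (the one-liners above emit commands without it), and keys are assumed to contain no `/` or regex metacharacters:

```python
def build_sed_script(pairs, marker="####"):
    """Return sed script text: all key->marker rules, then all marker->value rules."""
    phase1 = []
    phase2 = []
    for n, (old, new) in enumerate(pairs, start=1):
        token = "{0}{1}{0}".format(marker, n)
        phase1.append("s/{}/{}/g".format(old, token))
        phase2.append("s/{}/{}/g".format(token, new))
    return "\n".join(phase1 + phase2) + "\n"


script = build_sed_script([("apple", "fruit3"), ("table", "item2")])
print(script, end="")
```

The returned text could be written to a file and passed to `sed -f`, as in the commands above.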

【Discussion】:

【Solution 8】:

This might work for you (GNU sed):

sed -r 'h;s/./&\\n/g;H;x;s/([^,]*),.*,(.*)/s|\1|\2|g/;$s/$/;s|\\n||g/' csv_file | sed -rf - original_file

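This converts the csv file into a sed script. The trick is to turn each replacement string into a form that later substitutions cannot match again. In this case, every character of the replacement string is replaced by itself followed by \n. Finally, once all the substitutions have taken place, the \n's are removed, leaving the finished string.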
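
The interleaving trick can be illustrated with a short Python sketch. The names `shield` and `replace_with_shielding` are hypothetical, and a NUL character is assumed as the separator in place of sed's \n, on the assumption that it never occurs in the input:

```python
SENTINEL = "\x00"  # assumed never to occur in the input text


def shield(s):
    """Follow every character with the sentinel: "ab" -> "a\\x00b\\x00"."""
    return "".join(ch + SENTINEL for ch in s)


def replace_with_shielding(text, pairs):
    if SENTINEL in text:
        raise ValueError("input already contains the sentinel character")
    for old, new in pairs:
        # Inserted text is broken up by sentinels, so later keys of
        # two or more characters cannot match inside it.
        text = text.replace(old, shield(new))
    # Strip the sentinels once every substitution has been applied.
    return text.replace(SENTINEL, "")


pairs = [
    ("orange", "fruit2"), ("carrot", "vegetable1"), ("apple", "fruit3"),
    ("pear", "fruit4"), ("ink", "item1"), ("table", "item2"),
]
print(replace_with_shielding("I like to eat apples and carrots", pairs))
```

Single-character keys are not protected by this scheme, since one character of shielded text can still match them.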

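【Discussion】:

【Solution 9】:

There are plenty of cool answers here already. I'm posting this one because I took a slightly different approach, making some big assumptions about the data to replace (based on the sample data):

The words to replace don't contain spaces
Words are replaced based on the longest, exactly matching prefix
Each word to replace is represented exactly in the csv

This time, an awk-only answer with very little regex.

It reads the "repl.csv" file into an associative array (see BEGIN{}), then tries to match prefixes of each word only when the word's length falls within the key length limits, avoiding lookups in the associative array whenever possible: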

#!/bin/awk -f

BEGIN {
    while( getline repline < "repl.csv" ) {
        split( repline, replarr, "," )
        replassocarr[ replarr[1] ] = replarr[2]
            # set some bounds on the replace word sizes
        if( minKeyLen == 0 || length( replarr[1] ) < minKeyLen )
            minKeyLen = length( replarr[1] )
        if( maxKeyLen == 0 || length( replarr[1] ) > maxKeyLen )
            maxKeyLen = length( replarr[1] )
    }
    close( "repl.csv" )
}

{
    i = 1
    while( i <= NF ) { print_word( $i, i == NF ); i++ }
}

function print_word( w, end ) {
    wl = length( w )
    for( j = wl; j >= 0 && prefix_len_bound( wl, j ); j-- ) {
        key = substr( w, 1, j )
        wl = length( key )
        if( wl >= minKeyLen && key in replassocarr ) {
            printf( "%s%s%s", replassocarr[ key ],
                substr( w, j+1 ), !end ? " " : "\n" )
            return
        }
    }
    printf( "%s%s", w, !end ? " " : "\n" )
}

function prefix_len_bound( len, jlen ) {
    return len >= minKeyLen && (len <= maxKeyLen || jlen > maxKeyLen)
}

Given input like this:

I like to eat apples and carrots
orange you glad to see me
Some people eat pears while others drink ink

it produces output like this:

I like to eat fruit3s and vegetable1s
fruit2 you glad to see me
Some people eat fruit4s while others drink item1

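Of course, any "savings" from not looking in replassocarr disappear when the words to replace can be as short as 1 character, or when the average word length is much greater than the length of the words to replace.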
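
The longest-prefix lookup can be sketched in Python under the same three assumptions. This is not a port of the awk script: `replace_prefixes` is a hypothetical name, and the loop only tries prefix lengths between the shortest and longest key, which is a simplification of the awk bounds check:

```python
def replace_prefixes(line, mapping):
    """Replace the longest prefix of each whitespace-separated word found in `mapping`."""
    if not mapping:
        return line
    min_len = min(len(k) for k in mapping)
    max_len = max(len(k) for k in mapping)
    out = []
    for word in line.split():
        # Try candidate prefixes from longest to shortest, within the key length bounds.
        for j in range(min(len(word), max_len), min_len - 1, -1):
            prefix = word[:j]
            if prefix in mapping:
                word = mapping[prefix] + word[j:]
                break
        out.append(word)
    return " ".join(out)


mapping = {
    "orange": "fruit2", "carrot": "vegetable1", "apple": "fruit3",
    "pear": "fruit4", "ink": "item1", "table": "item2",
}
print(replace_prefixes("Some people eat pears while others drink ink", mapping))
```

Because only word prefixes are checked, "drink" is left alone even though it contains "ink". Note that this sketch normalizes runs of whitespace to single spaces.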

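【Discussion】:

I noticed, but did not edit the example to reflect, that the print_word() loop should really be refactored to look only at substr()s bounded by the max and min key lengths. Right now, some time is wasted at the end of long words.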