【Question title】: Remove duplicate words from each line of a file in bash
【Posted】: 2021-12-30 21:09:55
【Question】:

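I would like to know how to remove duplicate words from each line of a file in bash, using sed, awk, or similar tools.

I have a file with 2000 lines, and I want to keep only one copy of each word on every line: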

OG0000005 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373 K00373  K00373
OG0000006 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374 K00374  K00374
OG0000007 K03089  K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089 K03089
OG0000008 K15554  K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599 K15599
OG0000009 K15555  K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555 K15555
OG0000010 K00817  K09758 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817 K00817
OG0000011 K07267  K07267  K07267  K07267 K07267 K07267 K07267 K07267 K07267 K07267 K07267 K07267 K07267 K07267 K07267 K07267
OG0000012 K22397  K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714 K01714
OG0000013 K00370  K07812 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370 K00370

So the output should look like this:

OG0000005 K00373
OG0000006 K00374
OG0000007 K03089  
OG0000008 K15554  K15599 
OG0000009 K15555 
OG0000010 K00817  K09758

I tried:

sort file | uniq

while read line
do
sort && uniq
done < file

    Tags: bash awk sed


    【Solution 1】:

    Another solution, without sed or awk and suitable if you do not care about the original order of the words, could be:

    cat file | xargs -I _ sh -c "echo _ | tr ' ' '\n' | sort | uniq | tr '\n' ' '; echo"
    

    which outputs:

    K00373 OG0000005 
    K00374 OG0000006 
    K03089 OG0000007 
    K15554 K15599 OG0000008 
    K15555 OG0000009 
    K00817 K09758 OG0000010 
    K07267 OG0000011 
    K01714 K22397 OG0000012 
    K00370 K07812 OG0000013 
    

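    Otherwise, if the first word has a special meaning and you want to keep it in place, a solution could combine cut and paste like this (note that --complement is a GNU cut option and <( ) is Bash process substitution):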

    cat file | cut -d' ' -f1 --complement | xargs -I _ sh -c "echo _ | tr ' ' '\n' | sort | uniq | tr '\n' ' '; echo" | paste -d' ' <(cut -d' ' -f1 file) -
    

    which outputs:

    OG0000005 K00373 
    OG0000006 K00374 
    OG0000007 K03089 
    OG0000008 K15554 K15599 
    OG0000009 K15555 
    OG0000010 K00817 K09758 
    OG0000011 K07267 
    OG0000012 K01714 K22397 
    OG0000013 K00370 K07812 
    


      【Solution 2】:

      This might work for you (GNU sed):

      sed -E ':a;s/(( +\S+)\>.*)\2\>/\1/;ta' file
      

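      The substitution matches a word (together with its leading spaces) that is repeated later in the line, and replaces the whole match with everything that precedes the later copy, which deletes that duplicate.

      The ta command branches back to the :a label and repeats the substitution until it no longer matches.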
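As a rough illustration of that loop on one line from the question, the sketch below wraps the same command in a function. The name `dedupe_sed` is made up for this example, and the command assumes GNU sed, since `\S` and `\>` are GNU extensions.

```bash
#!/usr/bin/env bash
# dedupe_sed is a hypothetical wrapper name, not part of the original answer.
# Requires GNU sed: \S (non-whitespace) and \> (end of word) are GNU extensions.
dedupe_sed() {
    # :a      label marking the start of the loop
    # s///    drop one later copy of a word that already appeared in the line
    # ta      jump back to :a if the substitution changed something
    sed -E ':a;s/(( +\S+)\>.*)\2\>/\1/;ta'
}

# Sample line taken from the question's input.
printf '%s\n' 'OG0000008 K15554  K15599 K15599 K15599' | dedupe_sed
```

Each pass removes one later copy of a repeated word, so the sample line should end up as `OG0000008 K15554  K15599`, with the original double space left untouched.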


        【Solution 3】:

        A pure Bash solution could be:

        while read -r line; do 
            read -r -a a <<< "${line}"
            declare -A b
            for i in "${a[@]:1}"; do b["$i"]=1; done
            printf '%s %s\n' "${a[0]}" "${!b[*]}"
            unset b
        done <file
        

        To make the Bash + sort + uniq approach work, you could do:

        while read -r line; do 
            read -r -a a <<< "${line}"
            re=$(tr ' ' '\n' <<< "${a[@]:1}" | sort | uniq | tr '\n' ' ' | xargs)
            printf "%s %s\n" "${a[0]}" "${re}"
        done <file  
        # if supported by your sort, you can also do 
        # re=$(tr ' ' '\n' <<< "${a[@]:1}" | sort -u | tr '\n' ' ' | xargs)
        

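        Either prints the following, up to the order of the words after the first column (associative-array keys come back in no guaranteed order, and the sort variant lists them alphabetically, for example K01714 K22397 on the OG0000012 line):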

        OG0000005 K00373 
        OG0000006 K00374 
        OG0000007 K03089 
        OG0000008 K15554 K15599 
        OG0000009 K15555 
        OG0000010 K00817 K09758 
        OG0000011 K07267 
        OG0000012 K22397 K01714 
        OG0000013 K00370 K07812 
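If the order of first appearance matters, note that Bash does not guarantee any particular order for associative-array keys, and `sort` orders the words alphabetically. A minimal sketch that also records the words in an indexed array as they are first seen could look like the following. The helper name `dedupe_line` is hypothetical, and associative arrays assume Bash 4 or later.

```bash
#!/usr/bin/env bash
# dedupe_line is a hypothetical helper, not part of the original answer.
# It keeps the first column, then every other word once, in first-seen order.
dedupe_line() {
    local -a words out
    local -A seen=()              # associative array: needs Bash 4+
    local w
    read -r -a words <<< "$1"     # split the line on runs of whitespace
    out=("${words[0]}")           # the first column is always kept
    for w in "${words[@]:1}"; do
        if [[ -z ${seen[$w]+x} ]]; then   # word not seen yet on this line
            seen[$w]=1
            out+=("$w")
        fi
    done
    printf '%s\n' "${out[*]}"     # join with single spaces
}

# Two sample lines based on the question's input.
while read -r line; do
    dedupe_line "$line"
done <<'EOF'
OG0000012 K22397  K01714 K01714 K22397
OG0000010 K00817  K09758 K00817 K00817
EOF
```

The associative array is used only for the membership test, while the indexed array `out` fixes the output order, so the two sample lines should come out as `OG0000012 K22397 K01714` and `OG0000010 K00817 K09758`.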
        


          【Solution 4】:

          You can use this awk solution:

          awk '
          {
             delete seen
             printf "%s", $1
             for (i=2; i<=NF; ++i)
                if (!seen[$i]++)
                   printf "%s", OFS $i
             print ""
          }' file
          
          OG0000005 K00373
          OG0000006 K00374
          OG0000007 K03089
          OG0000008 K15554 K15599
          OG0000009 K15555
          OG0000010 K00817 K09758
          OG0000011 K07267
          OG0000012 K22397 K01714
          OG0000013 K00370 K07812
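The deduplication rests on the `!seen[$i]++` test: the post-increment returns the previous count, so the condition is true only the first time a word appears on a line, and `delete seen` resets the counts for each new line. Below is a small sketch of the same logic, wrapped in a function with the made-up name `dedupe_awk` and fed two lines from the question. Deleting a whole array with `delete seen` is assumed to be available, which holds for gawk, mawk, and most current awk implementations.

```bash
#!/usr/bin/env bash
# dedupe_awk is a hypothetical wrapper name, not part of the original answer.
dedupe_awk() {
    awk '{
        delete seen                  # forget the words of the previous line
        out = $1                     # the first column is always kept
        for (i = 2; i <= NF; ++i)
            if (!seen[$i]++)         # true only on the first occurrence
                out = out OFS $i
        print out
    }' "$@"
}

# Two sample lines taken from the question's input.
printf '%s\n' \
    'OG0000010 K00817  K09758 K00817 K00817' \
    'OG0000011 K07267  K07267  K07267' | dedupe_awk
```

Because awk splits fields on runs of whitespace and the words are rejoined with `OFS`, the doubled spaces in the input collapse to single spaces in the output.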
          
