【Question title】: Get word between quotes
【Posted】: 2012-12-21 13:14:20
【Question】:

I have x lines like this:

Unable to find latest released revision of 'CONTRIB_046578'.   

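I need to extract the word between the quotes that follow "revision of", which is CONTRIB_046578 in this example, and, if possible, count the occurrences of that word using grep, sed, or any other command.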

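【Comments】:

  • Have you made any attempt yourself?
  • Is the word repeated? Are there other lines in between that need to be discarded?
  • Instead of finding the word between ' ', how can I find the word between "revision of '" and "'"?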

Tags: linux unix sed awk grep


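【Solution 1】:

The cleanest solution is grep -Po "(?<=')[^']+(?=')" (the -P option, Perl-compatible regular expressions, is a GNU grep extension).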

$ cat file
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'foo'
Unable to find latest released revision of 'bar'
Unable to find latest released revision of 'CONTRIB_046578'

# Print occurrences
$ grep -Po "(?<=')[^']+(?=')" file
CONTRIB_046578
foo
bar
CONTRIB_046578

# Count matching lines (equal to the number of occurrences here, since each line holds one quoted word)
$ grep -Pc "(?<=')[^']+(?=')" file
4

# Count occurrences of each unique word
$ grep -Po "(?<=')[^']+(?=')" file | sort | uniq -c 
2 CONTRIB_046578
1 bar
1 foo
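
A side note on the counting step, as a minimal sketch: grep -c reports the number of matching lines, not the number of matches, so the two differ as soon as a line holds more than one quoted word. The sample input and the helper names count_lines and count_matches below are made up for illustration. Plain -o is used instead of -P so the sketch also runs on greps without PCRE support, which means the quotes stay part of the matched text.

```shell
# Sample input (made up for illustration): the first line holds two quoted words.
tmp=$(mktemp)
printf "%s\n" "revision of 'foo' replaced by 'bar'" "revision of 'foo'" > "$tmp"

# Hypothetical helpers, not part of the original answer.
count_lines()   { grep -c "'[^']*'" "$1"; }           # lines with at least one quoted word
count_matches() { grep -o "'[^']*'" "$1" | wc -l; }   # every quoted word, one per output line

count_lines "$tmp"     # 2 matching lines
count_matches "$tmp"   # 3 matches
rm -f "$tmp"
```

For the input in the answer above the two numbers agree, because every line contains exactly one quoted word.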

【Comments】:

    【Solution 2】:

    All you need is a very simple awk script to count the occurrences of each word between the quotes:

    awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file
    

    Using @anubhava's test input file:

    $ cat file
    Unable to find latest released revision of 'CONTRIB_046572'
    Unable to find latest released revision of 'CONTRIB_046578'
    Unable to find latest released revision of 'CONTRIB_046579'
    Unable to find latest released revision of 'CONTRIB_046570'
    Unable to find latest released revision of 'CONTRIB_046579'
    Unable to find latest released revision of 'CONTRIB_046572'
    Unable to find latest released revision of 'CONTRIB_046579'
    $
    $ awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file
    CONTRIB_046578 1
    CONTRIB_046579 3
    CONTRIB_046570 1
    CONTRIB_046572 2
    

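    【Comments】:

    • Instead of finding the word between ' ', how can I find the word between "revision of '" and "'"?
    • There are many options, depending on what your input looks like and which false matches you are trying to avoid. One approach is awk -F "(^.*revision of '|'[^']*$)" '{c[$2]++} END{for (w in c) print w,c[w]}' file. If that doesn't work for you, tell us why and provide a more representative input file.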
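
Another way to anchor on "revision of '" is awk's match() function, which also skips lines that do not contain the phrase at all. This is only a sketch under assumptions: the function name count_revisions is hypothetical, and it assumes the word itself never contains a single quote. \047 is the octal escape for a single quote, which keeps the shell quoting simple.

```shell
# Hypothetical helper: count the words that follow "revision of '" and
# ignore every line that does not contain that phrase.
count_revisions() {
  awk '
    match($0, /revision of \047[^\047]*\047/) {
      # The prefix "revision of \047" is 13 characters long; the match also
      # includes the closing quote, so the word itself is RLENGTH - 14 long.
      word = substr($0, RSTART + 13, RLENGTH - 14)
      count[word]++
    }
    END { for (word in count) print word, count[word] }
  ' "$1" | sort
}
```

Calling count_revisions file prints one "word count" pair per line. The trailing sort is there only because the iteration order of for (word in count) is unspecified.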
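    【Solution 3】:

    Here is an awk script you can use to extract each word in single quotes and count how often it occurs (awk has no lazy quantifiers, so a plain .* is used):

    awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*'"'/ ) cnt[$i]++;}} 
          END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile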
    

    Test:

    cat infile
    Unable to find latest released revision of 'CONTRIB_046572'
    Unable to find latest released revision of 'CONTRIB_046578'
    Unable to find latest released revision of 'CONTRIB_046579'
    Unable to find latest released revision of 'CONTRIB_046570'
    Unable to find latest released revision of 'CONTRIB_046579'
    Unable to find latest released revision of 'CONTRIB_046572'
    Unable to find latest released revision of 'CONTRIB_046579'
    

    Output:

    awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*'"'/ ) cnt[$i]++;}} 
          END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile
    
    CONTRIB_046579 3
    CONTRIB_046578 1
    CONTRIB_046570 1
    CONTRIB_046572 2
    

    【Comments】:

      【Solution 4】:
      sed -n "s/.*'\([^']*\)'.*/\1/p" myfile.txt
      

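      【Comments】:

        【Solution 5】:

        Assumptions:

        • Each word can appear more than once, and the OP wants to count how many times each word appears.
        • There are no other lines in the file.

        Input file: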

        $ cat test.txt 
        Unable to find latest released revision of 'CONTRIB_046578'.
        Unable to find latest released revision of 'CONTRIB_046572'.
        Unable to find latest released revision of 'CONTRIB_046579'.
        Unable to find latest released revision of 'CONTRIB_046570'.
        Unable to find latest released revision of 'CONTRIB_046572'.
        Unable to find latest released revision of 'CONTRIB_046578'.
        

        Shell command to filter and count the words:

        $ sed "s/.*'\(.*\)'.*/\1/" test.txt | sort | uniq -c
          1 CONTRIB_046570
          2 CONTRIB_046572
          2 CONTRIB_046578
          1 CONTRIB_046579
        

        【Comments】:

        • I get a segmentation fault error. There are other lines in the file, and by the way, the search should be between "revision of '" and "'".
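
One way to address both points in that comment, sketched with the same sed approach: put the literal text "revision of '" in the pattern, and combine -n with the p flag so that lines without a match are not printed at all. The function name count_revisions_sed is hypothetical, and the sketch assumes the word never contains a single quote.

```shell
# Hypothetical helper: print only the word between "revision of '" and the
# next "'", drop every non-matching line, then count each distinct word.
count_revisions_sed() {
  sed -n "s/.*revision of '\([^']*\)'.*/\1/p" "$1" | sort | uniq -c
}
```

Using [^']* rather than .* inside the group stops the capture at the first closing quote, so any later quoted text on the same line does not end up in the result.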
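        【Solution 6】:

        If the test file below is representative of the file in the actual problem, then the following may be useful.

        Assuming every line in the test file is homogeneous, that is, well-formed and made up of 8 columns (or fields), a handy solution using the cut command is as follows:

        File: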

        Unable to find latest released revision of 'CONTRIB_046572'
        Unable to find latest released revision of 'CONTRIB_046578'
        Unable to find latest released revision of 'CONTRIB_046579'
        Unable to find latest released revision of 'CONTRIB_046570'
        Unable to find latest released revision of 'CONTRIB_046579'
        Unable to find latest released revision of 'CONTRIB_046572'
        Unable to find latest released revision of 'CONTRIB_046579'
        

        Code:

        cut -d ' ' -f 8 file | tr -d "'" | sort | uniq -c
        

        Output:

        1 CONTRIB_046570
        2 CONTRIB_046572
        1 CONTRIB_046578
        3 CONTRIB_046579
        

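        Note on the code: the default delimiter that cut uses to separate fields is the tab character, but since the fields here are separated by single spaces, we specify the option -d ' '. The rest of the code is similar to the other answers and is not repeated here.

        General note: if the file is not well-formed, as already mentioned above, this code may not produce the desired output.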
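
As an illustration of that caveat: the sample line in the question ends with a period ('CONTRIB_046578'.), so field 8 with a space delimiter would keep the trailing dot after tr removes the quotes. A variant that may be a little more tolerant, offered only as a sketch, splits on the quote character instead. The function name count_quoted_cut is hypothetical, and the sketch assumes the first quoted word on each line is the one of interest.

```shell
# Hypothetical helper: use the single quote as the field delimiter, so the
# word is always field 2 regardless of how many spaces precede it.
# -s suppresses lines that contain no quote at all.
count_quoted_cut() {
  cut -s -d "'" -f 2 "$1" | sort | uniq -c
}
```

With this delimiter, anything after the closing quote (such as the trailing period) falls into field 3 and is discarded.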

        【Comments】:
