【Question title】: How can I extract all quotations in a text?
【Posted】: 2010-09-25 12:06:41
【Question】:

I'm looking for a SimpleGrepSedPerlOrPythonOneLiner that outputs all quotations in a text.


Example 1:

echo “HAL,” noted Frank, “said that everything was going extremely well.” | SimpleGrepSedPerlOrPythonOneLiner

stdout:

"HAL,"
"said that everything was going extremely well."

Example 2:

cat MicrosoftWindowsXPEula.txt | SimpleGrepSedPerlOrPythonOneLiner

stdout:

"EULA"
"Software"
"Workstation Computer"
"Device"
"DRM"

etc.

(link to the corresponding text).

【Comments】:

    Tags: perl sed grep quotations


    【Solution 1】:
    grep -o "\"[^\"]*\""
    

    This greps for " + anything except a quote, any number of times + "

    The -o makes it output only the matched text, not the whole line.
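
For readers who prefer the Python route mentioned in the question, the same idea can be sketched as below. This is a minimal illustration, not part of the original answer: the helper name `extract_quotations` and the `CURLY` pattern are assumptions introduced here. The `CURLY` pattern is included because Example 1 in the question uses typographic quotes, which a pattern written with straight quotes does not match.

```python
import re

# Same pattern as the grep command: a quote, any non-quote characters, a quote.
STRAIGHT = re.compile(r'"[^"]*"')
# Assumption (not in the original answer): also support typographic quotes.
CURLY = re.compile(r'“[^”]*”')


def extract_quotations(text, pattern=STRAIGHT):
    """Mimic grep -o: scan line by line and return only the matched parts."""
    matches = []
    for line in text.splitlines():
        matches.extend(pattern.findall(line))
    return matches


sample = '"HAL," noted Frank, "said that everything was going extremely well."'
for quotation in extract_quotations(sample):
    print(quotation)
```

Passing `CURLY` as the second argument switches the hypothetical helper to typographic quotes; with the default pattern, text quoted only with “ and ” yields no matches.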

    【Comments】:

    • On Windows the '^' must be escaped: cat eula.txt | grep -o "\"[^^\"]*\""
    【Solution 2】:

    No regex solution will work if you have nested quotes, but for your examples this works well:

    $ echo \"HAL,\" noted Frank, \"said that everything was going extremely well\" |
        perl -n -e 'while (m/(".*?")/g) { print $1."\n"; }'
    "HAL,"
    "said that everything was going extremely well"
    
    $ cat eula.txt| perl -n -e 'while (m/(".*?")/g) { print $1."\n"; }'
    "EULA"
    "online"
    "Software"
    "Workstation Computer"
    "Device"
    "multiplexing"
    "DRM"
    "Secure Content"
    "DRM Software"
    "Secure Content Owners"
    "DRM Upgrades"
    "WMFSDK"
    "Not For Resale"
    "NFR,"
    "Academic Edition"
    "AE,"
    "Qualified Educational User."
    "Exclusion of Incidental, Consequential and Certain Other Damages"
    "Restricted Rights"
    "Exclusion des dommages accessoires, indirects et de certains autres dommages"
    "Consumer rights"
    

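    【Comments】:

    • On Windows: cat eula.txt | perl -nE"say $1 while /(\"[^^\"]*\")/g"
    • cat eula.txt | perl -lne 'print for /(".*?")/g' Perl Golf FTW! ;)
    • Well, some regex engines do handle nested quotes, so some regex solutions can work :)
    • @brian Yes, but I didn't want to get into that, since I'm a bit busy and haven't dug into it deeply enough to explain it. :)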
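
To make the nested-quote caveat from this answer concrete, here is a small Python sketch of the same lazy pattern `".*?"`, applied line by line as `perl -n` does. The function name `quotes` and the `nested` sample sentence are made up for illustration. With straight quotes there is no way to tell an opening quote from a closing one, so the lazy match simply pairs each quote with the next one it finds.

```python
import re

# Lazy version used in the Perl one-liner: shortest possible "..." span.
LAZY = re.compile(r'".*?"')


def quotes(text):
    """Return every "..." span, processing the text one line at a time."""
    found = []
    for line in text.splitlines():
        found.extend(LAZY.findall(line))
    return found


flat = '"HAL," noted Frank, "said that everything was going extremely well"'
# Hypothetical input with a quotation nested inside another quotation.
nested = 'She wrote "he said "stop" twice" in her notes'

print(quotes(flat))
print(quotes(nested))  # the outer quotation is split into two wrong pieces
```

On the flat sample the result is the two expected quotations; on the nested sample the pattern returns `"he said "` and `" twice"`, neither of which is a quotation the writer intended.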
    【Solution 3】:

    I like this one:

    perl -ne 'print "$_\n" foreach /"((?>[^"\\]|\\+[^"]|\\(?:\\\\)*")*)"/g;'
    

    It's a little verbose, but it handles escaped quotes and backtracking a lot better than the simplest implementation. What it says is:

    my $re = qr{
       "               # Begin it with literal quote
       ( 
         (?>           # prevent backtracking once the alternation has been
                       # satisfied. It either agrees or it does not. This expression
                       # only needs one direction, or we fail out of the branch
    
             [^"\\]    # a character that is not a dquote or a backslash
         |   \\+       # OR if a backslash, then any number of backslashes followed by 
             [^"]      # something that is not a quote
         |   \\        # OR again a backslash
             (?>\\\\)* # followed by any number of *pairs* of backslashes (as units)
             "         # and a quote
         )*            # any number of *set* qualifying phrases
      )                # all batched up together
      "                # Ended by a literal quote
    }x;
    

    If you don't need that much power (say the text is likely to be plain dialogue rather than structured quotes), then

    /"([^"]*)"/ 
    

    probably works about as well as anything else.
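
The difference between the two approaches shows up as soon as the input contains a backslash-escaped quote. The Python sketch below is not a port of the Perl regex above. It swaps in a simpler, commonly used pattern (any character that is not a quote or backslash, or a backslash followed by any character) that makes the same point. The names `ESCAPED`, `NAIVE`, and `quoted_bodies`, and the sample string, are assumptions for illustration. Like the Perl capture group, it returns the text between the quotes.

```python
import re

# Body is a run of: non-quote, non-backslash characters, or a backslash
# followed by any single character (so \" and \\ are consumed as units).
ESCAPED = re.compile(r'"((?:[^"\\]|\\.)*)"')
# The simple pattern from the end of the answer, for comparison.
NAIVE = re.compile(r'"([^"]*)"')


def quoted_bodies(text, pattern=ESCAPED):
    """Return the text between each pair of double quotes."""
    return pattern.findall(text)


# Hypothetical sample containing escaped quotes and an escaped backslash.
sample = r'print "say \"hi\"" and "a\\" then "plain"'

print(quoted_bodies(sample))         # escape-aware: three quotations
print(quoted_bodies(sample, NAIVE))  # naive: breaks at the first \"
```

The escape-aware pattern keeps `say \"hi\"` together as one quotation, while the naive one stops at the first escaped quote and produces an extra, empty match.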

    【Comments】:

      【Solution 4】:
      grep -o '"[^"]*"' file
      

      The '-o' option prints only the text that matches the pattern, not the whole line.

      【Comments】:
