【Question title】: Ruby regex extracting words
【Posted】: 2011-12-31 00:54:32
【Question】:

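I am currently struggling to come up with a regex that splits a string into words, where a word is defined as a sequence of characters either surrounded by whitespace or enclosed in double quotes. I am using String#scan.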

For example, the string:

'   hello "my name" is    "Tom"'

should match the words:

hello
my name
is
Tom

I managed to match the words enclosed in double quotes with:

/"([^\"]*)"/

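But I can't figure out how to also handle the words surrounded by whitespace, so that I get "hello", "is", and "Tom" without messing up "my name".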
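To make the gap concrete, here is a minimal sketch of what the quote-only pattern returns with String#scan on the example string. The variable names are illustrative and not part of the original question.

```ruby
text = '   hello "my name" is    "Tom"'

# The pattern from the question has one capture group, so scan returns
# one single-element array per match; flatten turns that into a flat list.
quoted_only = text.scan(/"([^\"]*)"/).flatten

p quoted_only
```

Only the quoted words come back; "hello" and "is" are skipped because nothing in the pattern matches unquoted text.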

Any help with this would be much appreciated!

【Comments】:

    Tags: ruby regex string word


    【Answer 1】:
    result = '   hello "my name" is    "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)
    

    will work for you. It will print

    => ["", "hello", "\"my name\"", "is", "\"Tom\""]
    

    Ignore the empty string.

    Explanation

    \s            # Match a single whitespace character (space, tab, or line break)
       +             # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    (?=           # Assert that the regex below can be matched, starting at this position (positive lookahead)
       (?:           # Match the regular expression below (non-capturing group)
          [^"]          # Match any character that is NOT a double quote
             *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
          "             # Match a double quote literally
          [^"]          # Match any character that is NOT a double quote
             *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
          "             # Match a double quote literally
       )*            # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
       [^"]          # Match any character that is NOT a double quote
          *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
       $             # Assert position at the end of a line (at the end of the string or before a line break character)
    )
    

    You can use reject like this to drop the empty string

    result = '   hello "my name" is    "Tom"'
                .split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}
    

    which prints

    => ["hello", "\"my name\"", "is", "\"Tom\""]
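The lookahead in this answer can be read as: a run of whitespace is a split point only when the rest of the text contains an even number of double quotes, meaning the whitespace is not inside an open quote. The sketch below checks that reading by hand for a single-line string. The helper `quotes_after_each_gap` is hypothetical and written only for this illustration.

```ruby
text = '   hello "my name" is    "Tom"'

# Hypothetical helper (not from the answer): for every run of whitespace,
# record its starting offset and how many double quotes remain to its right.
def quotes_after_each_gap(text)
  gaps = []
  text.scan(/\s+/) do
    match = Regexp.last_match
    gaps << [match.begin(0), match.post_match.count('"')]
  end
  gaps
end

quotes_after_each_gap(text).each do |offset, remaining|
  verdict = remaining.even? ? 'split here' : 'inside quotes, keep'
  puts "whitespace at #{offset}: #{remaining} quotes remain -> #{verdict}"
end
```

The only whitespace followed by an odd number of quotes is the space inside "my name", which is exactly the one the split leaves alone.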
    

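    【Comments】:

    • Great dissection of the regex. Very helpful.
    • If you need to keep the quoted words together, this is a nice solution! +1
    • Impressive use of regex! How would you tweak this answer to remove the quotes around my name and Tom, so that the resulting array looks like ["hello", "my name", "is", "Tom"] instead of ["hello", "\"my name\"", "is", "\"Tom\""]? IMHO, the solution proposed by @DarkCastle is better for a few reasons. See my comment on that answer.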
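One possible way to get the unquoted array asked about in the last comment is to post-process the split result. This is a sketch, not the answerer's method, and it assumes double quotes appear only as delimiters, because String#delete removes every quote character in a token.

```ruby
text = '   hello "my name" is    "Tom"'

words = text
  .split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/) # split outside quotes only
  .reject(&:empty?)                        # drop the leading empty string
  .map { |word| word.delete('"') }         # assumption: quotes are delimiters only

p words
```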
    【Answer 2】:
    text = '   hello "my name" is    "Tom"'
    
    text.scan(/\s*("([^"]+)"|\w+)\s*/).each {|match| puts match[1] || match[0]}
    

    Produces:

    hello
    my name
    is
    Tom
    

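    Explanation:

    0 or more whitespace characters, followed by

    either

    some words inside double quotes, or

    a single word,

    followed by 0 or more whitespace characters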
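The explanation above is easier to follow when the raw capture groups are visible. In this sketch, the first group holds the whole token (quotes included) and the second holds the text inside the quotes, or nil for an unquoted word. The variable names are illustrative.

```ruby
text = '   hello "my name" is    "Tom"'

# Each match yields [whole_token, text_inside_quotes_or_nil].
pairs = text.scan(/\s*("([^"]+)"|\w+)\s*/)
p pairs

# Prefer the inner capture when present, otherwise fall back to the token.
words = pairs.map { |token, inner| inner || token }
p words
```

Since `[^"]+` requires at least one character, an empty pair of quotes (`""`) would not be matched by this pattern.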

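    【Comments】:

    • What the OP asked for is not possible without a lookahead.
    • I meant the original solution, where only the regex is used to do the splitting. Any post-processing is not what I had in mind.
    • This solution is better (easier to read, needs little explanation, and does not keep the quotes) and faster (by about one second over a million iterations) when modified slightly as follows: text.scan(/\s*("([^"]+)"|\w+)\s*/).map { |match| match[1].nil? ? match[0] : match[1] }, which returns ["hello", "my name", "is", "Tom"]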
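The timing claim in the last comment can be checked locally with the standard Benchmark module. The sketch below is only a rough harness: the iteration count is an arbitrary assumption, results depend on the Ruby version and machine, and the two approaches are not strictly equivalent because the split version keeps the quotes.

```ruby
require 'benchmark'

text = '   hello "my name" is    "Tom"'
iterations = 10_000 # arbitrary; raise it for a steadier measurement

split_time = Benchmark.realtime do
  iterations.times do
    text.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject(&:empty?)
  end
end

scan_time = Benchmark.realtime do
  iterations.times do
    text.scan(/\s*("([^"]+)"|\w+)\s*/).map { |token, inner| inner || token }
  end
end

puts format('split with lookahead: %.4fs', split_time)
puts format('scan with captures:   %.4fs', scan_time)
```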
    【Answer 3】:

    You could try this regex:

    /\b(\w+)\b/
    

    It uses \b to find word boundaries. The site http://rubular.com/ is very helpful for testing.
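For comparison, here is a quick check of what this pattern returns on the question's input. It is a sketch with an illustrative variable name.

```ruby
text = '   hello "my name" is    "Tom"'

# \b(\w+)\b matches every run of word characters, regardless of quotes.
words = text.scan(/\b(\w+)\b/).flatten

p words
```

The quoted phrase comes back as two separate words, "my" and "name", so this pattern alone does not meet the requirement in the question.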

    【Comments】:

    • This doesn't work. It makes no attempt to capture what is between the quotes as a single match.