Ruby regex extracting words: answers

【Question title】: Ruby regex extracting words
【Posted】: 2011-12-31 00:54:32
【Question】:

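I'm currently struggling to come up with a regex that can split a string into words, where a word is defined as a sequence of characters surrounded by whitespace or enclosed in double quotes. I'm using String#scan.

For example, the string: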

'   hello "my name" is    "Tom"'

should match the words:

hello
my name
is
Tom

I managed to match the words enclosed in double quotes with:

/"([^\"]*)"/

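But I can't figure out how to also handle the words surrounded by whitespace, so that I get 'hello', 'is', and 'Tom' without messing up 'my name'.

Any help with this would be much appreciated!

【Question comments】:

Tags: ruby regex string word

【Solution 1】: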

result = '   hello "my name" is    "Tom"'.split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/)

will work for you. It prints

=> ["", "hello", "\"my name\"", "is", "\"Tom\""]

Just ignore the empty string.

Explanation

"
\s            # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
   +             # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?=           # Assert that the regex below can be matched, starting at this position (positive lookahead)
   (?:           # Match the regular expression below (non-capturing group)
      [^"]          # Match any character that is NOT a “"”
         *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      "             # Match the character “"” literally
      [^"]          # Match any character that is NOT a “"”
         *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      "             # Match the character “"” literally
   )*            # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   [^"]          # Match any character that is NOT a “"”
      *             # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   $             # Assert position at the end of a line (at the end of the string or before a line break character)
)
"

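Read as a whole, the lookahead only lets the split happen at whitespace that is followed by an even number of double quotes, i.e. whitespace that is not inside a quoted section. The sketch below makes that visible for a single-line string; `split_points` and `even_quotes_after?` are hypothetical helper names introduced here for illustration, not part of the answer.

```ruby
SPLIT_RE = /\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/

# Hypothetical helper: start index of every whitespace run the regex splits on.
def split_points(text)
  points = []
  text.scan(SPLIT_RE) { points << Regexp.last_match.begin(0) }
  points
end

# Hypothetical helper: the lookahead's condition written as a plain count
# (assumes a single-line string).
def even_quotes_after?(text, index)
  text[index..-1].count('"').even?
end

text = '   hello "my name" is    "Tom"'
p split_points(text)            # prints [0, 8, 18, 21]
p even_quotes_after?(text, 13)  # prints false: three quotes follow the space in "my name"
```

The space at index 12 (between my and name) never shows up as a split point, because an odd number of quotes follows it.
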
You can use reject like this to drop the empty strings

result = '   hello "my name" is    "Tom"'
            .split(/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/).reject {|s| s.empty?}

which prints

=> ["hello", "\"my name\"", "is", "\"Tom\""]
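If the surrounding quotes should not be kept, one possible follow-up is to strip a wrapping pair of quotes from each piece after splitting. This is only a sketch on top of the answer's regex; `split_words` is a hypothetical name, and the post-processing step is an addition, not something the answer itself does.

```ruby
SPLIT_RE = /\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/

# Hypothetical helper: split as in the answer, drop empty strings,
# then remove one pair of double quotes wrapping a whole piece.
def split_words(text)
  text.split(SPLIT_RE)
      .reject(&:empty?)
      .map { |word| word.sub(/\A"(.*)"\z/m, '\1') }
end

p split_words('   hello "my name" is    "Tom"')
# prints ["hello", "my name", "is", "Tom"]
```
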

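【Comments】:

Great dissection of the regex. Very helpful.
If you need to keep the quoted words together, this is a nice solution! +1
Impressive use of regex! How would you adjust this answer so that it does not keep the quotes around my name and Tom? -- i.e. so that the resulting array looks like ["hello", "my name", "is", "Tom"] rather than ["hello", "\"my name\"", "is", "\"Tom\""] -- With all due respect, I believe the solution proposed by @DarkCastle is better for several reasons. See my comment on that answer.

【Solution 2】: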

text = '   hello "my name" is    "Tom"'

text.scan(/\s*("([^"]+)"|\w+)\s*/).each {|match| puts match[1] || match[0]}

Produces:

hello
my name
is
Tom

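Explanation:

zero or more whitespace characters, followed by

either

some words inside double quotes, or

a single word

followed by zero or more whitespace characters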
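The same "quoted phrase or single word" alternation can also return an array instead of printing. The following is a minimal sketch, not the answer's exact code: `extract_words` is a hypothetical name, and `[^\s"]+` is swapped in for `\w+` on the assumption that unquoted words may contain non-word characters such as hyphens.

```ruby
# Hypothetical helper: each match fills exactly one of the two capture
# groups, so flatten + compact leaves one string per word.
def extract_words(text)
  text.scan(/"([^"]+)"|([^\s"]+)/).flatten.compact
end

p extract_words('   hello "my name" is    "Tom"')
# prints ["hello", "my name", "is", "Tom"]
```

Because String#scan skips over anything that does not match, the explicit `\s*` on either side is not needed in this variant.
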

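【Comments】:

What the OP asked for is not possible without lookahead.
I was referring to the original solution, where only a regex is used to do the splitting. Any post-processing is not what I had in mind.
This solution is better (easier to read; doesn't need much explanation; and doesn't keep the quotes) and faster (by about a second over a million iterations) if modified slightly as follows: text.scan(/\s*("([^"]+)"|\w+)\s*/).map { |match| match[1].nil? ? match[0] : match[1] } -- result: ["hello", "my name", "is", "Tom"]

【Solution 3】:

You can try this regex: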

/\b(\w+)\b/

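It uses \b to find word boundaries. This site, http://rubular.com/, is very helpful.

【Comments】:

This doesn't work. It makes no attempt to capture what is between the quotes as a single match.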
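A quick check illustrates the objection: with the question's sample string, the word-boundary regex returns my and name as separate matches. `word_boundary_scan` is a hypothetical helper name used only for this demonstration.

```ruby
# Hypothetical helper: apply the regex from this answer with String#scan.
def word_boundary_scan(text)
  text.scan(/\b(\w+)\b/).flatten
end

p word_boundary_scan('   hello "my name" is    "Tom"')
# prints ["hello", "my", "name", "is", "Tom"]
```
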