【Question Title】:perl parse misformatted bracketed text
【Posted】:2011-11-18 19:33:47
【Question】:

I have a string of text divided into chunks, each enclosed in square brackets:

[pX textX/labelX] [pY textY/labelY] [pZ textZ/labelZ] [textA/labelA]

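Sometimes a chunk does not start with a p tag (like the last one above).

My problem is that I need to capture every chunk. That is easy in the normal case, but sometimes the input is malformed: some chunks may have only one bracket, or none at all. So it might look like this: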

 [pX textX/labelX] pY textY/labelY] textZ/labelZ

But it should be like this:

 [pX textX/labelX] [pY textY/labelY] [textZ/labelZ]

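The problem does not involve nested brackets. After digging into a large number of other people's regex solutions more deeply than I ever have before (I am new to regex), downloading cheat sheets, and getting a regex tool (Expresso), I still don't know how to do this. Any ideas? Maybe regex won't work here. But then how is this kind of problem solved? I imagine it is not a very unusual problem.

EDIT

Here is a concrete example: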

$data= "[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC";

Here is a very compact solution from @FailedDev:

while ($data =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    # matched text = $&
}

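But I think two points need to be added to make the problem clear:

  1. Some chunks have no brackets at all.
  2. ,/PUNC w#hm/CC_PRP_MP3] consists of two independent chunks that need to be separated.

However, since this case is fixed (that is, a punctuation mark followed by a text/label pattern with only a closing bracket on the right), I hard-coded it into the solution like this: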

my @stuff;
while ($data =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    my $chunk = $&;                        # save the match before running another regex
    if ($chunk =~ m/^\S\/PUNC .*\]/) {     # a ",/PUNC" mark followed by a "phrase]"
        my @bits = split(/ /, $chunk);     # split on spaces
        push(@stuff, $bits[0]);            # the first piece before the space is the PUNC token
        push(@stuff, substr($chunk, 7));   # the rest, after the 7-character ",/PUNC " prefix
    }
    else { push(@stuff, $chunk); }
}
foreach (@stuff) { print "$_\n"; }

Trying it on the example I added in the edit, it works well except for one problem. The last ./PUNC is left out, so the output is:

[VP sysmH/VBD_MS3]
[PP ll#/IN_DET Axryn/NNS_MP]
,/PUNC
w#hm/CC_PRP_MP3]
[NP AEDA'/NN]
,/PUNC
[PP b#/IN m/NN_FS]
[NP >HyAnA/NN]

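How can I keep the last chunk?

【Comments】:

  • Isn't this the same as your previous question: HERE
  • No. That one only dealt with chunks that either have both brackets or have none. This one also covers chunks where one bracket is present and the other is missing.
  • Oops, I was indeed wrong. My apologies.

Tags: regex string perl parsing tags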


【Solution 1】:

You can use this:

/(?:\[[^[]*?]|[^[ ].*?]|\[[^[ ]*)/

Assuming your string looks like this:

[pX textX/labelX] pY textY/labelY]  pY textY/labelY]  pY textY/labelY]  [pY textY/labelY] [3940-823490-2 [30-94823049 [32904823498]

It will not work for input such as: pY [[[textY/labelY]

Perl-specific solution:

while ($subject =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    # matched text = $&
}
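
As a rough usage sketch (not part of the original answer), the same pattern can also be applied in list context to collect every match at once. Because the pattern contains only a non-capturing group, `m//g` in list context returns the whole matched strings. The variable names here are illustrative.

```perl
use strict;
use warnings;

# Sample string from the answer above.
my $subject = '[pX textX/labelX] pY textY/labelY]  pY textY/labelY]  pY textY/labelY]  [pY textY/labelY] [3940-823490-2 [30-94823049 [32904823498]';

# No capturing groups, so list-context /g returns each full match.
my @chunks = $subject =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g;

print "<$_>\n" for @chunks;
```

The angle brackets in the output only make it easy to see that the matches carry no surrounding whitespace.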

Update:

/(?:\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s+[^[]+?(?:\s+|$))/

This works for your updated string, but you should trim whitespace from the results if needed.

Update 2:

/(\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s*[^[]+?(?:\s+|$))/

I suggest opening a separate question, since your original question is completely different from this latest one.

"
(                 # Match the regular expression below and capture its match into backreference number 1
                     # Match either the regular expression below (attempting the next alternative only if this one fails)
      \[                # Match the character “[” literally
      [^[]              # Match any character that is NOT a “[”
         *?                # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
      ]                 # Match the character “]” literally
   |                 # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
      [^[ ]             # Match a single character NOT present in the list “[ ”
      .                 # Match any single character that is not a line break character
         *?                # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
      ]                 # Match the character “]” literally
   |                 # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
      \[                # Match the character “[” literally
      [^[ ]             # Match a single character NOT present in the list “[ ”
         *                 # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   |                 # Or match regular expression number 4 below (the entire group fails if this one fails to match)
      \s                # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
         *                 # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
      [^[]              # Match any character that is NOT a “[”
         +?                # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
      (?:               # Match the regular expression below
                           # Match either the regular expression below (attempting the next alternative only if this one fails)
            \s                # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
               +                 # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
         |                 # Or match regular expression number 2 below (the entire group fails if this one fails to match)
            $                 # Assert position at the end of the string (or before the line break at the end of the string, if any)
      )
)
"
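
The advice about trimming whitespace can be sketched as follows. This is only an illustration built on the Update 2 pattern and the asker's sample string; the string is quoted with `q{}` so that `$Arkp` is not interpolated, and the variable names are assumptions. On this input the loop should yield nine chunks, ending with `./PUNC`.

```perl
use strict;
use warnings;

# Sample input from the question; q{} avoids interpolating "$Arkp".
my $data = q{[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC};

my @chunks;
while ($data =~ /(\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s*[^[]+?(?:\s+|$))/g) {
    my $chunk = $1;
    $chunk =~ s/^\s+|\s+$//g;            # strip leading and trailing whitespace
    push @chunks, $chunk if length $chunk;
}

print "$_\n" for @chunks;
```

Note that the fourth alternative matches one whitespace-delimited token at a time, so an unbracketed multi-word chunk that follows a space (for example ` pY textY/labelY]`) would come back as two pieces rather than one.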

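【Comments】:

  • Thanks, this works for the example above. However, it does not cover the case with no brackets. I am editing the question based on your solution and have handled that case, but one more step is needed to finish.
  • This is nice, but the last chunk still somehow disappears. For example, if I try this, the last ./PUNC does not show up: [S w#/CC] [VP sy$Ark/VBD_MS3] "/PUNC ./PUNC
【Solution 2】: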
s{
   \[?
   (?: ([^\]/\s]+) \s+ )?
   ([^\]/\s]+)
   /
   ([^\]/\s]+)
   \]?
}{
   '[' .
   ( defined($1) ? "$1 " : '' ) .
   $2 .
   '/' .
   $3 .
   ']'
}xeg;
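
The answer gives no usage example, so here is a minimal sketch of applying this substitution to the simple malformed line from the question. It assumes the first character class is written as `[^\]/\s]+`, matching the other two; the variable names `$line` and `$fixed` are illustrative.

```perl
use strict;
use warnings;

# Malformed sample line from the question.
my $line = '[pX textX/labelX] pY textY/labelY] textZ/labelZ';

# Copy the line, then rewrite every tag/text/label group with both brackets.
(my $fixed = $line) =~ s{
   \[?                        # optional opening bracket
   (?: ([^\]/\s]+) \s+ )?     # optional leading tag such as "pX"
   ([^\]/\s]+)                # text
   /
   ([^\]/\s]+)                # label
   \]?                        # optional closing bracket
}{
   '[' . ( defined($1) ? "$1 " : '' ) . "$2/$3" . ']'
}xeg;

print "$fixed\n";
```

One design consequence worth keeping in mind: the pattern matches a single text/label pair, so a bracketed chunk holding several pairs, such as `[PP ll#/IN_DET Axryn/NNS_MP]`, would be rewritten as separate bracketed chunks.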

【Comments】:

    【Solution 3】:

    This is essentially the same process I applied to your previous problem; I only changed the map slightly.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    my $string= "[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m\$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC";
    
    my @items = split(/(\[.+?\])/, $string);
    
    my @new_items = map { 
                         if (/^\[.+\]$/) { # items in []
                            $_;
                         } 
                         elsif (/\s/) {
                            grep m/\w/, split(/\s+/); # use grep to eliminate the split results that are the empty string
                         }
                         else { # discard empty strings
                         }
                        } @items;
    
    print "--$_--\n" for @new_items;
    

    The output you get looks like this (the hyphens are only there to show that there is no leading or trailing whitespace):

    --[VP sysmH/VBD_MS3]--
    --[PP ll#/IN_DET Axryn/NNS_MP]--
    --,/PUNC--
    --w#hm/CC_PRP_MP3]--
    --[NP AEDA'/NN]--
    --,/PUNC--
    --[PP b#/IN m$Arkp/NN_FS]--
    --[NP >HyAnA/NN]--
    --./PUNC--
    

    I think this is the result you want to obtain. I don't know whether you will be happy with a solution that is not "regex only"...

    【Comments】:
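
The first step of the script above relies on a property of Perl's `split`: when the separator pattern contains capturing parentheses, the captured separators are returned along with the fields. The following small sketch, using a made-up input string, shows what `@items` looks like before the `map` runs, and why the `map` needs a branch that discards empty strings.

```perl
use strict;
use warnings;

# Hypothetical short input: one bracketed chunk followed by a bare token.
my $string = '[NP x/NN] ./PUNC';

# The capturing group keeps each bracketed chunk in the result list.
my @items = split(/(\[.+?\])/, $string);

# The string starts with a separator, so the first field is empty.
print "--$_--\n" for @items;
```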
