【Question title】: PHP sentence boundaries including empty lines?
【Posted】: 2016-03-07 07:10:47
【Question description】:

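This is an extension of the PHP sentences boundaries question on SO.

I would like to know how to change the regex so that line breaks are preserved.

The sample code splits some text into sentences, removes one sentence, and then puts the text back together: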

<?php
$re = '/# Split sentences on whitespace between them.
    (?<=                # Begin positive lookbehind.
      [.!?]             # Either an end of sentence punct,
    | [.!?][\'"]        # or end of sentence punct and quote.
    )                   # End positive lookbehind.
    (?<!                # Begin negative lookbehind.
      Mr\.              # Skip either "Mr."
    | Mrs\.             # or "Mrs.",
    | Ms\.              # or "Ms.",
    | Jr\.              # or "Jr.",
    | Dr\.              # or "Dr.",
    | Prof\.            # or "Prof.",
    | Sr\.              # or "Sr.",
    | T\.V\.A\.         # or "T.V.A.",
                        # or... (you get the idea).
    )                   # End negative lookbehind.
    [\s+|^$]            # Split on whitespace between sentences/empty lines.
    /ix';

$text = <<<EOL
This is paragraph one. This is sentence one. Sentence two!

This is paragraph two. This is sentence three. Sentence four!
EOL;

echo "\nBefore: \n" . $text . "\n";

$sentences = preg_split($re, $text, -1);

$sentences[1] = " "; // remove 'sentence one'

// put text back together
$text = implode( $sentences );

echo "\nAfter: \n" . $text . "\n";
?>

Running this, the output is:

Before: 
This is paragraph one. This is sentence one. Sentence two!

This is paragraph two. This is sentence three. Sentence four!

After: 
This is paragraph one. Sentence two!
This is paragraph two. This is sentence three. Sentence four!

I am trying to make the "After" text identical to the "Before" text, just with one sentence removed:

After: 
This is paragraph one. Sentence two!

This is paragraph two. This is sentence three. Sentence four!

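I was hoping this could be done with a regex tweak, but what am I missing?

【Question comments】:

  • It looks like this part of the regex is the problem: [\s+|^$] actually matches a single whitespace character or one of the literal symbols +, |, ^, $. Replace it with (?:\h+|^$) and I guess that will do it.
  • I think you can just remove the + after \s (or use \s{1} if you really need it to match exactly one), because \s+ is consuming the other whitespace. Basically you need array( "stuf", "\n", "stuff"); but I am not sure without testing it, and it is too complex to run in my head.

Tags: php regex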


【Solution 1】:

The end of the pattern should be replaced with:

  (?:\h+|^$)          # Split on whitespace between sentences\/empty lines.
/mix';

IDEONE demo

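Note that [\s+|^$] actually matches a single whitespace character (horizontal or vertical, such as a line break) or one of the literal symbols +, |, ^, $, because it is a character class.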

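Instead of a character class, a group is needed (preferably a non-capturing one here). Inside a group (marked with (...)), | acts as the alternation operator.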
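To make the difference concrete, here is a small sketch that tests single characters against both constructs. The variable names ($class, $group, $classHits, $groupHits) are only illustrative and are not from the original post.

```php
<?php
// Character class: +, |, ^ and $ are literal members of the set.
$class = '/[\s+|^$]/';
// Non-capturing group: | is alternation between \h+ and ^$.
$group = '/(?:\h+|^$)/m';

$classHits = [];
$groupHits = [];
foreach (['+', '|', '^', '$', ' '] as $ch) {
    $classHits[$ch] = preg_match($class, $ch); // 1 if the character matches
    $groupHits[$ch] = preg_match($group, $ch);
}

print_r($classHits); // every character matches the class
print_r($groupHits); // only the space matches the group
```

The class accepts all five characters, while the group only accepts the space.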

Instead of \s, I suggest using \h, which matches horizontal whitespace only (no line breaks).
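A quick check of how \s and \h treat a line break, as a minimal sketch (the variable names are made up for the illustration):

```php
<?php
$newline = "\n";

// \s covers vertical whitespace too, so it matches a line break.
$sMatchesNewline = preg_match('/\s/', $newline);
// \h is horizontal whitespace only (space, tab, ...), so it does not.
$hMatchesNewline = preg_match('/\h/', $newline);
// A tab is horizontal whitespace.
$hMatchesTab = preg_match('/\h/', "\t");

var_dump($sMatchesNewline, $hMatchesNewline, $hMatchesTab);
```

Because \h never consumes a line break, splitting on \h+ leaves the blank line between the paragraphs inside the resulting pieces.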

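Without the /m multiline modifier, ^$ only matches an empty string. With /m, ^ and $ match at the start and end of every line, so ^$ can match an empty line. That is why I added the /m modifier to the options.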
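The effect of /m on ^$ can be checked in isolation. This is only a sketch with a made-up subject string, not the text from the question:

```php
<?php
$subject = "first\n\nsecond";

// Without /m, ^ only matches at the very start and $ at the very end,
// so ^$ cannot match inside a non-empty subject.
$withoutM = preg_match('/^$/', $subject);

// With /m, ^ and $ also match around line breaks, so the empty line is found.
$withM = preg_match('/^$/m', $subject, $m, PREG_OFFSET_CAPTURE);

var_dump($withoutM, $withM, $m[0][1]); // $m[0][1] is the byte offset of the empty line
```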

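Note that I had to escape the / in the last comment; otherwise there would be a warning that the regex is incorrect. Alternatively, use different regex delimiters.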
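As a sketch of the delimiter alternative, the pattern below uses ~ instead of /, so the slash in the comment needs no escaping. The shortened pattern and the sample string are assumptions made for the illustration. The chosen delimiter must then not appear unescaped anywhere in the pattern, including its comments.

```php
<?php
// '~' is the delimiter here, so "sentences/empty lines" is harmless in the comment.
$re = '~
    (?<=[.!?])      # after end-of-sentence punctuation
    (?:\h+|^$)      # split on whitespace between sentences/empty lines
    ~mx';

$parts = preg_split($re, "One. Two! Three?");
print_r($parts);
```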

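【Comments】:

  • Thanks. This almost works, with one quirk: the preg_split regex combines two sentences together. See ideone.com/AUImET. Any ideas? Thanks also for the \h explanation; I was not familiar with it.
  • What if you add PREG_SPLIT_DELIM_CAPTURE, use a capturing group with (\h+|^$), and zero out the element at index 2? See this demo.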
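
The last suggestion can be sketched roughly as follows. This is not the code from the linked demo, which is not reproduced in the thread. It departs from the comment in three places, all of them my own assumptions: the ~ delimiter, a shortened abbreviation list, and removing the sentence together with its trailing separator instead of zeroing only index 2 (zeroing alone would leave two spaces behind). As far as I can tell, the ^$ branch can never match here, because the lookbehind demands punctuation immediately before the split point while an empty line is always preceded by a line break. The paragraph break therefore stays inside one element, which is what the "combined sentences" quirk refers to, and it survives the implode.

```php
<?php
$re = '~
    (?<= [.!?] | [.!?][\'"] )                        # after end-of-sentence punctuation (and optional quote)
    (?<! Mr\. | Mrs\. | Ms\. | Jr\. | Dr\. | Prof\. | Sr\. )  # but not after an abbreviation
    ( \h+ | ^$ )                                     # captured separator, as suggested in the comment
    ~mix';

$text = "This is paragraph one. This is sentence one. Sentence two!\n\n"
      . "This is paragraph two. This is sentence three. Sentence four!";

// Sentences land on even indexes, the captured separators on odd indexes.
$parts = preg_split($re, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// Drop "This is sentence one." (index 2) and the separator after it (index 3).
array_splice($parts, 2, 2);

// Separators are part of the array, so no glue is needed.
$result = implode('', $parts);
echo $result, "\n";
```

With the separators kept in the array, the spaces between the remaining sentences and the blank line between the paragraphs are both restored.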