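[Posted]: 2016-03-07 07:10:47
[Question]:
This is an extension of the PHP sentence boundaries question on SO.
I would like to know how to change the regex so that line breaks are preserved.
The sample code splits some text into sentences, removes one sentence, then puts the text back together: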
<?php
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?] # Either an end of sentence punct,
| [.!?][\'"] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
# or... (you get the idea).
) # End negative lookbehind.
[\s+|^$] # Split on whitespace between sentences or empty lines.
/ix';
$text = <<<EOL
This is paragraph one. This is sentence one. Sentence two!
This is paragraph two. This is sentence three. Sentence four!
EOL;
echo "\nBefore: \n" . $text . "\n";
$sentences = preg_split($re, $text, -1);
$sentences[1] = " "; // remove 'sentence one'
// put text back together
$text = implode( $sentences );
echo "\nAfter: \n" . $text . "\n";
?>
Running this, the output is:
Before:
This is paragraph one. This is sentence one. Sentence two!
This is paragraph two. This is sentence three. Sentence four!
After:
This is paragraph one. Sentence two!
This is paragraph two. This is sentence three. Sentence four!
I am trying to get the "After" text to be identical to the "Before" text, with just the one sentence removed:
After:
This is paragraph one. Sentence two!
This is paragraph two. This is sentence three. Sentence four!
I was hoping this could be done with a tweak to the regex, but what am I missing?
[Comments]:
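-
It looks like there is a problem with this regex: [\s+|^$] is a character class, so it actually matches a single whitespace character, +, |, ^, or $. Replace it with (?:\h+|^$); I guess that is all it needs. -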
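The point about the character class can be checked directly. The following is a minimal sketch, not part of the original thread; the variable names `$class` and `$results` are made up for illustration. It tests single characters against the class from the question.

```php
<?php
// Inside [...] the characters +, |, ^ (when not first) and $ are literals,
// so this class matches exactly one character from the listed set.
$class = '/[\s+|^$]/';

$results = [];
foreach (['+', '|', '^', '$', ' ', "\n", 'a'] as $ch) {
    // preg_match() returns 1 on a match and 0 when there is none.
    $results[$ch] = preg_match($class, $ch);
    printf("%-5s => %d\n", json_encode($ch), $results[$ch]);
}
```

Every character except 'a' matches, which suggests that the + and | inside the brackets are not acting as a quantifier or an alternation.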
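I think you can drop the + after \s, or use \s{1} if you really need it to match exactly one, because \s+ is consuming the other whitespace. Basically you need array( "stuf", "\n", "stuff");, but I can't be sure without testing it, and it's too complicated to run in my head.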
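One way to get an array shaped like array( "stuf", "\n", "stuff") is to capture the separating whitespace and pass PREG_SPLIT_DELIM_CAPTURE to preg_split(). The sketch below is an assumption about what would satisfy the question, not a solution confirmed in the thread. The abbreviation list is shortened, and removing the sentence with array_splice() replaces the original `$sentences[1] = " "` assignment.

```php
<?php
// Same lookbehind idea as in the question, but the whitespace run is
// captured so that preg_split() hands it back instead of discarding it.
$re = '/
    (?<= [.!?] | [.!?][\'"] )           # preceded by end-of-sentence punctuation
    (?<! Mr\. | Mrs\. | Ms\. | Dr\. )   # but not by a common abbreviation
    (\s+)                               # capture the separating whitespace
/ix';

$text = "This is paragraph one. This is sentence one. Sentence two!\n"
      . "This is paragraph two. This is sentence three. Sentence four!";

// With PREG_SPLIT_DELIM_CAPTURE the result alternates between sentences
// (even indexes) and the whitespace that separated them (odd indexes).
$parts = preg_split($re, $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// Remove "This is sentence one." (index 2) and the separator after it (index 3).
array_splice($parts, 2, 2);

// Glue everything back together; the newline survives because it is a part.
$result = implode('', $parts);
echo $result, "\n";
```

Because the separators are kept as array elements, implode() needs no glue string, and the newline between the two paragraphs is restored exactly as it appeared in the input.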