Split a string by comma, quote and full-stop.. with a few exceptions: answers

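【Question Title】: Split a string by comma, quote and full-stop.. with a few exceptions
【Posted】: 2012-06-16 14:55:58
【Question】:

I have a lot of text, similar to the following paragraphs, which I would like to split into words without punctuation (', ", ,, ., newline, etc.), with a few exceptions.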

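Initially considered endemic to the Chalakudy River system in Kerala state, southern India, but now recognised to have a wider distribution in surrounding drainages including the Periyar, Manimala, and Pamba river though the Manimala data may be questionable given it seems to be the type locality of P. denisonii.

In the Achankovil River basin it occurs sympatrically, and sometimes syntopically, with P. denisonii.

Wild stocks may have dwindled by as much as 50% in the last 15 years or so with collection for the aquarium trade largely held responsible although habitats are also being degraded by pollution from agricultural and domestic sources, plus destructive fishing methods involving explosives or organic toxins.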

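The text refers to P. denisonii, which is a type of fish. The name is an abbreviation of the form Genus species. I would like this reference to be kept as a single word.

So, for example, this is the kind of array I would like to see: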

Array
(
    ...
    [44] given
    [45] it
    [46] seems
    [47] to
    [48] be
    [49] the
    [50] type
    [51] locality
    [52] of
    [53] P. denisonii
    [54] In
    [55] the
    ...
)

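The only things that distinguish these species references (such as P. denisonii) from a new sentence (such as end. New) are:

The P (as in P. in the example above, short for Puntius) is only ever one letter, and always a capital
The d (as in .denisonii) is always a lowercase letter or an apostrophe (')

What regular expression could I use with preg_split to give me such an array? I tried a simple explode( " ", $array ), but it does not do the job at all.

Thanks in advance,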

【Comments】:

You could split using explode and str_replace, but I'm not sure about P. denisonii...

Tags: php regex preg-split

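【Solution 1】:

Change your approach: why not use preg_match_all instead of preg_split? Rather than splitting the text on the separators, you match every string that contains no separators.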

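Use it with the regex /(P\. denisonii)|(\S+)/ to match the sequence "P. denisonii" and all other non-whitespace sequences. The literal alternative comes first because alternatives are tried left to right; with \S+ first, "P." would be consumed before the literal could match.

To exclude commas, quotes, full stops and other characters, just replace \S with a negated character class [^...]:

/(P\. denisonii)|([^\s,\.\"]+)/ matches "P. denisonii" as well as all sequences that contain no whitespace (\s), commas, quotes or dots (\.)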
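A minimal sketch of the negated character class in action, assuming a shortened sample sentence; the $text value and the $words variable are illustrative and not part of the original answer:

```php
<?php
// Shortened sample text (an assumption for this sketch, based on the question's paragraph).
$text = 'It seems to be the type locality of P. denisonii. In the "Achankovil" River basin, it occurs.';

// Try the literal species name first, then any run of characters
// that is not whitespace, a comma, a dot or a double quote.
preg_match_all('/P\. denisonii|[^\s,."]+/', $text, $matches);

// With no capture groups in the pattern, $matches[0] holds every full match.
$words = $matches[0];
print_r($words);
```

Here "P. denisonii" should come out as a single entry, while the sentence-ending dot, the comma and the double quotes are skipped because neither alternative can match them.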

Edit: to match generic genus names (note: I have altered your text to test the code better, adding quotes and a bogus genus name)

$text = "Initially considered \"endemic\" to the Chalakudy River system in Kerala state, southern India, but now recognised to have a wider distribution in surrounding drainages including the Periyar, Manimala, and Pamba river though the Manimala data may be questionable given it seems to be the type locality of P. denisonii.

This is a bogus genus name, A. testii.

In the Achankovil River basin it occurs sympatrically, and sometimes syntopically, with P. denisonii.

Wild stocks may have dwindled by as much as 50% in the last 15 years or so with collection for the aquarium trade largely held responsible although habitats are also being degraded by pollution from agricultural and domestic sources, plus destructive fishing methods involving explosives or organic toxins.";


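// First try an abbreviated species reference (capital letter, dot, space, lowercase word),
// then fall back to any run of characters that is not whitespace, a comma, a dot or a quote.
preg_match_all("/([A-Z]\. [a-z]+)|([^\s,\.\"]+)/", $text, $matches, PREG_PATTERN_ORDER);

echo "<pre>";
print_r($matches);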

Note: the array you should pick is $matches[0], not $matches.

【Comments】:
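As a rough illustration of why $matches[0] is the array to use, the sketch below runs the same pattern on a short made-up string; $text, $words and $species are names assumed for this example only:

```php
<?php
// Short made-up test string (an assumption for this sketch).
$text = 'Type locality of P. denisonii. This is bogus, A. testii. End. New sentence.';

// Same pattern as in the answer, written in a single-quoted string.
preg_match_all('/([A-Z]\. [a-z]+)|([^\s,."]+)/', $text, $matches, PREG_PATTERN_ORDER);

// $matches[0]: every full match, in order. This is the word list.
$words = $matches[0];

// $matches[1]: only what group 1 captured (the species references);
// entries for other words are empty strings, so filter them out.
$species = array_values(array_filter($matches[1]));

print_r($words);
print_r($species);
```

Because $matches also contains one sub-array per capture group, printing it whole shows the same words several times, padded with empty strings. Note that "End. New" stays as two words, since "End" is more than one capital letter before the dot.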

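Hi Cranio - I will expand my OP to explain, but it is not just P. denisonii that I want to handle; it is any species reference in the Genus species format (with the genus abbreviated to G.), such as A. panduro, S. daemon, etc.
Hi mate, thanks. For some reason I am getting the following entries: [55] => syntopically, [56] => with P, [57] => denisonii, rather than P. denisonii as a single entry.
Strange, the code I posted works with my test string. Are you using the same regex?
Yes, I copied and pasted your preg_match_all and just replaced $text with my $value.
Could you post your $value here (or on pastebin or somewhere else) so we can cross-check?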