Splitting, counting and formatting multibyte characters in PHP: answers

【Question title】: Splitting, counting and formatting multibyte characters in PHP
【Posted】: 2020-08-11 16:42:42
【Question】:

I am building an experimental PHP application that processes poems written in Cyrillic UTF-8 characters. I want to achieve the following:

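Count the occurrences of each character, plus totals for categories such as "all consonants". The counts may include special characters and punctuation, as long as I can hide some of them in the output. I am using UTF-8, so I can only use multibyte functions; count_chars() is not an option :(
Preserve line breaks and capitalization. I keep several copies of the original text in different formats. They may look redundant, but I want to preserve as much information as possible.
Change the HTML formatting of certain characters based on conditions, for example giving vowels and consonants different background colors, or highlighting every occurrence of a selected character. As I understand it, I first need to split my string into lines (to preserve the breaks) and then turn each line into an array of 1-character chunks. For output, I will join() the lines back together. Unfortunately, I could not find any ideas on how to apply HTML to array values to solve a problem like mine.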

What I tried

Besides not knowing how to do this, I also ran into a few smaller problems. Here is what I do now, step by step.

I collect a poem through the POST method. The English poem is for illustration purposes only.

Sample text:

We shall not cease from exploration 
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time.

I have numbered the steps in the hope of making comments easier.

1. Getting the values with and without tags

This is how it looks in htmlentities() after being submitted through the textarea:

$string = "We shall not cease from exploration<br /> And the end of all our exploring<br /> Will be to arrive where we started<br /> And know the place for the first time.";

How I output the line breaks:

$poem = nl2br($string);

This is a copy without tags:

$droptags = strip_tags($poem);

2. Counting characters

Here is my initial attempt at a count_chars() equivalent, still lacking the counting loop:

$poem2array = preg_split('//u', $droptags, null, PREG_SPLIT_NO_EMPTY);
$unique_characters = array_unique($poem2array);

The output looks like this:

(
[0] => W
[1] => e
[2] => 
...
)

3. Splitting lines into arrays

Splitting into lines:

$lines = preg_split('<br />', $showtags);

My problem is that the array looks like this:

(
[0] => We shall not cease from exploration<
[1] => >
And the end of all our exploring<
[2] => >
Will be to arrive where we started<
[3] => >
And know the place for the first time.
)

I tried to split the text into a nested array. I know it is broken because I only get the last line.

foreach($lines as $line) {
      $line = preg_split('//u', $line, null, PREG_SPLIT_NO_EMPTY);
    }

4. HTML styling

As for HTML styling of the arrays, I have no idea. My reference arrays look like this:

$vowels = array("a", "e", "i");
$consonants = array("b", "c", "d");

$fontcolor = array("vowels" => "blue",
                "consonants" => "orange");

【Comments on the question】:

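$lines = preg_split('/<br[^>]*>/i', $string); Try this to get an array without the br tags.
It would be easier for me if you split this question into multiple posts. I am not sure which part is not working properly and which part I should address.
What do you mean by "this is how it looks in htmlentities()"?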

Tags: php html arrays string multibyte

【Solution 1】:

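If you want to count the vowels and consonants in the text, count the occurrences of each letter first, then check whether each letter is a vowel or a consonant.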

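To split the string into an array of characters, use mb_str_split(). If you are stuck on a PHP version older than 7.4, where mb_str_split() is not available, you can use preg_split('//u', $line, -1, PREG_SPLIT_NO_EMPTY); instead.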
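The two splitting approaches mentioned above can be combined into one small wrapper. This is only a sketch: `split_chars()` is a hypothetical helper name, not part of the answer, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original answer): split a UTF-8 string
// into an array of single characters on old and new PHP versions alike.
function split_chars(string $text): array
{
    if (function_exists('mb_str_split')) {
        // Available since PHP 7.4 (requires the mbstring extension).
        return mb_str_split($text, 1, 'UTF-8');
    }
    // Fallback: an empty pattern with the /u modifier matches between
    // every UTF-8 character; PREG_SPLIT_NO_EMPTY drops the empty edges.
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$chars = split_chars('Привет');
echo count($chars), PHP_EOL;    // number of characters
echo strlen('Привет'), PHP_EOL; // number of bytes, which is larger
```

The difference between the two printed numbers is the reason byte-oriented functions such as str_split() or count_chars() give misleading results on Cyrillic text.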

You can use array_count_values() to reduce the array to a letter frequency count. Then simply total the vowels and the consonants separately.

To handle multibyte strings correctly, use the mbstring extension. For example, mb_strtolower() is the multibyte version of strtolower(), and mb_str_split() is the multibyte version of str_split().

<?php

$poem = <<<'POEM'
We shall not cease from exploration 
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time.
POEM;

$vowels = array("a", "e", "i", "o", "u");
$consonants = array_diff(range('a', 'z'), $vowels); // not necessary to diff because of elseif. Just for demonstration

$letterFrequencyInsensitive = array_count_values(mb_str_split(mb_strtolower($poem)));
$noVowels = 0;
$noConsonants = 0;
foreach ($letterFrequencyInsensitive as $letter => $freq) {
    if (in_array($letter, $vowels, true)) {
        $noVowels += $freq;
    } elseif (in_array($letter, $consonants, true)) {
        $noConsonants += $freq;
    }
}

echo 'Number of vowels: '.$noVowels.PHP_EOL;
echo 'Number of consonants: '.$noConsonants;
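Since the question is about Cyrillic text, the same counting idea can be tried with Cyrillic letter lists. The sketch below assumes the poem is Russian; the two letter lists and the helper name `count_letter_classes()` are illustrative and not part of the answer. Other Cyrillic alphabets would need different lists. It needs PHP 7.4 or later for mb_str_split().

```php
<?php
// Assumption: Russian text. These lists are illustrative only.
// The signs ъ and ь are left out on purpose: they are neither class.
$vowels     = mb_str_split('аеёиоуыэюя', 1, 'UTF-8');
$consonants = mb_str_split('бвгджзйклмнпрстфхцчшщ', 1, 'UTF-8');

// Hypothetical helper: total the letters of $text that fall into each list.
function count_letter_classes(string $text, array $vowels, array $consonants): array
{
    // Lowercase first so the count is case-insensitive, then build a
    // character => frequency map.
    $frequency = array_count_values(
        mb_str_split(mb_strtolower($text, 'UTF-8'), 1, 'UTF-8')
    );
    $totals = ['vowels' => 0, 'consonants' => 0];
    foreach ($frequency as $letter => $count) {
        $letter = (string) $letter; // PHP casts digit keys to int
        if (in_array($letter, $vowels, true)) {
            $totals['vowels'] += $count;
        } elseif (in_array($letter, $consonants, true)) {
            $totals['consonants'] += $count;
        }
    }
    return $totals;
}

$totals = count_letter_classes('Мороз и солнце; день чудесный!', $vowels, $consonants);
echo 'Vowels: ', $totals['vowels'], PHP_EOL;
echo 'Consonants: ', $totals['consonants'], PHP_EOL;
```

Spaces, punctuation and the soft sign fall through both branches, so they are counted in the frequency map but not in either total.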

If you want to format each letter separately, the simplest approach is probably to wrap each letter in a <span> tag and apply a class.

$formattedOutput = '';
$fontcolor = array("vowels" => "blue",
    "consonants" => "orange");

foreach (mb_str_split($poem) as $char) {
    // Compare in lowercase, but output the original character to keep capitalization
    $lowercase = mb_strtolower($char);
    if (in_array($lowercase, $vowels, true)) {
        $formattedOutput .= '<span class="'.$fontcolor['vowels'].'">'.$char.'</span>';
    } elseif (in_array($lowercase, $consonants, true)) {
        $formattedOutput .= '<span class="'.$fontcolor['consonants'].'">'.$char.'</span>';
    } else {
        // Escape everything else so characters such as < or & cannot break the markup
        $formattedOutput .= htmlspecialchars($char);
    }
}

echo nl2br($formattedOutput);
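The question also asks about highlighting every occurrence of one selected character. The same per-character loop can be adapted for that. This is a sketch under assumptions: `highlight_char()` is a hypothetical helper, and the choice of the <mark> element is arbitrary.

```php
<?php
// Hypothetical helper: wrap every occurrence of $target (compared
// case-insensitively) in <mark>, escape all characters for HTML, and
// turn newlines into <br /> so the line breaks of the poem survive.
function highlight_char(string $text, string $target): string
{
    $target = mb_strtolower($target, 'UTF-8');
    $out = '';
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        $safe = htmlspecialchars($char, ENT_QUOTES, 'UTF-8');
        $out .= mb_strtolower($char, 'UTF-8') === $target
            ? '<mark>' . $safe . '</mark>'
            : $safe;
    }
    return nl2br($out);
}

echo highlight_char("Да будет день\nи ночь", 'д');
```

Because the original character is written to the output and only the comparison uses the lowercased copy, capitalization is preserved.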

【Comments】:

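@Martin I don't understand why you added the first line. The default encoding is UTF-8. Why would I need to add that line?
It is shown as a way to make sure the page's default encoding is correct. The mbstring character set can be set in php.ini, and there is no guarantee that UTF-8 is the default encoding used in a script. php.ini can be set to something else by someone else (e.g. on shared hosting), and you did not specify an exact encoding in your mb_ calls. Better to show something that isn't needed than to omit something that is. :-)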
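The line this exchange refers to is not shown in the answer as preserved here, so the following is only a guess at the idea: two common ways to pin the encoding instead of relying on php.ini. The sample word is an arbitrary illustration.

```php
<?php
// Option 1: set the default encoding once for all later mb_* calls
// in this script, regardless of what php.ini says.
mb_internal_encoding('UTF-8');

// Option 2: pass the encoding explicitly on every call.
$lower = mb_strtolower('ПОЭМА', 'UTF-8');
$chars = mb_str_split($lower, 1, 'UTF-8'); // PHP 7.4+

echo implode(' ', $chars), PHP_EOL;
echo mb_strlen($lower, 'UTF-8'), PHP_EOL;
```

Either option makes the script independent of the server configuration; passing the encoding per call is more verbose but keeps each call self-explanatory.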

【Solution 2】:

Counting characters

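$count = [];
foreach (preg_split('//u', $droptags, -1, PREG_SPLIT_NO_EMPTY) as $char) {
    $count[$char] = ($count[$char] ?? 0) + 1; // indexing bytes with $droptags[$i] would cut UTF-8 characters in half
}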

Splitting lines into an array

I had to do something tricky here: I replaced the <br /> tag with another marker before splitting; otherwise the > would always show up.

$showtags = "We shall not cease from exploration<br /> And the end of all our exploring<br /> Will be to arrive where we started<br /> And know the place for the first time.";
$showtags = str_replace(";",",",$showtags);
$showtags = str_replace("<br />",";",$showtags);
$lines = preg_split('/;/', $showtags);
foreach($lines as $line) {
    echo "lines= $line<BR>";
}

In your code, I suggest renaming the variable; otherwise it gets mixed up with the loop variable $line.

$lineOut = [];
foreach($lines as $line) {
  // Append with [] so that earlier lines are not overwritten
  $lineOut[] = preg_split('//u', $line, -1, PREG_SPLIT_NO_EMPTY);
}
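An alternative to swapping the tag for a marker is to give preg_split() a properly delimited pattern. In the question's call, preg_split('<br />', ...), PHP treats < and > as the pattern delimiters, so the actual pattern is only `br /`, which is why the stray < and > are left behind. The sketch below is one way to do it; the variable name `$charsPerLine` is an illustrative choice.

```php
<?php
$showtags = "We shall not cease from exploration<br /> And the end of all our exploring<br /> Will be to arrive where we started<br /> And know the place for the first time.";

// Split on <br>, <br/> or <br /> (case-insensitive) and swallow the
// whitespace that follows the tag. The slashes are the delimiters.
$lines = preg_split('/<br\s*\/?>\s*/i', $showtags);

// Build a nested array: one array of characters per line.
$charsPerLine = [];
foreach ($lines as $line) {
    $charsPerLine[] = preg_split('//u', $line, -1, PREG_SPLIT_NO_EMPTY);
}

echo count($lines), PHP_EOL;
echo implode('', $charsPerLine[1]), PHP_EOL;
```

Joining each inner array with implode('') and the lines with "<br />" restores the original layout after any per-character processing.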
