Regex is destroying my UTF-8 XML (PHP)

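【Question title】: Regex is destroying my UTF-8 XML (PHP)
【Posted】: 2017-01-05 12:04:30
【Question】:

I have a problem. My code downloads some XML files and removes a few tags I don't need. Up to that point everything works: my XML files are UTF-8 and I have no problems.

But since I added code that replaces and changes the title values, my XML file is no longer valid UTF-8, and I get the following error message: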

"D:\Anwendung\PHP 7\php-win.exe" C:\Users\Jan\PhpstormProjects\censored\test.php
PHP Warning:  DOMDocument::load(): Input is not proper UTF-8, indicate encoding !
Bytes: 0xE3 0xA4 0x63 0x68 in file:/C:/Users/Jan/PhpstormProjects/censored/data/gamesplanet.xml, line: 1423 in C:\Users\Jan\PhpstormProjects\censored\test.php on line 18
PHP Fatal error:  Uncaught Error: Call to a member function getElementsByTagName() on null in C:\Users\Jan\PhpstormProjects\censored\test.php:23
Stack trace:
#0 C:\Users\Jan\PhpstormProjects\censored\test.php(86): countAd('data/gamesplane...')
#1 {main}
  thrown in C:\Users\Jan\PhpstormProjects\censored\test.php on line 23

Process finished with exit code 255

Line 1423 now reads: W㥣hter Von Mittelerde

If I don't run the code below, I get no error message, and line 1423 reads: Wächter von Mittelerde

Does anyone have an idea that could help me?

Code:

function loadTitles($tagName, $path){

    $dom = new DOMDocument('1.0', 'utf-8');
    $dom->preserveWhiteSpace = false;
    $dom->formatOutput = true;
    $dom->load($path);

    $marker = $dom->getElementsByTagName($tagName);

    for ($i = $marker->length - 1; $i >= 0; $i--) {
        $word = $marker->item($i)->textContent;
        $escapedWord = escapWord($word);
        $escapedWord = modifyWord($escapedWord);
        $marker->item($i)->textContent = $escapedWord;
    }

    $dom->saveXML();
    $dom->save($path);
}
function escapWord($string){

    $replaceNothing = [":", ",", ";", "`", "#", "'", "´", "–", "!", "(", ")", ".", "@", "’", "+", "™"];
    $replaceSpace = ["-", "–", "_", "/", ":"];
    $delete = ["Steam", "Eu", "Key", "CD", "Gift", "Edition", "Pack", "Uplay", "Required", "Collection", "Origin", "HD", "Complete", "Digital", "Download", "EA", "Europa", "RPG", "Activated", "Access", "Code", "Limited", "Direct", "Bundle", "Special", "CDKEY", "GLOBAL", "EARLY", "ACCESS", "Card", "Cartel", "Player", "Trade", "DE", "GOG", "Multilanguage", "Multi", "Full", "Only", "UNCUT", "Cut", "Box", "Ps Vita", "VIP", "Rockstar", "Subscription"];

    $string= str_replace($replaceNothing, '', $string);
    $string= str_replace($replaceSpace, ' ', $string);
    $string= preg_replace('~\b(?:' . implode('|', $delete) . ')\b~i', '', $string);
    $string= str_replace("&amp;", ' & ', $string);
    $string= strtolower($string);
    $string= ucwords($string);
    $string= preg_replace('/\bAsia\b/i', 'ASIA', $string);
    $string= preg_replace('/\buk\b/i', 'UK', $string);
    $string= preg_replace('/\bAU\b/i', 'AU', $string);
    $string= preg_replace('/\bXBOX\b/i', 'XBOX ', $string);
    $string= preg_replace('/\bpc\b/i', 'PC', $string);
    $string= preg_replace('/\bus\b/i', 'US', $string);
    $string= preg_replace('/\bru\b/i', 'RUS', $string);
    $string= preg_replace('/\bRUS\b/i', 'RUS', $string);
    $string= preg_replace('/\bPS4\b/i', 'PS4', $string);
    $string= preg_replace('/\bAddon\b/i', 'AddOn', $string);
    $string= preg_replace('/\bPlay Station 4\b/i', 'PS4', $string);
    $string= preg_replace('/\bPs4\b/i', 'PS4', $string);
    $string= preg_replace('/\bPs3\b/i', 'PS3', $string);
    $string= preg_replace('/\bPlayStation 4\b/i', 'PS4', $string);
    $string= preg_replace('/\bPlay Station 3\b/i', 'PS3', $string);
    $string= preg_replace('/\bPlayStation 3\b/i', 'PS3', $string);
    $string= preg_replace('/\bPlayStation Network\b/i', 'PSN', $string);
    $string= preg_replace('/\bPSN\b/i', 'PSN', $string);
    $string= preg_replace('/\bXX\b/i', 'XX', $string);
    $string= preg_replace('/\bXIX\b/i', 'XIX', $string);
    $string= preg_replace('/\bXVIII\b/i', 'XVIII', $string);
    $string= preg_replace('/\bXVII\b/i', 'XVII', $string);
    $string= preg_replace('/\bXVI\b/i', 'XVI', $string);
    $string= preg_replace('/\bXV\b/i', 'XV', $string);
    $string= preg_replace('/\bXIV\b/i', 'XIV', $string);
    $string= preg_replace('/\bXiii\b/i', 'XIII', $string);
    $string= preg_replace('/\bXii\b/i', 'XII', $string);
    $string= preg_replace('/\bXi\b/i', 'XI', $string);
    $string= preg_replace('/\bIX\b/i', 'IX', $string);
    $string= preg_replace('/\bVIII\b/i', 'VIII', $string);
    $string= preg_replace('/\bVII\b/i', 'VII', $string);
    $string= preg_replace('/\bVI\b/i', 'VI', $string);
    $string= preg_replace('/\bV\b/i', 'V', $string);
    $string= preg_replace('/\bIV\b/i', 'IV', $string);
    $string= preg_replace('/\bIII\b/i', 'III', $string);
    $string= preg_replace('/\bII\b/i', 'II', $string);
    $string= preg_replace('/\bdlc\b/i', 'DLC', $string);
    $string= trim(preg_replace('/\s\s+/', ' ', str_replace("\n", " ", $string)));

    return $string;
}
function modifyWord($string){

    if(strpos($string, "Counter Strike Offensive") !== false){
        $newstring = explode("Offensive", $string);;
        $newstring[0] = $newstring[0] . "Global Offensive";
        $string = $newstring[0] . $newstring[1];
    }

    return $string;
}

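Greetings, and thanks!

【Comments】:

The problem is that you are using functions that do not support multibyte characters (str_replace, ucwords, strtolower, and preg_replace without the u modifier) on multibyte (UTF-8) strings. Use the mb_ functions instead, and use the u modifier with preg_replace.
Note that preg_replace can take arrays as its first and second arguments.
Can you give me a code snippet showing how to do that? I don't know what the mb_ functions are, or what the "u modifier" means.
See php.net/manual/en/ref.mbstring.php and php.net/manual/en/reference.pcre.pattern.modifiers.php.
1) Replace strtolower with mb_strtolower, ucwords with mb_convert_case (using MB_CASE_TITLE), and so on. 2) Add u at the end of your regular expressions ("/something/iu").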
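
As a rough illustration of what these comments describe, here is a minimal sketch, assuming the mbstring extension is available and the input is valid UTF-8. The helper name normalizeTitle() and the sample title are made up for this example; mb_convert_case() with MB_CASE_TITLE stands in as the multibyte counterpart of ucwords(). The bytes in the warning (0xE3 0xA4) are consistent with a single-byte lowercase step turning the 0xC3 lead byte of ä into 0xE3, although that depends on the locale in use.

```php
<?php
// Hypothetical helper (not from the question): lowercase, then title-case, a UTF-8 title.
function normalizeTitle(string $title): string
{
    // mb_strtolower() works on characters, so the two bytes of "ä" stay intact.
    $title = mb_strtolower($title, 'UTF-8');
    // MB_CASE_TITLE upper-cases the first letter of each word.
    $title = mb_convert_case($title, MB_CASE_TITLE, 'UTF-8');
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $title = preg_replace('/\bpc\b/iu', 'PC', $title);
    // Collapse runs of whitespace and trim the ends.
    return trim(preg_replace('/\s+/u', ' ', $title));
}

$result = normalizeTitle('WÄCHTER  von mittelerde pc');
var_dump(mb_check_encoding($result, 'UTF-8')); // bool(true): still valid UTF-8
```

Because every step is multibyte-aware, the string can be written back into the DOM without tripping the "Input is not proper UTF-8" check.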

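Tags: php regex xml utf-8

【Solution 1】:

You should use the u modifier to switch your patterns into Unicode mode. In that mode you match Unicode characters (code points) rather than single bytes. The ä in Wächter is encoded as two bytes in UTF-8, and in single-byte mode one of those bytes is interpreted as a word boundary.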

preg_match('(.)u', 'äöü', $match);
var_dump($match);

Output:

array(1) {
  [0]=>
  string(2) "ä"
}
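
For contrast, the same match can be run with and without the u modifier. This is only a small sketch, assuming the source file itself is saved as UTF-8; the variable names are mine.

```php
<?php
$subject = 'äöü'; // assumes this source file is saved as UTF-8

// Without u, the dot matches a single byte: the first half of "ä".
preg_match('(.)', $subject, $byteMatch);
// With u, the dot matches a whole UTF-8 encoded character.
preg_match('(.)u', $subject, $charMatch);

var_dump(strlen($byteMatch[0]));  // int(1)
var_dump(strlen($charMatch[0]));  // int(2)
var_dump(bin2hex($byteMatch[0])); // string(2) "c3"
```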

As you can see, the example matches the first character, not just the first byte. Next, preg_replace() accepts arrays as arguments, which lets you simplify the calls.

var_dump(preg_replace(['(ä)u', '(ü)u'], '_', 'äöü'));

Output:

string(4) "_ö_"

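A better option, though, may be to use character classes and the | operator in your patterns. $replaceNothing and $replaceSpace are arrays of characters, so they can be turned into character classes: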

$replaceWithNothing = '([,;`#\'´!().@’+™]+|(?:\b(?:Steam|Eu|Key)\b))u';
$replaceWithSpace = '([-–_/:]+)u';

var_dump(
  preg_replace(
    [$replaceWithNothing, $replaceWithSpace], 
    ['', ' '], 
    'remove (™) and :replace:'
  )
);
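
The call above leaves doubled spaces wherever a removed symbol sat between two spaces. Here is a minimal sketch of one way to tidy that up, assuming the same two patterns; the helper name cleanSymbols() is hypothetical and not part of the original answer.

```php
<?php
// Patterns copied from the answer above.
$replaceWithNothing = '([,;`#\'´!().@’+™]+|(?:\b(?:Steam|Eu|Key)\b))u';
$replaceWithSpace = '([-–_/:]+)u';

// Hypothetical helper: apply both patterns, then collapse leftover whitespace.
function cleanSymbols(string $text, string $nothing, string $space): string
{
    $text = preg_replace([$nothing, $space], ['', ' '], $text);
    // \s+ with the u modifier collapses any run of whitespace into one space.
    return trim(preg_replace('(\s+)u', ' ', $text));
}

$cleaned = cleanSymbols('remove (™) and :replace:', $replaceWithNothing, $replaceWithSpace);
var_dump($cleaned); // string(18) "remove and replace"
```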

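For the word replacements:

// Map each pattern to its replacement.
$replaceWords = [
  '(\bAsia\b)ui' => 'ASIA',
  '(\buk\b)ui' => 'UK',
];
// $input is the string to clean up.
$output = preg_replace(array_keys($replaceWords), array_values($replaceWords), $input);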
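
A self-contained sketch of how such a pattern map could be applied to a title containing an umlaut. The function name applyWordMap(), the extra DLC entry, and the sample title are made up for illustration.

```php
<?php
// Hypothetical helper: apply a pattern => replacement map with one preg_replace() call.
function applyWordMap(string $input, array $map): string
{
    return preg_replace(array_keys($map), array_values($map), $input);
}

$replaceWords = [
    '(\bAsia\b)ui' => 'ASIA',
    '(\buk\b)ui'   => 'UK',
    '(\bdlc\b)ui'  => 'DLC',
];

$output = applyWordMap('Wächter Von Mittelerde Dlc Uk', $replaceWords);
var_dump($output); // string(30) "Wächter Von Mittelerde DLC UK"
```

Keeping patterns and replacements in one array means a new rule is a one-line change rather than another preg_replace() call.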

I am not sure why you don't use a simple replacement for the modifyWord() function. You are replacing the first occurrence of "Counter Strike Offensive" with "Counter Strike Global Offensive".
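
A sketch of what such a simple replacement might look like, limited to the first occurrence. It is not an exact equivalent of the explode() version in the question, which also drops everything after a second "Offensive".

```php
<?php
// Sketch of modifyWord() as a plain replacement, limited to the first occurrence.
function modifyWord(string $string): string
{
    return preg_replace(
        '(Counter Strike Offensive)u',
        'Counter Strike Global Offensive',
        $string,
        1 // limit: replace only the first match
    );
}

$modified = modifyWord('Counter Strike Offensive PC');
var_dump($modified); // string(34) "Counter Strike Global Offensive PC"
```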

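The comments mention using the mb_* functions. I would suggest the more modern ICU grapheme functions instead. ICU (the intl extension) is the standard, more modern and more powerful extension for handling Unicode in PHP.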
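
To give a feel for what the grapheme functions add over mb_*, here is a small sketch, assuming the intl extension is installed (the call is guarded in case it is not). The decomposed sample string is my own illustration, not something from the question.

```php
<?php
// "a" followed by U+0308 (combining diaeresis): renders as one "ä", stored as two code points.
$decomposed = "a\u{0308}";

var_dump(strlen($decomposed));             // int(3): bytes
var_dump(mb_strlen($decomposed, 'UTF-8')); // int(2): code points

// grapheme_strlen() needs the intl extension, so guard the call.
if (function_exists('grapheme_strlen')) {
    var_dump(grapheme_strlen($decomposed)); // int(1): user-perceived characters
}
```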

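【Discussion】:

But this doesn't solve my problem with $string = mb_strtolower($string, 'UTF-8'); ...
Well, strtolower() is an ANSI (single-byte) function and will break UTF-8. The same goes for ucwords(). But I don't think you need to change the case of the string at all: the patterns already use the i (case-insensitive) modifier. Adding u should be enough.
I only use strtolower() because I want all strings to be uniform. So I call strtolower() first and then ucwords(), so that they all look nice.
The mb_*() functions are meant as replacements for the standard string functions when handling multibyte strings. They are part of a default installation, and I don't expect them to be removed in the near future, so you can use them. Still, I suggest keeping the ICU functions in mind and getting to know them. They are not only Unicode string functions but also internationalization and localization functions.
OK, thanks! But this is my problem and I haven't solved it... it's annoying! [link] stackoverflow.com/questions/41486578/…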