PHP parse xml with html content: answers

【Question title】: PHP parse xml with html content
【Posted】: 2011-12-06 12:43:52
【Question description】:

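Is it possible to use the default xml classes in php to parse an xml file in such a way that only the elements in one namespace are treated as xml? I want to parse xml files in which some elements contain html code, and preferably I don't want to wrap every element in cdata tags or escape all special characters. Because html syntax is so similar to xml, most parsers will fail to parse this properly.

Example: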

<ns:root>
    <ns:date>
        06-12-2011
    </ns:date>
    <ns:content>
        <html>
        <head>
        <title>Sometitle</title>
        </head>
        <body>
        --a lot of stuff here
        </body>
        </html>
    </ns:content>
</ns:root>

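In this example, I want all the html inside the ns:content element to be the content of that element, and it should not be parsed itself. Is this possible with the default parsers such as simplexml, or should I write my own parser?

Edit: let me explain my situation a bit better. I want to create a small personal php framework in which code is separated from HTML (similar to MVC, but not exactly the same). A lot of the HTML is identical across multiple pages, but not all of it, and some data (from a database, for example) has to be inserted into certain pages, no different from an ordinary website. So I came up with the idea of using separate html component files that can be parsed by a script. It would look like this: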

main.fw:

<html>
<head>
    <title>
        <fw:placeholder name="title" />
    </title>
</head>
<body>
    <div id="menubar">
        <ul>
            <li>page1</li>
            <li>page2</li>
        </ul>
    </div>
    <div id="content">
        <fw:placeholder name="maincontent" />
    </div>
</body>
</html>

page1.fw

<fw:component file="main.fw">
    <fw:content name="title">
        page1
    </fw:content>
    <fw:content name="maincontent">
        some content with html
    </fw:content>
</fw:component>

Result after parsing: page1

page1
page2

some content with html

This question is mainly about the second type of file, where html is nested inside xml elements.

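【Question discussion】:

This has been done a million times. Look at how other PHP CMS systems do it; I would guess they have found an approach that has proven to work well.
I already figured that many people have done this before, which is why I thought it should be possible. Do you happen to know a CMS that uses something similar?

Tags: php xml parsing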

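【Solution 1】:

An XML file that contains non-XML parts is not an XML file. So you cannot expect an XML parser to be able to parse it. For a document to be XML, the whole thing must be XML.

What you are asking is essentially "is there a parser that can parse an angle-bracket language I made up?" Maybe DOMDocument->loadHTML() or html5lib will interpret it the way you expect, but there are no guarantees.

Is it really such a terrible burden for the included "html" bits to be valid XML? It is good HTML hygiene anyway, and if you are willing to do that, you can very easily implement your entire view system with XSL templates. Most of the benefit of a node-aware templating system is precisely that you can manipulate nodes directly and get a good guarantee that the final document is valid. Why take on the burden of node-awareness without any of the benefits? You might as well use a string-based system like every other templating system. At least it would be faster.

Note that once you have built the final DOM, you can output it as something else, such as HTML, so the fact that all your input templates are XML does not mean your output has to be.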
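As a rough illustration of that last point, here is a minimal sketch that loads a well-formed XML template with DOMDocument and serializes only the embedded markup as HTML. The `fw` prefix, the `urn:example:fw` namespace URI, and the template itself are made up for this example; they are not part of the question.

```php
<?php
// Hypothetical template: well-formed XML with a made-up "fw" namespace URI.
$xml = <<<XML
<fw:root xmlns:fw="urn:example:fw">
  <fw:content><html><head><title>Sometitle</title></head><body><p>Hello world</p></body></html></fw:content>
</fw:root>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Locate the embedded <html> element (it has no namespace of its own).
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('fw', 'urn:example:fw');
$htmlNode = $xpath->query('/fw:root/fw:content/html')->item(0);

// Serialize just that subtree using HTML rules instead of XML rules.
$htmlOutput = $doc->saveHTML($htmlNode);
echo $htmlOutput, "\n";
```

The input has to be well-formed XML for `loadXML()` to accept it; only the output side is relaxed to HTML here.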

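【Discussion】:

This system is not meant to give a php application an interface for parsing or editing html. It is meant to completely separate the gui and html from the php code, just like MVC. Speaking in MVC terms, though, I want the views to be modular. I used ASP.NET MVC as inspiration, but there you only get one master page (like main.fw in my example) and one view per page (like page1.fw). I want to create a similar system, but without the limitation of having only one reusable view component (the master page). (Out of characters...)
Making the html bits valid xml is not a burden for me, not at all, but it still causes me some problems, such as the doctype definition of the html page. Again, this system is meant to "merge" components together, not to complain about how xml-like the html pages are. Sorry if this is a bit nitpicky, by the way.
You might like VTE. I believe it can support what you are doing, and they have already worked out these details.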

【Solution 2】:

When using DOMDocument you can use textContent: http://www.php.net/manual/en/class.domnode.php
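A minimal sketch of what `textContent` returns, assuming the document is already well-formed XML (the `urn:example:ns` namespace URI and the sample markup are made up for illustration). Note that `textContent` yields only the concatenated text and drops the tags; re-serializing the child nodes is one way to get the markup back.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<ns:root xmlns:ns="urn:example:ns">'
    . '<ns:date>06-12-2011</ns:date>'
    . '<ns:content><b>bold</b> text</ns:content>'
    . '</ns:root>'
);

$content = $doc->getElementsByTagNameNS('urn:example:ns', 'content')->item(0);

// textContent: the text of the element and all its descendants, without tags.
$text = $content->textContent;

// To keep the markup, serialize each child node instead.
$inner = '';
foreach ($content->childNodes as $child) {
    $inner .= $doc->saveXML($child);
}

echo $text, "\n";   // text only
echo $inner, "\n";  // markup preserved
```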

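【Discussion】:

But how does this parse an xml file? As far as I understand from the documentation, this function reads some content from a PHP DOMNode. However, I want to load an xml file into some form of php xml representation. So the problem is getting it into a domnode, not getting it out.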

【Solution 3】:

You want the HTML code to be treated as non-XML code, and that is exactly what character data (CDATA) sections were designed for.

<ns:root>
    <ns:date>
        06-12-2011
    </ns:date>
    <ns:content>
        <![CDATA[
            <html>
            <head>
            <title>Sometitle</title>
            </head>
            <body>
            --a lot of stuff here
            </body>
            </html>
        ]]>
    </ns:content>
</ns:root>

It is better to rely on this than to write your own parser. Use the XMLWriter::writeCData() method to write CDATA sections.
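A minimal sketch of that round trip, assuming a made-up namespace URI (`urn:example:ns`) since the example document does not declare one: the HTML is written unescaped with `XMLWriter::writeCdata()` and read back through DOMDocument.

```php
<?php
$html = '<html><head><title>Sometitle</title></head>'
      . '<body>--a lot of stuff here</body></html>';

// Write the document; the HTML goes into a CDATA section as-is.
$writer = new XMLWriter();
$writer->openMemory();
$writer->startDocument('1.0', 'UTF-8');
$writer->startElementNs('ns', 'root', 'urn:example:ns'); // hypothetical URI
$writer->writeElementNs('ns', 'date', null, '06-12-2011');
$writer->startElementNs('ns', 'content', null);
$writer->writeCdata($html);
$writer->endElement(); // ns:content
$writer->endElement(); // ns:root
$writer->endDocument();
$xml = $writer->outputMemory();

// Read it back: the CDATA body comes out as one plain string.
$doc = new DOMDocument();
$doc->loadXML($xml);
$content = $doc->getElementsByTagNameNS('urn:example:ns', 'content')->item(0);
$roundTripped = $content->textContent;

echo $roundTripped, "\n";
```

One caveat: the sequence `]]>` cannot appear inside a CDATA section, so HTML containing it would need to be split across two sections.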

Important: HTML tags inside a CDATA section do not need to be encoded!

Quoting Wikipedia on CDATA:

However, if written like this:

<![CDATA[<sender>John Smith</sender>]]>

then the code is interpreted the same as if it had been written like this:

&lt;sender&gt;John Smith&lt;/sender&gt;

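【Discussion】:

The point is that the XML files should be human-friendly (both to read and to write), and I don't think CDATA tags are very friendly to humans, so I would rather avoid them. I was hoping I could instruct the xml parser to parse only a certain namespace as xml. As an ugly workaround, I could insert these CDATA tags with a regex before feeding the file to the xml parser, since I know where they should go, but that is a lot of unnecessary processing and quite error-prone.
Why not separate the HTML (the content, I assume) and the XML (the metadata, I assume) into different files? That way you don't need to mess with an ugly solution, and you can even preview the HTML by opening the file in a browser.
I added more details to the original question. It may give you a better idea of what I am trying to achieve.
See my edited post. With CDATA you don't need to escape the HTML, so it is still human-readable!
But I would prefer not to have the CDATA tags either, even if that is a bit nitpicky. I only want the code that I use myself, and I don't want to insert any code that the xml parser requires, or at least as little of it as possible. I know I can use CDATA tags, but this question is about how to avoid them.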
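For reference, the regex workaround mentioned in the comments above could look roughly like the sketch below. `wrapInCdata` is a hypothetical helper, not something from the thread, and it assumes the target element is never nested inside itself and never self-closing. The `urn:example:ns` namespace URI is also made up.

```php
<?php
// Hypothetical helper: wraps the body of every <prefix:element> in CDATA
// so that a regular XML parser will accept arbitrary HTML inside it.
function wrapInCdata(string $source, string $prefix, string $element): string
{
    $tag = preg_quote($prefix . ':' . $element, '~');
    $wrapped = preg_replace_callback(
        "~(<{$tag}\b[^>]*>)(.*?)(</{$tag}>)~s",
        function (array $m): string {
            // Split any literal "]]>" so it cannot end the section early.
            $body = str_replace(']]>', ']]]]><![CDATA[>', $m[2]);
            return $m[1] . '<![CDATA[' . $body . ']]>' . $m[3];
        },
        $source
    );
    return $wrapped ?? $source; // preg_replace_callback returns null on error
}

// The embedded HTML is deliberately not well-formed XML (<br>, raw &).
$source = '<ns:root xmlns:ns="urn:example:ns">'
    . '<ns:content><html><body><br>unclosed & raw</body></html></ns:content>'
    . '</ns:root>';

$doc = new DOMDocument();
$doc->loadXML(wrapInCdata($source, 'ns', 'content'));
$content = $doc->getElementsByTagNameNS('urn:example:ns', 'content')->item(0);
$html = $content->textContent;

echo $html, "\n";
```

As the comment itself points out, this adds a preprocessing pass and is easy to get wrong, so it is only a sketch of the idea.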

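【Solution 4】:

I decided to create a simple parser to see what the results would be. Since I am not parsing valid XML, I will call it XMLIsh from now on.

The parser actually works quite well, and the performance is decent too: I ran some tests and found that on valid xml documents it is only about 10 times slower than SimpleXMLElement, even though SimpleXMLElement is built into php and my function is plain php. This parser also works on "XMLIsh" documents, as described above. So as long as you don't need it to be extremely fast, this can be a valid solution.

In my case these documents are only parsed once in a while, because the output is cached, so I think this will work for me.

Anyway, here is my code: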

/**
 * This function parses a string as an XMLIsh document. An XMLIsh document is very similar to xml, but only one namespace should be parsed. 
 * 
 * parseXMLish walks through the document and creates a tree while doing so. 
 * Each element will be represented as an array, with the following content:
 * -index = 0: An array with as first element (index = 0) the type of the element. All following elements are its arguments with index=name and value=value.
 * -index = 1: Optional:an array with the content of this element. If the content is a string, this array will only have one element, namely the content of the string.
 * 
 * @param &$string The XMLIsh string to be parsed
 * @param $namespace The namespace which should be parsed.
 * @param &$offset The starting point of parsing. Default = 0
 * @param $openingTag The current opening tag. This argument shouldn't be set manually; the function needs it to check that a closing tag is valid.
 */
function parseXMLish(&$string,$namespace,&$offset=0,$openingTag = ""){
    //Whitespace doesn't matter, so trim it:)
    $string = trim($string);
    $result = array();
    //We need to find our mvc elements. These elements use xml syntax and should have the namespace mvc. 
    //Opening, closing and self closing tags are found.
    while(preg_match("/<(\/)?" . preg_quote($namespace, "/") . ":(\w*)(.*?)(\/)?>/",$string,$matches,PREG_OFFSET_CAPTURE,$offset)){
        //Before our first mvc element, other text might have been found (e.g. html code). 
        //This should be added to our result array first. Again, strip the whitespace.
        $preText = substr($string,$offset,$matches[0][1]-$offset);
        $trimmedPreText = trim($preText);
        if ($trimmedPreText !== '')
            $result[] = $trimmedPreText;
        //We could have found 2 types of tags: closing and opening (including self closing) tags.
        //We need to distinguish between those two.
        if ($matches[1][0] == ''){
            //This tag was an opening tag. This means we should add this to the result array.
            //We add the name of this tag to the element first.
            $result[][0][0] = $matches[2][0];
            //Tags can also have arguments. We will find them here, and store them in the result array.
            preg_match_all("/\s*(\w+=[\"']?\S+[\"'])/",$matches[0][0],$arguments);
            foreach($arguments[1] as $argument){
                list($name,$value)=explode("=",$argument,2);
                $value = str_replace("\"","",$value);
                $value = str_replace("'","",$value);
                $result[count($result)-1][0][$name]=$value;
            }
            //We need to recalculate our offset. So lets do that. 
            $offset +=  strlen($preText) + strlen($matches[0][0]);
            //Now we will have to fill our element with content. 
            //This is only necessary if this is a regular opening tag, and not a self-closing tag.
            //Reset $content first so a self-closing tag doesn't inherit the content of an earlier element.
            $content = array();
            if (!(isset($matches[4]) && $matches[4][0] == "/")){
                $content = parseXMLish($string, $namespace, $offset,$matches[2][0]);
            }
            //Only add content when there is any. 
            if (!empty($content))
                $result[count($result)-1][] = $content;
        }else{
            //This tag is a closing tag. It means that we only have to update the offset, and that we can go one level up
            //That is: return what we have so far back to the previous level. 
            //Note: the closing tag is the closing tag of the previous level, not of the current level. 
            if ($matches[2][0] != $openingTag)
                throw new Exception("Closing tag doesn't match the opening tag. Opening tag: $openingTag. Closing tag: {$matches[2][0]}");
            $offset +=  strlen($preText) + strlen($matches[0][0]);
            return $result;
        }
    }
    //If we have any text left after our last element, we should add that to the array too.
    $postText = substr($string,$offset);
    if (!empty($postText))
        $result[] = $postText;

    //We're done!
    return $result;     
}

【Discussion】: