【Question title】: Get text inside paragraph tag in xml file in specific node
【Posted】: 2013-11-25 18:14:39
【Question】:

I have this XML file:

http://www.metacafe.com/tags/cats/rss.xml

Using this code:

$xml = simplexml_load_file('http://www.metacafe.com/tags/cats/rss.xml', 'SimpleXMLElement', LIBXML_NOCDATA);
echo $xml->channel->item->title . "<br>";
echo $xml->channel->item->description . "<br>";

I get this output:

Dad Challenges Kids to Climb Walls to Get Candy<br>
<a href="http://www.metacafe.com/watch/cb-M0fIp1ctKtsn/dad_challenges_kids_to_climb_walls_to_get_candy/"><img src="http://s3.mcstatic.com/thumb/11150410/28824820/4/directors_cut/0/1/dad_challenges_kids_to_climb_walls_to_get_candy.jpg?v=1" align="right" border="0" alt="Dad Challenges Kids to Climb Walls to Get Candy" vspace="4" hspace="4" width="134" height="78" /></a>
                <p>
                Nick Dietz compiles some of the week's best viral videos, 
                including an elephant trying really hard to break a stick, a cat
                sunbathing and kids climbing up the walls to get candy. Plus, 
                making  music with a Ford Fiesta.                              
                <br>Ranked <strong>4.00</strong> / 5 | 78 views | <a href="http://www.metacafe.com/watch/cb-M0fIp1ctKtsn/dad_challenges_kids_to_climb_walls_to_get_candy/">0 comments</a><br/>
                </p>
                <p>
                 <a href="http://www.metacafe.com/watch/cb-M0fIp1ctKtsn/dad_challenges_kids_to_climb_walls_to_get_candy/"><strong>Click here to watch the video</strong></a> (02:38)<br/>
                    Submitted By:                       <a href="http://www.metacafe.com/channels/CBS/">CBS</a><br/>
                    Tags:
                    <a href="http://www.metacafe.com/topics/penna/">Penna</a>&nbsp;
                    <a href="http://www.metacafe.com/topics/bjbj/">Bjbj</a>&nbsp;
                    <a href="http://www.metacafe.com/topics/ciao/">Ciao</a>&nbsp;                   <br/>
                    Categories: <a href='http://www.metacafe.com/videos/entertainment/'>Entertainment</a>
               </p>

        <br>

I need to get this output instead (all the other elements need to be removed):

Dad Challenges Kids to Climb Walls to Get Candy
Nick Dietz compiles some of the week's best viral videos, 
including an elephant trying really hard to break a stick, a cat
sunbathing and kids climbing up the walls to get candy. Plus, 
making  music with a Ford Fiesta.

I don't know how to proceed to get this result.

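【Comments】:

  • It's HTML... you are already using DOM operations to get the XML node. It's a simple extension to tear apart the HTML inside that node and pull out only the parts you want.
  • Can you give me an example?
  • Note that passing LIBXML_NOCDATA to SimpleXML is completely unnecessary; as long as you ask for the string content of an element, all CDATA and text nodes are flattened appropriately. If you are doing something other than echo, the syntax to force a variable to a string is (string)$var, e.g. $html = (string)$xml->channel->item->description.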
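
The point of the last comment can be checked with a minimal sketch. The inline feed below is a hypothetical stand-in for the real RSS document, and the flag is deliberately left out:

```php
<?php
// Hypothetical inline feed standing in for the real RSS document.
$rss = <<<'XML'
<rss version="2.0">
  <channel>
    <item>
      <title>Example title</title>
      <description><![CDATA[<p>Example text</p>]]></description>
    </item>
  </channel>
</rss>
XML;

// No LIBXML_NOCDATA here: the string cast flattens CDATA and text nodes anyway.
$xml = simplexml_load_string($rss);

$title = (string) $xml->channel->item->title;
$html  = (string) $xml->channel->item->description;

var_dump($title); // string(13) "Example title"
var_dump($html);  // string(19) "<p>Example text</p>"
```

The cast matters as soon as the value is stored, compared, or passed on, because without it the variable still holds a SimpleXMLElement object.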

Tags: php xml-parsing simplexml


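【Solution 1】:

The reason you get elements in the description is the CDATA section. For an XML parser, the content of a CDATA section is always text. Elements like <p> are not read into the DOM structure.

A simple strip_tags() will remove all elements. For more control, you need to load the HTML fragment into a DOM: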

$html = <<<'HTML'
<a href="http://www.metacafe.com/watch/cb-M0fIp1ctKtsn/dad_challenges_kids_to_climb_walls_to_get_candy/"><img src="http://s3.mcstatic.com/thumb/11150410/28824820/4/directors_cut/0/1/dad_challenges_kids_to_climb_walls_to_get_candy.jpg?v=1" align="right" border="0" alt="Dad Challenges Kids to Climb Walls to Get Candy" vspace="4" hspace="4" width="134" height="78" /></a>
                <p>
                Nick Dietz compiles some of the week's best viral videos, 
                including an elephant trying really hard to break a stick, a cat
                sunbathing and kids climbing up the walls to get candy. Plus, 
                making  music with a Ford Fiesta.                              
                <br>Ranked <strong>4.00</strong> / 5 | 78 views | <a href="http://www.metacafe.com/watch/cb-M0fIp1ctKtsn/dad_challenges_kids_to_climb_walls_to_get_candy/">0 comments</a><br/>
                </p>
                <p>
                 <a href="http://www.metacafe.com/watch/cb-M0fIp1ctKtsn/dad_challenges_kids_to_climb_walls_to_get_candy/"><strong>Click here to watch the video</strong></a> (02:38)<br/>
                    Submitted By:                       <a href="http://www.metacafe.com/channels/CBS/">CBS</a><br/>
                    Tags:
                    <a href="http://www.metacafe.com/topics/penna/">Penna</a>&nbsp;                 <br/>
                    Categories: <a href='http://www.metacafe.com/videos/entertainment/'>Entertainment</a>
               </p>

        <br>
HTML;

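// Parse the HTML fragment; the parser wraps it in html/body elements.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// string() returns the first text node of the first p as a plain string.
$content = $xpath->evaluate("string(//p[1]/text())");
var_dump($content);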

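XPath expression

//p[1]/text() selects the text nodes that are direct children of the first p element. The string() function converts the first node of that set to a string, so the result is the text up to the first child element (here the <br>). If no such node exists, the expression returns an empty string.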
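
The returned string still carries the indentation and line breaks of the markup. One possible refinement, sketched here on a shortened hypothetical fragment rather than the real feed item, is to let XPath's normalize-space() do the trimming. Be aware that it also collapses inner whitespace runs into single spaces, which may or may not be wanted:

```php
<?php
// Hypothetical, shortened description fragment used only for illustration.
$html = <<<'HTML'
<a href="http://example.com/watch/"><img src="http://example.com/thumb.jpg" alt="Thumb" /></a>
<p>
    First line of the summary,
    second line of the summary.
    <br>Ranked <strong>4.00</strong> / 5
</p>
<p>Submitted By: <a href="http://example.com/channels/CBS/">CBS</a></p>
HTML;

$dom = new DOMDocument();
// Feed HTML is often not well-formed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// normalize-space() trims both ends and collapses inner whitespace runs.
$summary = $xpath->evaluate('normalize-space(//p[1]/text()[1])');

var_dump($summary); // "First line of the summary, second line of the summary."
```

If the original line breaks should be preserved, applying trim() to the result of the string() expression is the gentler option.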

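【Comments】:

  • Edited and added an example.
  • One more question, please: what if I only want to get the anchor text of the tags? I mean: Penna, Bjbj, Ciao. Thanks for your valuable help with my project!
  • $xpath->evaluate("//a"); will return a DOMNodeList of DOMElement nodes. You can iterate over it with foreach() and read the nodeValue property.
  • "Escaped elements are converted (&gt; back to >)" - unless I am misunderstanding, this is wrong: CDATA keeps all data as-is until it reaches the closing ]]>.
  • Escaping ]]> as ]]&gt; inside a CDATA block will not work. The & is still a literal there, so you would just get a literal ]]&gt; stackoverflow.com/questions/538163/…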
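
For the follow-up question about the tag names, the foreach() approach from the comments might look like the sketch below. Note that "//a" alone would also match the channel and category links, so this version narrows the selection. The filter on "/topics/" is an assumption based on the URLs in the sample output, not something the feed guarantees:

```php
<?php
// Fragment modelled on the "Tags:" part of the feed description.
$html = <<<'HTML'
<p>
    Submitted By: <a href="http://www.metacafe.com/channels/CBS/">CBS</a><br/>
    Tags:
    <a href="http://www.metacafe.com/topics/penna/">Penna</a>&nbsp;
    <a href="http://www.metacafe.com/topics/bjbj/">Bjbj</a>&nbsp;
    <a href="http://www.metacafe.com/topics/ciao/">Ciao</a>&nbsp;<br/>
    Categories: <a href="http://www.metacafe.com/videos/entertainment/">Entertainment</a>
</p>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Assumption: tag links are exactly the anchors whose href contains "/topics/".
$tags = [];
foreach ($xpath->evaluate('//a[contains(@href, "/topics/")]') as $anchor) {
    $tags[] = trim($anchor->nodeValue);
}

echo implode(', ', $tags), "\n"; // Penna, Bjbj, Ciao
```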