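【Posted】: 2014-06-18 11:46:20
【Question】:
Below is an arbitrary, unpredictable set of tags wrapped in a div. How can I break out the content of every child tag into an array while preserving the order in which the pieces appear?
Note: for img and iframe tags, only the URL (the src attribute) needs to be extracted.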
<div>
<p>para-1</p>
<p>para-2</p>
<p>
text-before-image
<img src="text-image-src"/>
text-after-image</p>
<p>
<iframe src="p-iframe-url"></iframe>
</p>
<iframe src="iframe-url"></iframe>
<h1>header-1</h1>
<img src="image-url"/>
<p>
<img src="p-image-url"/>
</p>
content not wrapped within any tags
<h2>header-2</h2>
<p>para-3</p>
<ul>
<li>list-item-1</li>
<li>list-item-2</li>
</ul>
<span>span-content</span>
content not wrapped within any tags
</div>
Expected array:
["para-1","para-2","text-before-image","text-image-src","text-after-image",
"p-iframe-url","iframe-url","header-1","image-url",
"p-image-url","content not wrapped within any tags","header-2","para-3",
"list-item-1","list-item-2","span-content","content not wrapped within any tags"]
Relevant code:
$dom = new DOMDocument();
@$dom->loadHTML( $content );
$tags = $dom->getElementsByTagName( 'p' );
// Get all the paragraph tags, to iterate over their nodes.
$j = 0;
foreach ( $tags as $tag ) {
    // get_inner_html() preserves the node's text and tags
    $con[ $j ] = $this->get_inner_html( $tag );
    // Check whether the node has HTML content or not
    if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {
        // Check whether the node contains HTML along with plain text outside any tags
        if ( $tag->nodeValue != '' ) {
            /*
             * DOM to get the image src of a node
             */
            $domM = new DOMDocument();
            /*
             * Setting encoding type http://in1.php.net/domdocument.loadhtml#74777
             * Set after initializing DOMDocument();
             */
            $con[ $j ] = mb_convert_encoding( $con[ $j ], 'HTML-ENTITIES', "UTF-8" );
            @$domM->loadHTML( $con[ $j ] );
            $y = new DOMXPath( $domM );
            foreach ( $y->query( "//img" ) as $node ) {
                $con[ $j ] = "img=" . $node->getAttribute( "src" );
                // Grow the array to accommodate bare text and image tags.
                $j++;
                // Index incremented; fetch the node value to store the text outside any tags.
                $con[ $j ] = $tag->nodeValue;
            }
            $domC = new DOMDocument();
            @$domC->loadHTML( $con[ $j ] );
            $z = new DOMXPath( $domC );
            foreach ( $z->query( "//iframe" ) as $node ) {
                $con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
                // Grow the array to accommodate bare text and image tags.
                $j++;
                // Index incremented; fetch the node value to store the text outside any tags.
                $con[ $j ] = $tag->nodeValue;
            }
        } else {
            /*
             * DOM to get the image src of a node
             */
            $domA = new DOMDocument();
            @$domA->loadHTML( $con[ $j ] );
            $x = new DOMXPath( $domA );
            foreach ( $x->query( "//img" ) as $node ) {
                $con[ $j ] = "img=" . $node->getAttribute( "src" );
            }
            if ( $con[ $j ] != strip_tags( $con[ $j ] ) ) {
                foreach ( $x->query( "//iframe" ) as $node ) {
                    $con[ $j ] = "vid=http:" . $node->getAttribute( "src" );
                }
            }
        }
    }
    // Increment the index for the next paragraph
    $j++;
}
$this->content = $con;
【Discussion】:
-
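@jeroen Using the DOM API, I managed to extract the innerHTML of only the p tags while preserving their order of appearance. But it fails when tags other than p are present.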
-
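Why not just strip_tags()? It strips out all the contained HTML and leaves only the text, in the order in which the HTML/text appears in the file.
-
@MarcB With strip_tags() alone, what happens to the iframe and image paths?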
-
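You don't want to get the "innerHTML". You want to retrieve the values of "attributes" (e.g. the iframe src) and the "text content" of elements. Those keywords should help you move forward.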
-
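If you showed us the relevant part of your code, we would have a better idea of which approach you chose, and, unless the problem is (only) a conceptual one, we might even be able to spot the bug.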
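-
To make the "attributes and text content" suggestion above concrete, here is a minimal sketch, not a definitive implementation. It walks the parsed tree in document order, collecting trimmed text nodes and the src of img/iframe elements. The function names `collect_pieces()` and `extract_pieces()` are hypothetical and do not come from the question. It assumes the markup is non-empty and that PHP's libxml-based parser can load it.

```php
<?php
/**
 * Hypothetical helper: append text pieces and img/iframe src values
 * found under $node to $out, in document order.
 */
function collect_pieces(DOMNode $node, array &$out): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Keep text nodes, skipping whitespace-only ones between tags.
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $out[] = $text;
            }
        } elseif ($child instanceof DOMElement) {
            $name = strtolower($child->nodeName);
            if ($name === 'img' || $name === 'iframe') {
                // For these tags only the URL is wanted, not any content.
                if ($child->hasAttribute('src')) {
                    $out[] = $child->getAttribute('src');
                }
            } else {
                // Any other element: descend, whatever its tag name is.
                collect_pieces($child, $out);
            }
        }
    }
}

/**
 * Hypothetical entry point: parse an HTML fragment and flatten it.
 * Assumes $html is a non-empty string.
 */
function extract_pieces(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings quietly instead of using the @ operator.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out = [];
    if ($dom->documentElement !== null) {
        collect_pieces($dom->documentElement, $out);
    }
    return $out;
}

// Usage with a shortened version of the markup from the question.
$html = '<div><p>para-1</p><p>text-before-image'
      . '<img src="text-image-src"/>text-after-image</p>'
      . '<iframe src="iframe-url"></iframe>'
      . 'content not wrapped within any tags'
      . '<ul><li>list-item-1</li></ul></div>';
print_r(extract_pieces($html));
```

Because the walk recurses into every element rather than selecting only p tags, it does not depend on which tags happen to be present, which is where the getElementsByTagName( 'p' ) approach breaks down. Non-ASCII input would still need an encoding hint (the question's mb_convert_encoding() step, for example) before loadHTML().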