【发布时间】:2011-09-22 15:04:20
【问题描述】:
我从服务器收到如下所示的 html。我通过使用 XPath exp @"//text()" 并将“nodeContent”值附加到字符串来重建文本部分。代码是这样的:
for (int i=2; i<[resultXPathQuery count]; i++) {
[mytext appendString:[[resultXPathQuery objectAtIndex:i] objectForKey:@"nodeContent"]];
[mytext appendString:@"\n"];
}
我得到:
Line 1
line 2
line 3
line 4
如何在考虑空节点的情况下构建文本部分?
我想获得:
Line 1
line 2
line 3
line 4
<html><head><title>A title</title><style type="text/css">
ol{margin:0;padding:0}p{margin:0}
.c0{font-size:12pt;background-color:#ffffff;font-family:Times New Roman}
.c6{width:432.0pt;background-color:#ffffff;padding:72.0pt 90.0pt 72.0pt 90.0pt}
.c7{color:#aaaaaa;font-family:Times New Roman}
.c3{color:#0000ee;text-decoration:underline}
.c5{color:inherit;text-decoration:inherit}
.c2{font-size:12pt;font-family:Times New Roman}
.c4{height:12pt}.c1{direction:ltr}
body{color:#000000;font-size:12pt;font-family:Times New Roman}
h1{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:24pt;font- family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}
h2{padding-top:11.25pt;line-height:1.0;text-align:left;color:#000000;font-size:18pt;font-family:Times New Roman;font-weight:bold;padding-bottom:11.25pt}
h3{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:14pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}
h4{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:12pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}
h5{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:9pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}
h6{padding-top:18.0pt;line-height:1.0;text-align:left;color:#000000;font-size:8pt;font-family:Times New Roman;font-weight:bold;padding-bottom:18.0pt}</style>
</head>
<body class="c6">
<p class="c1"><span class="c2">A title</span></p>
<p class="c1 c4"><span class="c2"></span></p>
<p class="c4 c1"><span class="c2"></span></p>
<p class="c1"><span class="c7">Line 1</span></p>
<p class="c1"><span class="c7">line 2</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c1"><span class="c7">line 3</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c3 c2"><span class="c1"></span></p>
<p class="c1"><span class="c7">line 4</span></p>
</body></html>
编辑
真的,我注意到 html 可能更“复杂”,因此选择所有 span 元素或 p 元素是不够的。此外,更多的 span 元素可以出现在同一个 p 元素中,所以在这种情况下,我不必在我的字符串中创建新行。
这是一个更复杂的返回 html 的主体:
<body class="c13">
<p class="c5"><span>gfgfgfd</span></p>
<p class="c1"><span></span></p>
<p class="c5 c10"><span>ghhgfhgfh hghg hgkfhjgk ghjgkh ghjgjhg gjhjg gjhj gjhgjhgjhg gfhjkgjg jghjgfhjgf fghfj jghfj fghjggf jhgjgjgkjg</span></p>
<p class="c1 c10"><span></span></p>
<p class="c4"><span>gfgfgfd</span></p>
<p class="c4"><span>f</span></p>
<p class="c4">
<span>gfdgfdg</span>
<span class="c7">hg</span></p>
<p class="c4"><span class="c7">ghgfhgfh</span></p>
<p class="c4"><span class="c7">gfhgfhgf</span></p>
<p class="c5">
<span class="c7">hgfh </span>
<span class="c0">gfdgfg</span></p>
<p class="c5"><span class="c0">fgfdgfdgfd</span></p>
<p class="c5"><span class="c0">gdfgdfgfd</span></p>
<p class="c5"><span class="c0">gfgf</span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0 c8"><a class="c12" href="http://www.google.com">www.google.com</a></span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0">fgfdgfdg</span></p>
<p class="c5">
<span class="c0">fgffgfdgfg</span>
<span class="c0 c11">gfgfdgfd fgd fd</span>
<span class="c0">fdgfdg</span></p>
<p class="c5"><span class="c0">fgfdgfdgf</span></p>
<p class="c5"><span class="c0">gfd</span></p>
<p class="c5"><span class="c0">gfgf</span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0 c8"><a class="c12" href="mailto:….">...</a></span></p>
<p class="c1"><span class="c0"></span></p>
<ol class="c9" start="1">
<li class="c3"><span class="c0">gfgfd</span></li>
<li class="c3"><span class="c0">gfdgfd</span></li>
<li class="c3"><span class="c0">gfdgfd</span></li>
<li class="c3"><span class="c0">gdfgfd</span></li>
</ol>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0">hgfhgf</span></p>
<p class="c5"><span class="c0">gfhgfh</span></p>
<p class="c5"><span class="c0">hgfhgf</span></p>
<p class="c1"><span class="c0"></span></p>
<ol class="c2" start="1">
<li class="c3"><span class="c0">gfhg</span></li>
<li class="c3"><span class="c0">hgfh</span></li>
<li class="c3"><span class="c0">hgf</span></li>
</ol>
<p class="c1"><span class="c0"></span></p>
<h1 class="c5 c15"><a name="h.kafwflosthlg"></a><span class="c7 c14">hgfhgfh</span></h1>
<p class="c1"><span class="c6"></span></p>
<p class="c1"><span class="c6"></span></p>
<p class="c1"><span class="c6"></span></p>
</body>
我需要一个 XPath 表达式来选择 p、h1、h2、...、h6、li 元素,并以正确检测新行和空行的方式考虑内部文本部分。
【问题讨论】:
标签: xpath