【Posted】: 2023-03-15 15:47:01
【Problem description】:
I am trying to identify the text nodes in HTML text that is formatted like the following:
Sample text 1: <strong>[Hot Water][Steam][Electric]</strong> Preheating Coil
Sample text 2: <b><span>[Steam] [Natural Gas Fired] [Electric] [Steam to steam]</span></b><span> Humidifier</span><br>
using the code below:
public static string IdentifyHTMLTagsAndRemove(string htmlText)
{
    _ = htmlText ?? throw new ArgumentNullException(nameof(htmlText));
    var document = new HtmlDocument();
    document.LoadHtml(htmlText);
    var rootNode = document.DocumentNode;
    // get first and last text nodes
    var nonEmptyTextNodes = rootNode.SelectNodes("//text()[not(self::text())]") ?? new HtmlNodeCollection(null);
    //if (nonEmptyTextNodes.Count == 0)
    //{
    //    return rootNode.OuterHtml;
    //}
    if (nonEmptyTextNodes.Count > 0)
    {
        var firstTextNode = nonEmptyTextNodes[0];
        var lastTextNode = nonEmptyTextNodes[^1];
        // get all br nodes in html string
        var breakNodes = rootNode.SelectNodes("//br") ?? new HtmlNodeCollection(null);
        var lastTextNodeLengthIndex = lastTextNode.OuterStartIndex + lastTextNode.OuterLength;
        foreach (var breakNode in breakNodes)
        {
            if (breakNode == null)
                continue;
            // check index of br nodes against first and last text nodes
            // and remove br nodes that sit outside text nodes
            if (breakNode.OuterStartIndex <= firstTextNode.OuterStartIndex
                || breakNode.OuterStartIndex >= lastTextNodeLengthIndex)
            {
                breakNode.Remove();
            }
        }
    }
    return rootNode.OuterHtml;
}
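But it always fails at this line:
    var nonEmptyTextNodes = rootNode.SelectNodes("//text()[not(self::text())]") ?? new HtmlNodeCollection(null);
nonEmptyTextNodes always has a count of zero, and I am not sure what I am doing wrong in the code above.
Can anyone point me in the right direction? Many thanks.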
【Discussion】:
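A minimal sketch of why the count may come back as zero, assuming the intent was to select text nodes that contain non-whitespace content. The predicate not(self::text()) can never be true for a node that text() has already selected, so the expression matches nothing. The replacement predicate normalize-space() and the class name TextNodeDemo are assumptions for illustration and are not part of the original code.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical demo class, not part of the original question.
public static class TextNodeDemo
{
    public static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml("<strong>[Hot Water][Steam][Electric]</strong> Preheating Coil");
        var rootNode = document.DocumentNode;

        // Original predicate: every node selected by text() is a text node,
        // so not(self::text()) filters all of them out.
        var original = rootNode.SelectNodes("//text()[not(self::text())]");
        Console.WriteLine($"original predicate: {original?.Count ?? 0} node(s)");

        // Assumed intent: keep text nodes whose content is not only whitespace.
        // SelectNodes returns null when nothing matches, hence the fallback.
        var nonEmptyTextNodes = rootNode.SelectNodes("//text()[normalize-space()]")
                                ?? new HtmlNodeCollection(null);
        foreach (var textNode in nonEmptyTextNodes)
        {
            Console.WriteLine($"'{textNode.InnerText}' starts at index {textNode.OuterStartIndex}");
        }
    }
}
```

If that assumption about the intent holds, only the XPath string in the original method would need to change; the first/last text node lookup and the br index comparison could stay as written.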
Tags: c# .net xml html-agility-pack