How to parse the text out of HTML in C#

【Question title】: How to parse the text out of html in c#
【Posted】: 2012-09-12 16:32:20
【Question description】:

I have an HTML expression like this:

 "This is <h4>Some</h4> Text" + Environment.NewLine +
 "This is some more <h5>text</h5>"

I only want to extract the text, so the result should be:

"This is Some Text" + Environment.NewLine +
 "This is some more text"

How can I do this?

【Discussion】:

stackoverflow.com/questions/1038431/…

Tags: c# html xml parsing

【Solution 1】:

Use HtmlAgilityPack:

string html = @"This is <h4>Some</h4> Text" + Environment.NewLine +
                "This is some more <h5>text</h5>";

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var str = doc.DocumentNode.InnerText;

【Discussion】:

【Solution 2】:

Simply use a regular expression: Regex.Replace(source, "<.*?>", string.Empty);
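
To make the one-liner above concrete, here is a minimal sketch that applies it to the input from the question. The class name `TagStripper` and the method name `StripTags` are hypothetical names chosen for illustration; only `Regex.Replace` and the pattern come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Removes every span that starts with '<' and ends at the next '>'.
    // The lazy quantifier (.*?) stops at the first '>' it finds.
    public static string StripTags(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    public static void Main()
    {
        string html = "This is <h4>Some</h4> Text" + Environment.NewLine +
                      "This is some more <h5>text</h5>";

        // Prints the two lines of text with the h4/h5 tags removed.
        Console.WriteLine(StripTags(html));
    }
}
```

Note that `.` does not match a line feed by default, so a tag that spans a line break would be left in place unless `RegexOptions.Singleline` is passed.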

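【Discussion】:

Regular expressions can be problematic. Try this one, for example: <h4 title='e>Sh<opping'>it happens</h4>
Yes, that's true. I actually wouldn't recommend using a regex everywhere (prefer solutions that others have already tested). Still, for simpler cases a regex like this is enough.
Yes ;) My regex works fine on properly escaped XHTML, but I see your point. My answer is not a general solution (like yours) but a solution for a specific case.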
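
The objection raised in this discussion can be reproduced with a short sketch. It runs the same pattern on the commenter's input, whose attribute value contains unescaped `<` and `>` characters. The class name `RegexPitfallDemo` and the method name `StripTags` are hypothetical, used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfallDemo
{
    // Same pattern as in the answer: '<', as few characters as possible, then '>'.
    public static string StripTags(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    public static void Main()
    {
        // The element's text content is just "it happens", but the title
        // attribute contains '>' and '<', which confuses the pattern.
        string tricky = "<h4 title='e>Sh<opping'>it happens</h4>";

        // The first match ends at the '>' inside the attribute value, so the
        // fragment "Sh" from the attribute leaks into the result.
        Console.WriteLine(StripTags(tricky)); // prints "Shit happens"
    }
}
```

The pattern has no notion of quoted attribute values, which is why a real HTML parser is the safer choice for input that is not known to be properly escaped.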