Reading web page text content using Xml document: answers

【Question title】: Reading web page text content using Xml document
【Posted】: 2017-02-13 20:08:44
【Question description】:

I am trying to read the text of a web page using an XmlDocument:

XmlDocument document = new XmlDocument();
string site = "https://emailhunter.co/search/a-bs.com";
document.Load(site);
string allText = document.InnerText;

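This is the exception I get:

An unhandled exception of type 'System.Xml.XmlException' occurred in System.Xml.dll. Additional information: The ';' character, hexadecimal value 0x3B, cannot be included in a name. Line 5, position 383.

I really don't understand what is wrong here. I would appreciate any advice.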

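【Question discussion】:

That URL does not link to an XML document but to an HTML document. I also think you are trying to read content from a login-protected URL.
But an HTML document is an XML document, isn't it?
No, HTML is not XML. See here: stackoverflow.com/questions/5472162/how-to-read-html-as-xml
That page is not login-protected.

Tags: c# xml webpage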

【Solution 1】:

You can use Html Agility Pack, as described in this post: What is the best way to parse html in C#?
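
As a rough illustration of that suggestion, here is a minimal sketch assuming the third-party HtmlAgilityPack NuGet package is installed. The class name PageText, the helper ExtractText, and the sample markup are made up for this example and are not from the original post. It parses HTML leniently, so markup that XmlDocument rejects (such as an unclosed tag) should not throw:

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package "HtmlAgilityPack" (assumed to be installed)

public static class PageText
{
    // Hypothetical helper: parses an HTML string leniently and returns its text content.
    public static string ExtractText(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        // DeEntitize converts entities such as &amp; back into plain characters.
        return HtmlEntity.DeEntitize(document.DocumentNode.InnerText);
    }

    public static void Main()
    {
        // Made-up sample markup; the unclosed <br> is not well-formed XML.
        string html = "<html><body><p>Hello<br>world &amp; more</p></body></html>";
        Console.WriteLine(ExtractText(html));

        // To read a live page instead, HtmlWeb can download and parse it:
        // var page = new HtmlWeb().Load("https://emailhunter.co/search/a-bs.com");
        // string allText = page.DocumentNode.InnerText;
    }
}
```

The commented-out lines mirror the code from the question, with HtmlWeb.Load standing in for XmlDocument.Load. Whether the returned text is clean enough probably depends on the page, since InnerText on the document root may also pick up script or style content.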

【Discussion】: