Extracting URL links within a downloaded txt file

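【Question title】: Extracting url links within a downloaded txt file
【Posted】: 2015-12-29 15:54:40
【Question description】:

I'm currently building a URL extractor for work. I'm trying to extract all the http/href links from a downloaded HTML file and print them, on their own, to a separate txt file. So far I've managed to download the entire HTML of a page; the problem is extracting the links from it and printing them using Regex. Could anyone help me with this?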

     private void button2_Click(object sender, EventArgs e)
    {
        Uri fileURI = new Uri(URLbox2.Text);

        WebRequest request = WebRequest.Create(fileURI);
        request.Credentials = CredentialCache.DefaultCredentials;
        WebResponse response = request.GetResponse();
        Console.WriteLine(((HttpWebResponse)response).StatusDescription);
        Stream dataStream = response.GetResponseStream();
        StreamReader reader = new StreamReader(dataStream);
        string responseFromServer = reader.ReadToEnd();

        SW = File.CreateText("C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\response1.htm");
        SW.WriteLine(responseFromServer);

        SW.Close();

        string text = System.IO.File.ReadAllText(@"C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\response1.htm");
        string[] links = System.IO.File.ReadAllLines(@"C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\response1.htm");



        Regex regx = new Regex(links, @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

        MatchCollection mactches = regx.Matches(text);

        foreach (Match match in mactches)
        {
            text = text.Replace(match.Value, "<a href='" + match.Value + "'>" + match.Value + "</a>");
        }

        SW = File.CreateText("C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\Links.htm");
        SW.WriteLine(links);
    }

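【Question discussion】:

Question 1) Have you tried Googling the problem?
Can you clarify this sentence? I'm not 100% clear on what is working and which part is blocking you: "So far I've managed to download the entire HTML of a page; the problem is extracting the links from it and printing them using Regex."
Your code sample doesn't look like it attempts to do what you say you want to do. Did you copy and paste this from somewhere?
Does it have to be regex? The HTML Agility Pack makes this very easy, see here.
@Starceaker I can download the whole HTML web page, and within that page there are many href links that I'm trying to extract and print to a separate txt file.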
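
If a regex is still preferred for the task described in the last comment, a minimal sketch could look like the following. It uses only the .NET base class library. The class name `HrefRegexSketch`, the method `ExtractHrefs`, and the command-line paths are illustrative assumptions, not part of the original post. The pattern only covers quoted `href` attributes, so treat it as a rough approximation rather than a full HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class HrefRegexSketch
{
    // Matches href="..." or href='...' and captures the value in the group "url".
    // Assumption: attribute values are quoted; unquoted values are not handled.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?<q>[""'])(?<url>.*?)\k<q>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the distinct, non-empty href values found in the given markup.
    public static List<string> ExtractHrefs(string html)
    {
        return HrefPattern.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups["url"].Value)
            .Where(url => url.Length > 0)
            .Distinct()
            .ToList();
    }

    public static void Main(string[] args)
    {
        // Hypothetical usage: <input html file> <output txt file>
        if (args.Length < 2)
        {
            Console.Error.WriteLine("usage: HrefRegexSketch <input.htm> <links.txt>");
            return;
        }

        string html = File.ReadAllText(args[0]);
        // WriteAllLines writes one link per line, which is what the question asks for.
        File.WriteAllLines(args[1], ExtractHrefs(html));
    }
}
```

Compared with the code in the question, the pattern is passed as the only string argument to the `Regex` constructor, and the matched values (not the raw file lines) are what gets written out.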

Tags: c# html regex url

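【Solution 1】:

In case you're not aware, this can be done (quite easily) with one of the available HTML parser NuGet packages.

Personally, I use HtmlAgilityPack (together with another package, ScrapySharp) and AngleSharp.

With HtmlAgilityPack, just the few lines below get you every href in a document, whether it is loaded from disk or by an HTTP GET request: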

/*
  do not forget to include the usings:
  using System.Linq;
  using HtmlAgilityPack;
  using ScrapySharp.Extensions;
*/

// Since you have your HTML stored locally, load it from disk.
// Load reads from a file path; LoadHtml expects the markup itself as a string.
// P.S.: prefixing a file path string with @ saves you from having to escape the backslashes.
var doc = new HtmlDocument();
doc.Load(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");

// For an HTTP GET request instead:
// HtmlWeb w = new HtmlWeb();
// var doc = w.Load("yourAddressHere");

// CssSelect comes from ScrapySharp; the second argument is the default returned when href is missing.
var hrefs = doc.DocumentNode.CssSelect("a").Select(a => a.GetAttributeValue("href", ""));
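
To connect this back to the question, which also asks for the links to end up in a separate txt file, here is a hedged sketch that uses HtmlAgilityPack alone (XPath instead of ScrapySharp's `CssSelect`). The class name `HrefExport`, the method `ExtractHrefs`, and the command-line paths are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

public static class HrefExport
{
    // Parses the markup and returns the distinct, non-empty href values of all <a> elements.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // LoadHtml takes the markup itself, not a file path

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => !string.IsNullOrWhiteSpace(href))
            .Distinct()
            .ToList();
    }

    public static void Main(string[] args)
    {
        // Hypothetical usage: <input html file> <output txt file>
        if (args.Length < 2)
        {
            Console.Error.WriteLine("usage: HrefExport <response1.htm> <links.txt>");
            return;
        }

        var hrefs = ExtractHrefs(File.ReadAllText(args[0]));
        File.WriteAllLines(args[1], hrefs); // one link per line
    }
}
```

Relative links such as `/about` are written as they appear in the page; resolving them against the page's address would be a separate step.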

【Discussion】:

A simple solution that works better than any regex. +1
I'll have to start slowly moving all my code over to AngleSharp, since HtmlAgilityPack is no longer maintained.
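
For readers weighing the AngleSharp route mentioned above, the same extraction might be sketched as follows. This assumes a reasonably recent AngleSharp release in which `HtmlParser` lives in the `AngleSharp.Html.Parser` namespace and exposes `ParseDocument`; older releases used different names. The class name `AngleSharpHrefs` and the method `ExtractHrefs` are illustrative only.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

public static class AngleSharpHrefs
{
    // Parses the markup and returns the distinct, non-empty href values of all <a> elements.
    public static List<string> ExtractHrefs(string html)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        // The CSS selector a[href] only matches anchors that carry an href attribute.
        return document.QuerySelectorAll("a[href]")
            .Select(a => a.GetAttribute("href"))
            .Where(href => !string.IsNullOrWhiteSpace(href))
            .Distinct()
            .ToList();
    }
}
```

The result can be written to a txt file with `File.WriteAllLines`, exactly as with the other approaches.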