【Question title】: Extracting URLs using regex in .NET
【Posted】: 2010-01-31 23:37:58
【Question description】:

I took inspiration from the example at csharp-online and want to retrieve all the URLs from this page: alexa

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.Text.RegularExpressions;
namespace ExtractingUrls
{
    class Program
    {
        static void Main(string[] args)
        {
            WebClient client = new WebClient();
            const string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
            string source = client.DownloadString(url);
            //Console.WriteLine(Getvals(source));
            string matchPattern =
                    @"<a.rel=""nofollow"".style=""font-size:0.8em;"".href=[""'](?<url>[^""^']+[.]*)[""'].class=""offsite"".*>(?<name>[^<]+[.]*)</a>";
            foreach (Hashtable grouping in ExtractGroupings(source, matchPattern, true))
            {
                foreach (DictionaryEntry DE in grouping)
                {
                    Console.WriteLine("Value = " + DE.Value);
                    Console.WriteLine("");
                }
            }
            // End.
            Console.ReadLine();
        }
        public static ArrayList ExtractGroupings(string source, string matchPattern, bool wantInitialMatch)
        {
            ArrayList keyedMatches = new ArrayList();
            int startingElement = 1;
            if (wantInitialMatch)
            {
                startingElement = 0;
            }
            Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
            MatchCollection theMatches = RE.Matches(source);
            foreach (Match m in theMatches)
            {
                Hashtable groupings = new Hashtable();
                for (int counter = startingElement; counter < m.Groups.Count; counter++)
                {
                    // If we had just returned the MatchCollection directly, the
                    // GroupNameFromNumber method would not be available to use
                    groupings.Add(RE.GroupNameFromNumber(counter),
                    m.Groups[counter]);
                }
                keyedMatches.Add(groupings);
            }
            return (keyedMatches);
        }
    }
}

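But I have run into a problem: when I run this, each URL is displayed three times. First the entire anchor tag is shown, then the URL is shown twice. Can anyone suggest what I should correct so that each URL is displayed only once?

【Comments】:

Tags: c# .net regex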


【Solution 1】:

Use HTML Agility Pack to parse the HTML. I think it will make your problem much easier to solve.

Here is one way to do it:

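// Requires the HtmlAgilityPack package and "using HtmlAgilityPack;".
WebClient client = new WebClient();
string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
string source = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(source);
// SelectNodes returns null (not an empty collection) when nothing matches,
// so check before iterating.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href and @rel='nofollow']");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}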

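【Discussion】:

    【Solution 2】:

    Your regex contains two groups in addition to the whole match. If I am reading this correctly, you only want the URL part of the match, which is the second of the 3 groupings...

    Instead of this: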

    for (int counter = startingElement; counter < m.Groups.Count; counter++)
                {
                    // If we had just returned the MatchCollection directly, the
                    // GroupNameFromNumber method would not be available to use
                    groupings.Add(RE.GroupNameFromNumber(counter),
                    m.Groups[counter]);
                }
    

    Don't you want this?

    groupings.Add(RE.GroupNameFromNumber(1),m.Groups[1]);
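
The same idea can also be written by reading the group by its name rather than its index. The following is only a minimal sketch under stated assumptions: the sample markup, the simplified pattern, and the `ExtractUrls` helper are all made up for illustration and are not the asker's page or pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class UrlOnlyDemo
{
    // Hypothetical helper: collects only the named "url" group of each match,
    // ignoring the whole match (group 0) and the "name" group.
    public static List<string> ExtractUrls(string html, string pattern)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    static void Main()
    {
        // Made-up sample markup, not the real Alexa page.
        const string sample =
            "<a href=\"http://example.com/a\">First</a> <a href='http://example.org/b'>Second</a>";
        // Simplified pattern for illustration only.
        const string pattern = @"<a\s+href=[""'](?<url>[^""']+)[""'][^>]*>(?<name>[^<]+)</a>";

        foreach (string url in ExtractUrls(sample, pattern))
        {
            Console.WriteLine(url);
        }
    }
}
```

Reading `m.Groups["url"]` keeps the code independent of the group's position, so it should keep working if more groups are later added to the pattern.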
    

    【Discussion】:

      【Solution 3】:
      int startingElement = 1;
      if (wantInitialMatch)
      {
      startingElement = 0;
      }
      

      ...

      for (int counter = startingElement; counter < m.Groups.Count; counter++)
      {
      // If we had just returned the MatchCollection directly, the
      // GroupNameFromNumber method would not be available to use
          groupings.Add(RE.GroupNameFromNumber(counter),
          m.Groups[counter]);
      }
      

      You passed wantInitialMatch = true, so your for loop returns:

      m.Groups[0] // entire match
      m.Groups[1] // (?<url>[^""^']+[.]*) href part
      m.Groups[2] // (?<name>[^<]+[.]*) link text
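
To see this group layout directly, the groups of a single match can be listed together with their names. This is a rough sketch, assuming a made-up one-link sample and a simplified pattern; `DescribeGroups` is a hypothetical helper, not part of the question's code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class GroupLayoutDemo
{
    // Hypothetical helper: describes every group of the first match
    // as "index:name=value", starting with group 0 (the whole match).
    public static List<string> DescribeGroups(string input, string pattern)
    {
        var regex = new Regex(pattern);
        Match m = regex.Match(input);
        var lines = new List<string>();
        if (!m.Success)
        {
            return lines;
        }
        for (int i = 0; i < m.Groups.Count; i++)
        {
            lines.Add(i + ":" + regex.GroupNameFromNumber(i) + "=" + m.Groups[i].Value);
        }
        return lines;
    }

    static void Main()
    {
        // Made-up sample markup and simplified pattern, for illustration only.
        const string sample = "<a href=\"http://example.com/a\">First</a>";
        const string pattern = @"<a\s+href=[""'](?<url>[^""']+)[""'][^>]*>(?<name>[^<]+)</a>";

        foreach (string line in DescribeGroups(sample, pattern))
        {
            Console.WriteLine(line);
        }
    }
}
```

In the question's code, passing false for wantInitialMatch starts the loop at index 1 and skips the whole match, but the loop would still add both the url and the name groups.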
      

      【Discussion】:
