Advice on a certain regular expression (answers)

【Question title】: Advice on a certain regular expression
【Posted】: 2019-07-09 07:32:59
【Question description】:

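I am extracting links from email content, so I use Regex and String.Split to pull the important information out of the already parsed Content-type: text/html part.
Because I had never worked with regular expressions before, I used an online editor, fed it part of my email, and built my Regex pattern around it. It seems to work well for now, but my code is a mess because I don't fully understand what I wrote.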

My current approach to link extraction is to cut out certain parts of the email (the HTML tags) and then split the resulting string twice.

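This is the sample I tested my Regex against (it is exactly what the content looks like when I extract it as a string; I only replaced the links I used with similar examples):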

<div dir="ltr">

<div>Link text == link (link text would be changed to "Protected link"): 
    <a href="http://www.google.de" 
        target=5Fblank">
            Protected link
    </a>
</div>

<div>Link text != link (link text and link would be rewritten and not equal): 
    <a href="http://www.google.de">
        http://www.google.com
    </a>
</div>

<div>Link text != link (link would be rewritten but not link text):
    <a href="http://www.google.de">
        Click!
    </a>
</div>

<div>Link text != link (link would be not rewritten, in whitelist): 
    <a href="http://www.google.de">
        Click!
    </a>
</div>

<div>Link is not rewritten: 
    <a href="http://www.google.de">
        http://www.google.de
    </a>
</div>

<div>Link text != link (no protocol in link text and would be not rewritten): 
    <a href="http://www.google.de">
        www.google.de
    </a>
</div>

And the regular expression I use is this:

"(href=\"[a-zA-Z0-9-:/.=?]*\"*[a-zA-Z0-9=\" ]*)([>a-zA-Z0-9-:/.,;\"=!? \t\n]*)"

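After writing the extracted links and link texts into an array, I split them twice.
First at the > character, and then, if the extracted string starts with href=", at the " character.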

var linkParser = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
string[] links = new string[linkParser.Matches(text).Count];

int t = 0;
foreach (Match ma in linkParser.Matches(text))
{
    links[t] = ma.Value;
    t++;
}

var list = new List<String[]>();
string[] temp;

for (int i = 0; i < links.Length; i++)
{
    temp = links[i].Split('>');
    list.Add(temp);
}

var pairs = new List<String[]>();

for (int i = 0; i < list.Count; i++)
{
    string[] tmp = list[i];
    for (int j = 0; j < tmp.Length; j++)
    {
        if (tmp[j].StartsWith("href=\""))
        {
            pairs.Add(new String[]
            {
                tmp[j].Split(new string[]
                {
                    "href=\""
                }, StringSplitOptions.None)[1].Split('"')[0], tmp[j + 1]
            });
        }
    }
}

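【Question comments】:

You may find this easier to solve with an HTML DOM parser such as HTML Agility Pack.
When I tried to open their website yesterday, it would not load (maintenance, I guess), so I tried to solve the problem with a regular expression. I will look into it later and see whether it helps me. My approach works for me right now, so I can use it to get the response codes of the links, but I can see that my pattern will run into trouble at some point in the future.
You can (and probably should) install HTML Agility Pack via NuGet: nuget.org/packages/HtmlAgilityPack
I always install packages via NuGet, but I need the documentation to make full use of it.
The website is up now, at least for me. If it fails, you can still find good examples similar to your needs here on Stack Overflow, or an "introduction" blog post somewhere.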

Tags: c# .net regex

【Solution 1】:

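Since the link is enclosed in double quotes, you can simplify your regular expression to (href="[^"]+"). To clarify: it matches href=", then one or more characters of any kind except a double quote ", then the closing double quote ".
You can also use groups to get the link directly instead of splitting or replacing strings.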
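As a minimal sketch of the group idea, assuming every href value is wrapped in double quotes: the capture group holds only the URL, so no Split call is needed. The class and method names (`HrefExtractor`, `ExtractHrefs`) are hypothetical and are not part of the original answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Group 1 captures everything between href=" and the next double quote.
    private static readonly Regex HrefPattern =
        new Regex("href=\"([^\"]+)\"", RegexOptions.IgnoreCase);

    // Returns the href values in the order they appear in the input.
    public static List<string> ExtractHrefs(string html)
    {
        var result = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

This only yields the link itself, not the link text.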
Since you need the link and its text, try the following:
Edited:

var matches = Regex.Matches(str, "<a href=\"(?<link>[^ \"]+)\"[^>]*>(?<text>(.|\n)*?)(?=(<\\/a>))<\\/a>");
for(int i = 0; i < matches.Count; i++)
{
    Console.WriteLine($"{matches[i].Groups["link"].Value} {matches[i].Groups["text"].Value}");
}

The output of this is:

http://www.google.de             Protected link    
http://www.google.de         http://www.google.com    
http://www.google.de         Click!    
http://www.google.de         Click!    
http://www.google.de         http://www.google.de    
http://www.google.de         www.google.de

I hope this is what you are looking for.
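The captured link text still carries the line breaks and indentation that surround it in the HTML. The sketch below is one possible variation, not the answerer's exact pattern: it uses RegexOptions.Singleline so that `.` also matches newlines (in place of `(.|\n)`), drops the lookahead, and trims each text. `LinkPairExtractor` and `ExtractPairs` are hypothetical names, and the pattern assumes double-quoted href values and no nested anchors.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkPairExtractor
{
    // <a ... href="LINK" ...>TEXT</a>; Singleline lets '.' span line breaks.
    private static readonly Regex AnchorPattern = new Regex(
        "<a\\s[^>]*?href=\"(?<link>[^\"]+)\"[^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns (link, trimmed link text) pairs in document order.
    public static List<(string Link, string Text)> ExtractPairs(string html)
    {
        var pairs = new List<(string Link, string Text)>();
        foreach (Match m in AnchorPattern.Matches(html))
        {
            pairs.Add((m.Groups["link"].Value, m.Groups["text"].Value.Trim()));
        }
        return pairs;
    }
}
```

Each pair can then be compared directly, for example to check whether the visible text differs from the href target. For anything beyond simple, well-formed markup, an HTML parser is likely to be more robust than a pattern like this.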

【Comments】:

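This lets me get the link, but not the link text that I need for the security check.
What do you mean by "the link but not the link text"?
google.de">Click!</a> The href is the link, and "Click!" is the link text.
It seems to work in some specific cases. Here is a pastebin link to the output I got from my example, the one I tested on above: pastebin.com/pKECZbTu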