Extract the contents of a string between two string delimiters using Match in C#

【Question title】: Extract the contents of a string between two string delimiters using match in C#
【Posted】: 2012-06-13 13:09:53
【Question description】:

So, suppose I am parsing the following HTML string:

<html>
    <head>
        RANDOM JAVASCRIPT AND CSS AHHHHHH!!!!!!!!
    </head>
    <body>
        <table class="table">
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
            <tr><a href="/subdir/members/Name">Name</a></tr>
        </table>
    </body>
</html>

I want to isolate the contents of <table class="table"> (everything inside that table).

Right now, I use regular expressions to do this:

string pagesource = (method that extracts the html source and stores it into a string);
string[] splitSource = Regex.Split(pagesource, "<table class=\"member\">");
string[] memberList = Regex.Split(splitSource[1], "</table>");
//the list of table members will be in memberList[0];
//method to extract links from the table
ExtractLinks(memberList[0]);

I have been looking for other ways to perform this extraction, and I came across the Match object in C#.

I am trying to do something like this:

Match match = Regex.Match(pageSource, "<table class=\"members\">(.|\n)*?</table>");

The idea is to extract the matched value between the two delimiters, but when I run it, the match value is:

match.value = </table>

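So my question is: is there a way to extract data from my string that is easier, more readable, or shorter than my regex approach? For this simple example the regex is fine, but for more complex cases I end up with scribbles all over my screen.

I would really like to use Match, because it looks like a very neat class, but I can't seem to get it to do what I need. Can anyone help me with this?

Thanks a lot!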

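【Comments】:

A small tip: the part of the regex between the two table tags should be (.|\n)*?. If you don't put parentheses around .|\n, the *? applies only to the character immediately before it (in this case \n).
Possible duplicate of RegEx match open tags except XHTML self-contained tags
Don't parse HTML with regex
Aren't you missing some <td> tags?
Yes, I typed the HTML without paying attention =p.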
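
To illustrate the tip in the first comment: in a regex, `|` has the lowest precedence, so a pattern written without the parentheses splits into two whole alternatives, and the second one can match a lone closing tag. The sketch below is only an illustration; the class name `AlternationDemo`, the helper `FirstMatch`, and the shortened sample HTML are hypothetical and not part of the original post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Returns the text of the first match, or null when the pattern does not match.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Shortened stand-in for the page source (hypothetical sample).
        string html = "<table class=\"table\">\n  <tr>row</tr>\n  </table>";

        // Without parentheses the pattern means:
        //   <table class="members">.   OR   \n*?</table>
        // The first alternative never matches here, so only the closing tag is found.
        Console.WriteLine(FirstMatch(html, "<table class=\"members\">.|\n*?</table>"));

        // With parentheses (and a class name that exists in the input),
        // the lazy group spans everything up to the first </table>.
        Console.WriteLine(FirstMatch(html, "<table class=\"table\">(.|\n)*?</table>"));
    }
}
```

The first call prints only `</table>`; the second prints the whole table element.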

Tags: c# regex match

【Solution 1】:

Use an HTML parser, such as HTML Agility Pack.

var doc = new HtmlDocument();

using (var wc = new WebClient())
using (var stream = wc.OpenRead(url))
{
    doc.Load(stream);
}

// HtmlAgilityPack exposes the document root as DocumentNode.
var table = doc.DocumentNode.Element("html").Element("body").Element("table");
string tableHtml = table.OuterHtml;

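【Comments】:

I actually tried the HTML Agility Pack, but the lack of documentation is scary! The new download doesn't include a .chm, so for help I'm basically browsing the listing that ships with the download... All in all, it's not a friendly experience!
@gfppaste, you don't really need documentation; the API is very self-explanatory and very similar to LINQ to XML. I learned it through IntelliSense, and it is quite intuitive.

【Solution 2】:

You can use XPath with HtmlAgilityPack: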

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
var elements = doc.DocumentNode.SelectNodes("//table[@class='table']");

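// SelectNodes returns null when nothing matches, so guard before iterating.
if (elements != null)
{
    foreach (var ele in elements)
    {
        MessageBox.Show(ele.OuterHtml);
    }
}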

【Comments】:

【Solution 3】:

You need to add parentheses to your regex to capture the match:

// The outer group captures the table contents; read it from match.Groups[1].Value.
Match match = Regex.Match(pageSource, "<table class=\"members\">((?:.|\n)*?)</table>");

Anyway, it seems that only Chuck Norris can correctly parse HTML with regex.
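
For plain (non-HTML) text, the capture-group idea can be wrapped in a small helper. This is a minimal sketch under stated assumptions, not the answerer's code: the class name `DelimiterExtractor` and the method `Between` are hypothetical, and it swaps the `(.|\n)` construct for `RegexOptions.Singleline`, which lets `.` match newlines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DelimiterExtractor
{
    // Returns the text between the first occurrence of `open` and the next
    // occurrence of `close`, or null when no such pair exists.
    public static string Between(string input, string open, string close)
    {
        // Regex.Escape treats the delimiters as literal text;
        // Singleline makes '.' match '\n' as well.
        string pattern = Regex.Escape(open) + "(.*?)" + Regex.Escape(close);
        Match match = Regex.Match(input, pattern, RegexOptions.Singleline);

        // Group 0 is the whole match; group 1 is the lazily captured content.
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Shortened stand-in for the page source (hypothetical sample).
        string html = "<body>\n<table class=\"table\">\n<tr>row</tr>\n</table>\n</body>";
        string inner = Between(html, "<table class=\"table\">", "</table>");
        Console.WriteLine(inner == null ? "(no match)" : inner.Trim());
    }
}
```

Checking `match.Success` before reading the group avoids silently treating a failed match as an empty result.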

【Comments】: