Html page parsing using Html Agility Pack

【Question title】: Html page parsing using Html Agility Pack
【Posted】: 2014-08-31 15:25:22
【Question】:

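I am trying to parse the IMDb page using Regex (I know that HAP is better), but my Regex is wrong, so could you advise me how to do this properly with HAP?

This is the part of the page that I want to parse. I need to take 2 numbers from it:

5 out of 5 people (so I need both fives, i.e. two numbers)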

<small>5 out of 5 people found the following review useful:</small>
<br>
<a href="/user/ur1174211/">
<h2>Interesting, Particularly in Comparison With "La Sortie des usines Lumière"</h2>
<b>Author:</b>
<a href="/user/ur1174211/">Snow Leopard</a>
<small>from Ohio</small>
<br>
<small>10 March 2005</small>

Here is my code in C#:

Regex reg1 = new Regex("([0-9]+(out of)+[0-9])");
for (int i = 0; i < number; i++)
        {
            Console.WriteLine("the heading of the movie is {0}", header[i].InnerHtml);
            Match m = reg1.Match(header[i].InnerHtml);

            if (!m.Success)
            {
                return;
            }
            else
            {
                string str1 = m.Value.Split(' ')[0];
                string str2 = m.Value.Split(' ')[3];

                if (!Int32.TryParse(str1, out index1))
                {
                    return;
                }
                if (!Int32.TryParse(str2, out index2))
                {
                    return;
                }
                Console.WriteLine("index1 = {0}", index1);
                Console.WriteLine("index2 = {0}", index2);
            }
        }

Many thanks to everyone who reads this.

【Question comments】:

Tags: c# html regex html-parsing html-agility-pack

【Solution 1】:

Try this. This way you get whole numbers, not just single digits.

    // \d+ rather than \d*: each side of "out of" must contain at least one digit
    Regex reg1 = new Regex(@"(\d+ (out of) \d+)");
    for (int i = 0; i < number; i++)
    {
      Console.WriteLine("the heading of the movie is {0}", header[i].InnerHtml);
      Match m = reg1.Match(header[i].InnerHtml);

      if (!m.Success)
      {
          return;
      }
      else
      {
          Regex reg2 = new Regex(@"\d+");
          m = reg2.Match(m.Value);
          string str1 = m.Value;
          string str2 = m.NextMatch().Value;

          if (!Int32.TryParse(str1, out index1))
          {
              return;
          }
          if (!Int32.TryParse(str2, out index2))
          {
              return;
          }
          Console.WriteLine("index1 = {0}", index1);
          Console.WriteLine("index2 = {0}", index2);
      }
    }
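
The same idea can be written with capture groups, so both numbers come out of a single match and no second regex is needed. This is only a minimal sketch: the class name `ReviewVotes`, the method `TryParse`, and the group names `useful` and `total` are made up for illustration and are not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

static class ReviewVotes
{
    // Hypothetical helper: one pattern, two named groups.
    // \s+ tolerates extra whitespace between the words.
    static readonly Regex Votes =
        new Regex(@"(?<useful>\d+)\s+out\s+of\s+(?<total>\d+)");

    // Returns true and fills both numbers when the text contains "N out of M".
    public static bool TryParse(string text, out int useful, out int total)
    {
        useful = 0;
        total = 0;

        Match m = Votes.Match(text ?? string.Empty);
        if (!m.Success)
        {
            return false;
        }

        // TryParse is kept because a very long digit run could overflow Int32.
        return int.TryParse(m.Groups["useful"].Value, out useful)
            && int.TryParse(m.Groups["total"].Value, out total);
    }
}
```

Inside the loop this would be called as `ReviewVotes.TryParse(header[i].InnerHtml, out index1, out index2)`. Using `continue` instead of `return` when it yields false would skip a heading without votes rather than leave the whole method.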

【Comments】:

【Solution 2】:

If you have the InnerHtml of the small tag, then this can also be used to get the numbers:

var title = "5 out of 5 people found the following review useful:";
var titleNumbers = title.ToCharArray().Where(x => Char.IsNumber(x));

Edit

As @PulseLab suggested, here is another approach:

var sd = title.Split(' ').Where((data) =>
        {
            // keep every token that parses as an integer, including "0"
            int datum;
            return int.TryParse(data, out datum);
        }).ToArray();
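
To see how the two approaches above differ once the counts have more than one digit, here is a small comparison sketch. The class `NumberExtraction` and its method names are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class NumberExtraction
{
    // Per-character filtering, as in the first snippet:
    // multi-digit numbers fall apart into individual digit characters.
    public static char[] DigitChars(string text)
    {
        return text.Where(c => char.IsNumber(c)).ToArray();
    }

    // Per-token parsing, as in the edit:
    // every space-separated token that is an integer is kept whole.
    public static List<int> WholeNumbers(string text)
    {
        var numbers = new List<int>();
        foreach (string token in text.Split(' '))
        {
            int n;
            if (int.TryParse(token, out n))
            {
                numbers.Add(n);
            }
        }
        return numbers;
    }
}
```

For the input "112 out of 504 people found the following review useful:", `DigitChars` returns the six characters 1, 1, 2, 5, 0, 4, while `WholeNumbers` returns the two integers 112 and 504.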

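【Comments】:

I'm not sure this would work if the numbers on IMDb have two digits, e.g. "10 out of 12 people found the following review useful". In that case, wouldn't you end up with an array of 4 digit characters?
The numbers can be really big, like "112 out of 504" etc.; that is how useful the review was for the movie
@Kate21 You call "112 out of 504" big?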
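
Since the question asks about HAP and neither answer uses it, here is a rough sketch of how the two pieces might be combined: Html Agility Pack locates the `<small>` elements, and a regex pulls the numbers out of their text. It assumes the HtmlAgilityPack NuGet package is referenced and that the vote line always sits in a `<small>` element, as in the snippet from the question; the real page may be structured differently. The names `ImdbReviewParser` and `ParseVotes` are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

static class ImdbReviewParser
{
    static readonly Regex Votes = new Regex(@"(\d+)\s+out\s+of\s+(\d+)");

    // Returns one (useful, total) pair for each <small> element
    // whose text contains "N out of M".
    public static List<Tuple<int, int>> ParseVotes(string html)
    {
        var results = new List<Tuple<int, int>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection smalls = doc.DocumentNode.SelectNodes("//small");
        if (smalls == null)
        {
            return results;
        }

        foreach (HtmlNode small in smalls)
        {
            Match m = Votes.Match(small.InnerText);
            int useful, total;
            if (m.Success
                && int.TryParse(m.Groups[1].Value, out useful)
                && int.TryParse(m.Groups[2].Value, out total))
            {
                results.Add(Tuple.Create(useful, total));
            }
            // Elements such as "from Ohio" or the date are simply skipped.
        }
        return results;
    }
}
```

Run against the fragment from the question, this should yield a single pair, (5, 5), because the other two `<small>` elements contain no "out of" text.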