【Question title】: A better solution for Webscraping [closed]
【Posted】: 2015-07-01 19:03:58
【Question】:

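Goal:
Find the sentence "From today's featured article" on the website "http://en.wikipedia.org/wiki/Main_Page" by web scraping with C# code.

Problem:
You retrieve the website's source code into a string value. I believe you can find the sentence "From today's featured article" by looping over substrings, but that feels like an inefficient approach.

Is there a better way to locate the sentence "From today's featured article" in the string input?

Info:
*I'm using C# code in Visual Studio 2013 Community.
*The source code does not work properly. The first three lines are working.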

WebClient w = new WebClient();
string s = w.DownloadString("http://en.wikipedia.org/wiki/Main_Page");
string svar = RegexUtil.MatchKey(input);

static class RegexUtil
{
    static Regex _regex = new Regex(@"$ddd$");

    /// <summary>
    /// This returns the key that is matched within the input.
    /// </summary>
    public static string MatchKey(string input)
    {
        //Match match = Regex.Match(input, @"From today's featured article", RegexOptions.IgnoreCase);

        Match match = _regex.Match(input);
        //  Match match = regex.Match("Dot 55 Perls");

        if (match.Success)
        {
            return match.Groups[1].Value;
        }
        else
        {
            return null;
        }
    }
}

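【Comments】:

Don't use regex to parse HTML; use HtmlAgilityPack.
This is not homework. You use regex to validate or match the code "From today's featured article" against input data that contains a lot of data.
The first three lines are WebClient w = new WebClient(); string s = w.DownloadString("en.wikipedia.org/wiki/Main_Page"); string svar = RegexUtil.MatchKey(input);

Tags: c# web-scraping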

【Solution 1】:

If all you want is to find an occurrence of that string, this is all you need:

int pos = html.IndexOf("From today's featured article");

Note, however, that this can find the string inside quotes or markup, not just in the visible text.
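To make that caveat concrete, here is a minimal sketch of the IndexOf approach. The class name PhraseSearch, the helper Find, and the sample markup are made up for illustration; in the question, the string would come from w.DownloadString(...) instead. The StringComparison argument is my addition, so that the match ignores case and does not depend on the current culture.

```csharp
using System;

static class PhraseSearch
{
    // Returns the index of the first occurrence of phrase in html,
    // or -1 if the phrase does not occur at all.
    public static int Find(string html, string phrase)
    {
        return html.IndexOf(phrase, StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        // Hypothetical stand-in for the downloaded page source.
        string html =
            "<h2 title=\"From today's featured article\">" +
            "<span>From today's featured article</span></h2>";

        int pos = Find(html, "From today's featured article");

        // IndexOf signals "not found" with -1, so test for that explicitly.
        Console.WriteLine(pos >= 0 ? "Found at index " + pos : "Not found");
    }
}
```

With this sample input, the first hit is the copy inside the title attribute, not the visible text in the span, which is exactly the kind of match the note above warns about.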

To search only the visible text, you need to parse the HTML to strip out all the tags, and then search the text in between.
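One possible way to do that parsing is the HtmlAgilityPack library mentioned in the comments on the question. The following is only a rough sketch, assuming the HtmlAgilityPack NuGet package is installed; the class name VisibleTextSearch and the method ContainsVisibleText are hypothetical names for illustration. It approximates "visible text" as all text nodes outside script and style elements, so it will not account for elements hidden by CSS.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

static class VisibleTextSearch
{
    // Returns true if phrase occurs in the text content of the document,
    // ignoring tag names, attribute values, scripts, and style sheets.
    public static bool ContainsVisibleText(string html, string phrase)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard against that.
        var hidden = doc.DocumentNode.SelectNodes("//script|//style");
        if (hidden != null)
        {
            // Copy to a separate list before removing nodes from the tree.
            foreach (var node in new List<HtmlNode>(hidden))
            {
                node.Remove();
            }
        }

        // InnerText concatenates the remaining text nodes;
        // DeEntitize decodes entities such as &amp; or &#39;.
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
        return text.IndexOf(phrase, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

A call such as VisibleTextSearch.ContainsVisibleText(s, "From today's featured article"), with s being the downloaded page source, should then ignore occurrences that appear only inside attribute values.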

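【Discussion】:

Can "html.IndexOf" be used as C# code? In my case, I'm using C# in Visual Studio.
I don't know what you mean by "as C# code". I posted C# code. Have you tried it?
C# syntax. I haven't tried the code yet. If it works, thanks!
Your question has the c# tag, so I posted C# code.
I see what you're saying.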