【Question title】: VB.net link extraction with HtmlAgilityPack
【Posted】: 2013-08-28 01:42:03
【Question】:

I can extract the URL from a simple href tag like this one:

<a href="http://www.samplesite.com">

My question is how to extract the link from an href tag that looks like this:

<a href="http://www.wherecreativitygoestoschool.com/vancouver/left_right/rb_test.htm" onmousedown="return rwt(this,'','','','1','AFQjCNHvlwTxfBVEYcqGUnilAZN0uY2IXw','','0CCsQFjAA','','',event)">
Right Brain vs Left Brain Creativity <em>Test</em> at The Art Institute of <b>...</b></a>

Here is my full code:

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        Dim webSource As String
        ' WebClient is IDisposable, so release it as soon as the download completes.
        Using webClient As New System.Net.WebClient()
            webSource = webClient.DownloadString("http://www.google.com.ph/search?hl=en&as_q=test&as_epq=&as_oq=&as_eq=&as_nlo=&as_nhi=&lr=&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=ctr%3AcountryCA&as_filetype=&as_rights=#as_qdr=all&cr=countryCA&fp=1&hl=en&lr=&q=test&start=20&tbs=ctr:countryCA")
        End Using

        Dim doc = New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(webSource)
        Dim links = GetLinks(doc, "test")
        For Each item In links
            ListBox1.Items.Add(item.ToString())
        Next
    End Sub


    Public Class Link
        Public Sub New(Uri As Uri, Text As String)
            Me.Uri = Uri
            Me.Text = Text
        End Sub
        Public Property Text As String
        Public Property Uri As Uri

        Public Overrides Function ToString() As String
            ' Return the URL directly; using it as a String.Format format string would throw on literal braces.
            Return If(Uri Is Nothing, "", Uri.ToString())
        End Function
    End Class


    Public Function GetLinks(doc As HtmlAgilityPack.HtmlDocument, linkContains As String) As List(Of Link)
        Dim links As New List(Of Link)()
        ' Visit every <a> element; anchors without an href yield an empty string and are skipped.
        For Each node In doc.DocumentNode.Descendants("a")
            Dim url = node.GetAttributeValue("href", "")
            If url.IndexOf(linkContains, StringComparison.OrdinalIgnoreCase) < 0 Then Continue For

            ' Keep only absolute URLs; TryCreate is a Shared method, so call it on the Uri type.
            Dim parsed As Uri = Nothing
            If Uri.TryCreate(url, UriKind.Absolute, parsed) Then
                links.Add(New Link(parsed, node.InnerText.Trim()))
            End If
        Next

        Return links
    End Function

I noticed that my code does not extract links that end with </a>. How can I modify my code so that it extracts links ending with </a>?

【Question discussion】:

    Tags: vb.net href html-agility-pack


    【Solution 1】:

    Use the following code to get every link on the page that has an "href" attribute:

    Dim hNodeCol As HtmlAgilityPack.HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
    

    ...if you still need it, of course ;)
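
    As a minimal sketch of how that XPath call might be used end to end, the module below reads the href of each matched anchor and keeps only absolute URLs. The names LinkExtractionSketch and ExtractAbsoluteHrefs and the sample HTML string are hypothetical and chosen only for illustration; it assumes the HtmlAgilityPack assembly is referenced. Extra attributes such as onmousedown and nested tags inside the anchor should not affect reading the href attribute.

    ```vbnet
    Imports System
    Imports System.Collections.Generic
    Imports HtmlAgilityPack

    Module LinkExtractionSketch
        ' Hypothetical helper: returns the absolute href values of all <a> elements.
        Function ExtractAbsoluteHrefs(html As String) As List(Of String)
            Dim doc As New HtmlDocument()
            doc.LoadHtml(html)

            Dim result As New List(Of String)()
            ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
            Dim anchors As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
            If anchors Is Nothing Then Return result

            For Each anchor As HtmlNode In anchors
                Dim href As String = anchor.GetAttributeValue("href", String.Empty)
                Dim parsed As Uri = Nothing
                ' hrefs that do not parse as absolute URIs are skipped.
                If Uri.TryCreate(href, UriKind.Absolute, parsed) Then
                    result.Add(parsed.ToString())
                End If
            Next
            Return result
        End Function

        Sub Main()
            ' Sample markup (an assumption) with an extra attribute and a nested tag.
            Dim sample As String = "<a href=""http://www.samplesite.com"" onmousedown=""return rwt(this)"">Sample <em>Test</em></a>"
            For Each link In ExtractAbsoluteHrefs(sample)
                Console.WriteLine(link)
            Next
        End Sub
    End Module
    ```

    The Nothing check matters because iterating a missing result would otherwise raise a NullReferenceException on pages with no matching anchors.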

    【Discussion】:
