【Question title】: Retrieve attributes and span using HTMLAgilityPack library
【Posted】: 2013-12-24 04:50:31
【Question】:

Given this HTML:

<div class="item">

    <div class="thumb">
        <a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" rel="bookmark" lang="en" title="Wolf Eyes - Lower Demos album downloads">
        <img width="100" height="100" alt="Mp3 downloads Wolf Eyes - Lower Demos" title="Free mp3 downloads Wolf Eyes - Lower Demos" src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg" /></a>
    </div>

    <div class="release">
        <h3>Wolf Eyes</h3>
        <h4>
        <a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" title="Wolf Eyes - Lower Demos">Lower Demos</a>
        </h4>
        <script src="/ads/button.js"></script>
    </div>

    <div class="release-year">
        <p>Year</p>
        <span>2013</span>
    </div>

    <div class="genre">
        <p>Genre</p>
        <a href="http://www.mp3crank.com/genre/rock" rel="tag">Rock</a>
        <a href="http://www.mp3crank.com/genre/pop" rel="tag">Pop</a>
    </div>

</div>

I know how to parse it in other ways, but I would like to use the HTMLAgilityPack library to retrieve this information:

Title : Wolf Eyes - Lower Demos
Cover : http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg
Year  : 2013
Genres: Rock, Pop
URL   : http://www.mp3crank.com/wolf-eyes/lower-demos-121866

These values come from the following parts of the HTML:

Title : title="Wolf Eyes - Lower Demos"
Cover : src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg"
Year  : <span>2013</span>
Genre1: <a href="http://www.mp3crank.com/genre/rock" rel="tag">Rock</a>
Genre2: <a href="http://www.mp3crank.com/genre/pop" rel="tag">Pop</a>
URL   : href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" 

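This is what I am trying, but I always get an "object reference not set" exception when I try to select a single node. Sorry, I am new to HTML. I tried to follow the steps from this question: HtmlAgilityPack basic how to get title and link?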

Public Class Form1

    Private htmldoc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument
    Private htmlnodes As HtmlAgilityPack.HtmlNodeCollection = Nothing

    Private Title As String = String.Empty
    Private Cover As String = String.Empty
    Private Genres As String() = {String.Empty}
    Private Year As Integer = -0
    Private URL as String = String.Empty

    Private Sub Test() Handles MyBase.Shown

        ' Load the html document.
        htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))

        ' Select the (10 items) nodes.
        htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")

        ' Loop through the nodes.
        For Each node As HtmlAgilityPack.HtmlNode In htmlnodes

            Title = node.SelectSingleNode("//div[@class='release']").Attributes("title").Value
            Cover = node.SelectSingleNode("//div[@class='thumb']").Attributes("src").Value
            Year = CInt(node.SelectSingleNode("//div[@class='release-year']").Attributes("span").Value)
            Genres = ¿select multiple nodes?
            URL = node.SelectSingleNode("//div[@class='release']").Attributes("href").Value

        Next

    End Sub

End Class

【Comments】:

    Tags: html .net vb.net html-parsing html-agility-pack


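    【Answer 1】:

    Your mistake is trying to read an attribute of a child element from the node you found.

    When you call node.SelectSingleNode("//div[@class='release']"), the correct div is returned, but calling .Attributes on it returns only the attributes of the div tag itself, not those of any HTML elements inside it.

    You can write an XPATH query that selects the inner element directly, e.g. //div[@class='release']//a (in this HTML the a element sits inside an h4, so a direct-child step such as /a would not match it). For more about XPATH, see http://www.w3schools.com/xpath/xpath_syntax.asp. Although those examples are for XML, most of the principles should apply to an HTML document.

    Another approach is to make further XPATH calls on the node you found. I have modified your code to work using this approach: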

    ' Load the html document.
    htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
    
    ' Select the (10 items) nodes.
    htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
    
    ' Loop through the nodes.
    For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
    
        Dim releaseNode = node.SelectSingleNode(".//div[@class='release']")
        ' Assumes the release node is found and contains an a tag.
        Title = releaseNode.SelectSingleNode(".//a").Attributes("title").Value
        URL = releaseNode.SelectSingleNode(".//a").Attributes("href").Value
    
        Dim thumbNode = node.SelectSingleNode(".//div[@class='thumb']")
        Cover = thumbNode.SelectSingleNode(".//img").Attributes("src").Value
    
        Dim releaseYearNode = node.SelectSingleNode(".//div[@class='release-year']")
        Year = CInt(releaseYearNode.SelectSingleNode(".//span").InnerText)
    
        Dim genreNode = node.SelectSingleNode(".//div[@class='genre']")
        Dim genreLinks = genreNode.SelectNodes(".//a")
        Genres = (From n In genreLinks Select n.InnerText).ToArray()
    
        Console.WriteLine("Title : {0}", Title)
        Console.WriteLine("Cover : {0}", Cover)
        Console.WriteLine("Year  : {0}", Year)
        Console.WriteLine("Genres: {0}", String.Join(", ", Genres))
        Console.WriteLine("URL   : {0}", URL)
    
    Next
    

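    Note that this code assumes the document is well formed and that every node, element, and attribute exists and is correct. You will probably want to add a lot of error checking, e.g. If someNode Is Nothing Then ....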
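One way to add that error checking is sketched below. AttributeOrDefault is a hypothetical helper written for this example, not part of HtmlAgilityPack, and the inline HTML and fallback strings are made up. It relies on SelectSingleNode and SelectNodes returning Nothing when nothing matches, which is the library's default behaviour as far as I know.

```vbnet
Imports System
Imports HtmlAgilityPack

Module SafeSelectSketch

    ' Hypothetical helper (not part of HtmlAgilityPack): returns the named
    ' attribute of the first node matching xpath, or fallback if no node
    ' matches or the attribute is missing.
    Function AttributeOrDefault(parent As HtmlNode, xpath As String,
                                attributeName As String, fallback As String) As String
        Dim found As HtmlNode = parent.SelectSingleNode(xpath)
        If found Is Nothing Then Return fallback
        Return found.GetAttributeValue(attributeName, fallback)
    End Function

    Sub Main()
        Dim doc As New HtmlDocument()
        ' Made-up input: one item with a release div but no thumb div.
        doc.LoadHtml("<div class='item'><div class='release'><h4><a href='/x' title='T'>X</a></h4></div></div>")

        ' Guard the collection too: with default options it is Nothing when there is no match.
        Dim items As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='item']")
        If items Is Nothing Then Return

        For Each item As HtmlNode In items
            Console.WriteLine(AttributeOrDefault(item, ".//div[@class='release']//a", "title", "(no title)"))
            Console.WriteLine(AttributeOrDefault(item, ".//div[@class='thumb']//img", "src", "(no cover)"))
        Next
    End Sub

End Module
```

With this input the second call has no thumb div to find, so it should fall back to "(no cover)" instead of throwing.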

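    Edit: I have slightly modified the code above so that every .SelectSingleNode call uses the ".//" prefix. This makes it work when there are multiple "item" nodes. Without the prefix, the query selects the first match from the whole document rather than from the current node.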
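The difference between the two prefixes can be seen with a small two-item document. This is only a sketch: the inline HTML is invented for the demonstration and reuses the class names from the question.

```vbnet
Imports System
Imports HtmlAgilityPack

Module RelativeXPathDemo
    Sub Main()
        Dim doc As New HtmlDocument()
        ' Two made-up items with different years.
        doc.LoadHtml(
            "<div class='item'><div class='release-year'><span>2013</span></div></div>" &
            "<div class='item'><div class='release-year'><span>2012</span></div></div>")

        For Each item As HtmlNode In doc.DocumentNode.SelectNodes("//div[@class='item']")
            ' "//" searches from the document root, even when called on item.
            Dim fromRoot As String = item.SelectSingleNode("//div[@class='release-year']/span").InnerText
            ' ".//" searches only below the current item.
            Dim fromItem As String = item.SelectSingleNode(".//div[@class='release-year']/span").InnerText
            Console.WriteLine("//: {0}   .//: {1}", fromRoot, fromItem)
        Next
    End Sub
End Module
```

For the second item, the "//" query should still report 2013 (the first match in the document), while the ".//" query reports 2012.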

    If you prefer a shorter solution that does the work in single XPATH queries, here is the same code using that approach:

    ' Load the html document.
    htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))
    
    ' Select the (10 items) nodes.
    htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")
    
    ' Loop through the nodes.
    For Each node As HtmlAgilityPack.HtmlNode In htmlnodes
    
        Title = node.SelectSingleNode(".//div[@class='release']/h4/a[@title]").Attributes("title").Value
        URL = node.SelectSingleNode(".//div[@class='release']/h4/a[@href]").Attributes("href").Value
    
        Cover = node.SelectSingleNode(".//div[@class='thumb']/a/img[@src]").Attributes("src").Value
    
        Year = CInt(node.SelectSingleNode(".//div[@class='release-year']/span").InnerText)
    
        Dim genreLinks = node.SelectNodes(".//div[@class='genre']/a")
        Genres = (From n In genreLinks Select n.InnerText).ToArray()
    
        Console.WriteLine("Title : {0}", Title)
        Console.WriteLine("Cover : {0}", Cover)
        Console.WriteLine("Year  : {0}", Year)
        Console.WriteLine("Genres: {0}", String.Join(", ", Genres))
        Console.WriteLine("URL   : {0}", URL)
        Console.WriteLine()
    
    Next
    

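    【Comments】:

    • Thanks. So, if I understand correctly, is the second version safer and faster? Which one should I use? I have 10 item nodes/blocks, meaning 10 titles/covers/URLs to retrieve.
    • I am not sure which is better. The second example requires good XPATH knowledge and is tied more tightly to the HTML structure. For example, if the title changed from an h4 tag to an h5, the second example would fail but the first would still work.
    • Oh, and if speed is a concern, I suggest running some timing experiments: create a large HTML document and time both versions with the Stopwatch class (in System.Diagnostics) to see which is faster.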
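A rough version of that timing experiment might look like the sketch below. TimeQuery is a hypothetical helper, and the 5000-item synthetic document and the two XPath strings are assumptions chosen for illustration. Results will vary by machine and library version, so treat the numbers as a comparison only.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Text
Imports HtmlAgilityPack

Module XPathTimingSketch

    ' Hypothetical helper: runs one XPath query against every item
    ' and returns the elapsed time in milliseconds.
    Function TimeQuery(items As HtmlNodeCollection, xpath As String) As Long
        Dim watch As Stopwatch = Stopwatch.StartNew()
        For Each item As HtmlNode In items
            Dim link As HtmlNode = item.SelectSingleNode(xpath)
        Next
        watch.Stop()
        Return watch.ElapsedMilliseconds
    End Function

    Sub Main()
        ' Build a large synthetic document shaped like the question's HTML.
        Dim html As New StringBuilder()
        For i As Integer = 1 To 5000
            html.Append("<div class='item'><div class='release'><h4>")
            html.AppendFormat("<a href='/album/{0}' title='Album {0}'>Album {0}</a>", i)
            html.Append("</h4></div></div>")
        Next

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html.ToString())
        Dim items As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='item']")

        Console.WriteLine("descendant search: {0} ms", TimeQuery(items, ".//div[@class='release']//a"))
        Console.WriteLine("explicit path:     {0} ms", TimeQuery(items, "div[@class='release']/h4/a"))
    End Sub

End Module
```

Running each query several times and discarding the first run would give steadier numbers, since the first call may include one-time setup costs.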
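    【Answer 2】:

    You are not far from the solution. Two important notes:

    • // is a recursive (descendant) search. It can have a significant performance impact and may select nodes you do not want, so I suggest using it only when the hierarchy is deep, complex, or variable and you do not want to spell out the whole path.
    • HtmlNode has a useful helper method named GetAttributeValue that returns a value even if the attribute does not exist (you specify the default value to return in that case).

    Here is a sample that seems to work: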

    ' Requires "Imports HtmlAgilityPack" at the top of the file.
    ' Select the base/parent DIV (identified here by its CLASS attribute).
    ' All select calls below use this DIV element as their starting point.
    ' SelectSingleNode returns only the first item; loop over SelectNodes to process all of them.
    Dim node As HtmlNode = htmldoc.DocumentNode.SelectSingleNode("//div[@class='item']")
    
    ' Get the A tag that is a descendant (//) of the 'release' DIV.
    Console.WriteLine("Title :" & node.SelectSingleNode("div[@class='release']//a").GetAttributeValue("title", CStr(Nothing)))
    
    ' Get the IMG tag that is a descendant (//) of the 'thumb' DIV.
    Console.WriteLine("Cover :" & node.SelectSingleNode("div[@class='thumb']//img").GetAttributeValue("src", CStr(Nothing)))
    
    ' Get the SPAN tag that is a descendant (//) of the 'release-year' DIV.
    Console.WriteLine("Year  :" & node.SelectSingleNode("div[@class='release-year']//span").InnerText)
    
    ' Get all A elements that are descendants (//) of the 'genre' DIV.
    Dim nodes As HtmlNodeCollection = node.SelectNodes("div[@class='genre']//a")
    For i As Integer = 0 To nodes.Count - 1
        Console.WriteLine(String.Format("Genre{0}:{1}", i + 1, nodes(i).InnerText))
    Next
    
    ' Get the A tag that is a descendant (//) of the 'release' DIV.
    Console.WriteLine("Url   :" & node.SelectSingleNode("div[@class='release']//a").GetAttributeValue("href", CStr(Nothing)))
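
To illustrate why GetAttributeValue helps, here is a small sketch comparing it with indexing the Attributes collection. The single-link HTML and the "(none)" default are assumptions made for the example.

```vbnet
Imports System
Imports HtmlAgilityPack

Module AttributeDefaultDemo
    Sub Main()
        Dim doc As New HtmlDocument()
        ' A genre link from the question: it has href and rel, but no title.
        doc.LoadHtml("<a href='http://www.mp3crank.com/genre/rock' rel='tag'>Rock</a>")
        Dim link As HtmlNode = doc.DocumentNode.SelectSingleNode("//a")

        ' The attribute exists, so its value is returned.
        Console.WriteLine(link.GetAttributeValue("href", "(none)"))

        ' There is no title attribute: the default comes back instead of an exception.
        Console.WriteLine(link.GetAttributeValue("title", "(none)"))

        ' Indexing Attributes yields Nothing for a missing attribute, so reading
        ' .Value on that result would throw a NullReferenceException.
        Console.WriteLine(link.Attributes("title") Is Nothing)
    End Sub
End Module
```

This is the same failure mode as in the question, where .Attributes("title") was read on a div that has no such attribute.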
    

    【Comments】:

    • Thanks a lot, "GetAttributeValue" will save me from having to deal with missing attributes.