【Question title】: Why's my xpath returning both tables despite specifying the index?
【Posted】: 2019-09-12 08:10:54
【Question】:

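I'm using the Fuzi Swift library to parse this Hacker News page (item?id=19722704).

I need to extract only the top description of the post, which contains the main post details (i.e. "Maybe HN can help solve this little mystery......low.com/a/55711457/2251982").

(Screenshot attached in the original post.)

Here is my XPath code: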

print("Description: \(String(describing: document.xpath("//*[@id=\"hnmain\"]//tr[2]/td/table[1]//tr[4]/td").first?.rawXML))")

But my output keeps showing both tables, i.e. the post at the top as well as the comments table:

Description: Optional("<td>Maybe HN can help solve this little mystery. The default font sizes in HTML have, since at least 1998 [1], been .83em and .67em for h5 and h6, respectively, making them smaller than normal text by default (1em). This leads to the bizarre situation that without any styling, the h5 and h6 headings are smaller than the text they head!<p>Does anyone know why headings were made smaller than normal text? I bet the answer is buried in some mailing list from the mid 90s, but so far my searches have not been fruitful. Perhaps someone here was around at the time of, or was even involved in, this decision.<p>[1] https://stackoverflow.com/a/55711457/2251982</p></p>\n        <tr style=\"height:10px\"/><tr><td colspan=\"2\"/><td>\n          <form method=\"post\" action=\"comment\"><input type=\"hidden\" name=\"parent\" value=\"19722704\"><input type=\"hidden\" name=\"goto\" value=\"item?id=19722704\"><input type=\"hidden\" name=\"hmac\" value=\"78883e7dccb14e8eed04ba1f3b825085ecd4c545\"><textarea name=\"text\" rows=\"6\" cols=\"60\"/>\n                <br><br><input type=\"submit\" value=\"add comment\"/>\n      </br></br>\n  </input><br><br>\n  <table border=\"0\" class=\"comment-tree\">\n            <tr class=\"athing comtr \" id=\"19725000\"><td>\n            <table border=\"0\">  <tr>    <td class=\"ind\"><img src=\"s.gif\" height=\"1\" width=

Why is the second table being selected as well?

【Comments】:

    Tags: html parsing xpath web-scraping xml-parsing


    【Answer 1】:

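    //td/table[1] means "select every table that is the first table child of its td parent", whereas (//td/table)[1] means "select every table that is a child of a td element, and then, from all of those, select the first". Specifically, the operator x[y] binds more tightly than x/y (or x//y), so x//y[1] means x//(y[1]), not (x//y)[1].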
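The difference in binding can be checked on a tiny document. The following is a minimal sketch, not the asker's setup: it assumes Foundation's `XMLDocument` (macOS, or `FoundationXML` on Linux) instead of Fuzi or Kanna so that it stays self-contained, and the sample markup, the `id` values, and the helper `tableIDs` are all hypothetical names made up for illustration.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLDocument lives here on Linux
#endif

// Hypothetical miniature document: two <td> cells, each holding two tables.
let xml = """
<root>
  <td><table id="a1"/><table id="a2"/></td>
  <td><table id="b1"/><table id="b2"/></td>
</root>
"""

// Hypothetical helper: run an XPath query and collect the id attribute
// of every element it matches, in the order the nodes are returned.
func tableIDs(_ xpath: String) throws -> [String] {
    let document = try XMLDocument(xmlString: xml, options: [])
    return try document.nodes(forXPath: xpath).compactMap { node in
        (node as? XMLElement)?.attribute(forName: "id")?.stringValue
    }
}

// [1] applies to the table step alone: the first table inside each td.
let perParent = try tableIDs("//td/table[1]")

// [1] applies to the whole parenthesised node set: one table in total.
let overall = try tableIDs("(//td/table)[1]")

print("//td/table[1]   ->", perParent)
print("(//td/table)[1] ->", overall)
```

Under XPath semantics the first query should match one table per `td` and the second should match a single table, which is the distinction the answer describes.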

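    【Comments】:

    • I tried using print("Description: \(document.xpath("(//*[@id=\"hnmain\"]//tr[2]/td/table)[1]//tr[4]/td").first?.rawXML)"), but this still gave me both the top and bottom tables. Note that I put the parentheses around the path up to table, just before [1]. Sorry, I'm new to XPath and can't tell what's wrong with this.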
    【Answer 2】:

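    I was able to solve my problem. It turned out that the Fuzi library is buggy and its XPath parser doesn't work properly.

    I switched to the Kanna library, which works well and gives accurate results:

    https://github.com/tid-kijyun/Kanna

    My Kanna code: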

    import Foundation
    import Kanna

    // Fetch the Hacker News item page, then query the parsed HTML with XPath.
    let myRequest = URLRequest(url: URL(string: "https://news.ycombinator.com/item?id=19722704")!)

    let dataTask = URLSession.shared.dataTask(with: myRequest) { data, response, error in

        // Bail out on a transport error or a missing response body.
        guard error == nil, let data = data else {
            return
        }

        if let htmlString = String(data: data, encoding: .utf8), let doc = try? HTML(html: htmlString, encoding: .utf8) {

            // Description of the post at the top of the page.
            for postDescription in doc.xpath("//*[@id=\"hnmain\"]//tr[3]/td/table[1]//tr[4]/td[2]") {
                print("postDescription: \(String(describing: postDescription.content))")
            }

            // Every row of the comment tree.
            for comment in doc.xpath("//table[@class=\"comment-tree\"]//tr") {
                print("Comment: \(String(describing: comment.content))")
            }
        }
    }
    dataTask.resume()
    

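    【Comments】:

    • This isn't a bug in Fuzi. You are using a different XPath query with each library, so of course they produce different results. For Fuzi, the query you used is: //*[@id=\"hnmain\"]//tr[2]/td/table[1]//tr[4]/td For Kanna, the query you used is: //*[@id=\"hnmain\"]//tr[3]/td/table[1]//tr[4]/td[2] I just tried parsing the page with the second query, and both libraries produce the same result.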