【问题标题】:Webscraping in VBA where some HTML information has no way to refer to itVBA中的Web Scraping,其中某些HTML信息无法引用
【发布时间】:2019-11-11 23:55:34
【问题描述】:

我从这个 URL https://accessgudid.nlm.nih.gov/devices/10806378034350 抓取了这个 VBA 脚本

我想要下图中的 LOT、SERIAL 和 EXPIRATION 信息,在 HTML 中带有“是”或“否”。

我如何只返回“是”或“否”信息?

Sub GetInnerInformation(HTMLPage As MSHTML.HTMLDocument)
    Dim HTMLResult As MSHTML.IHTMLElement
    Dim HTMLResults As MSHTML.IHTMLElementCollection
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
    Dim RowNum As Long, ColNum As Integer

    Set HTMLResults = HTMLPage.getElementsByClassName("device-attribute")

    For Each HTMLResult In HTMLResults
        If (HTMLResult.innerText Like "*Lot*") = True Then
            Debug.Print HTMLResult.innerText, HTMLResult.outerText, HTMLResult.innerHTML
        End If
    Next HTMLResult

End Sub

在我的即时窗口中,我得到:

Lot or Batch Number:        Lot or Batch Number:        Lot or Batch Number:

所以不要引用 HTML 中的 Yes 或 No。

【问题讨论】:

    标签: html excel vba web-scraping


    【解决方案1】:

    四处寻找,找到了。我不得不对结果进行一些硬编码,但这就是我得到的。如果您找到了更优雅的答案,请告诉我!

       Sub GetInnerInformation(HTMLPage As MSHTML.HTMLDocument)
            Dim HTMLResult As MSHTML.IHTMLElement
            Dim HTMLResults As MSHTML.IHTMLElementCollection
            Dim HTMLRow As MSHTML.IHTMLElement
            Dim HTMLCell As MSHTML.IHTMLElement
            Dim RowNum As Long, ColNum As Integer
            Dim Lot As Boolean
            Dim Serial As Boolean
            Dim Expiration As Boolean
    
            Set HTMLResults = HTMLPage.getElementsByClassName("expandable-device-content")
    
            For Each HTMLResult In HTMLResults
                If (HTMLResult.innerText Like "*Lot or Batch Number*") = True Then
                    Debug.Print HTMLResult.innerText
    
                    If HTMLResult.innerText Like "*Lot or Batch Number: Yes*" Then
                        Lot = True
                    End If
    
                    If HTMLResult.innerText Like "*Lot or Batch Number: No*" Then
                        Lot = False
                    End If
    
                    If HTMLResult.innerText Like "*Serial Number: Yes*" Then
                        Serial = True
                    End If
    
                    If HTMLResult.innerText Like "*Serial Number: No*" Then
                        Serial = False
                    End If
    
                    If HTMLResult.innerText Like "*Expiration Date: Yes*" Then
                        Serial = True
                    End If
    
                    If HTMLResult.innerText Like "*Expiration Date: No*" Then
                        Serial = False
                    End If
    
                    Debug.Print Lot, Serial, Expiration
                End If
            Next HTMLResult
    
        End Sub
    

    【讨论】:

      【解决方案2】:

      HTML 解析器:

      您可以使用 css attribute = value selector 来定位带有 [?] 的 span,它位于感兴趣的 div 之前。然后使用parentElement 爬上共享父级,并使用NextSibling 移动到感兴趣的div。然后,您可以使用getElementsByTagName 获取labels 节点,并循环该节点列表以写出所需的信息。要获取与标签关联的值,您需要再次使用 NextSibling 来处理父级 div 中的 br 子级。

      我使用 xmlhttp 发出比打开浏览器更快的请求。

      Option Explicit   
      Public Sub WriteOutYesNos()
          Dim html As MSHTML.HTMLDocument
      
          Set html = New MSHTML.HTMLDocument
          With CreateObject("MSXML2.XMLHTTP")
              .Open "GET", "https://accessgudid.nlm.nih.gov/devices/10806378034350", False
              .send
              html.body.innerHTML = .responseText
          End With
      
          Dim nodes As Object, i As Long
      
          Set nodes = html.querySelector("[title*='A production identifier (PI) is a variable']").parentElement.NextSibling.getElementsByTagName("LABEL")
      
          For i = 0 To nodes.Length - 3
              With ActiveSheet
                  .Cells(i + 1, 1) = nodes(i).innerText
                  .Cells(i + 1, 2) = nodes(i).NextSibling.NodeValue
              End With
          Next
      End Sub
      

      JSON 解析器:

      数据也可以作为 json 使用,这意味着您可以使用 json 解析器来处理。我使用 jsonconverter.bas 作为 json 解析器来处理响应。从 here 下载原始代码并添加到名为 JsonConverter 的标准模块中。然后您需要转到 VBE > 工具 > 参考 > 添加对Microsoft Scripting Runtime 的引用。从复制的代码中删除顶部的Attribute 行。

      Option Explicit   
      Public Sub WriteOutYesNos()
          Dim json As Object, ws As Worksheet, results(), i As Long, s As String
      
          Set ws = ThisWorkbook.Worksheets("Sheet1")
          results = Array("lotBatch", "serialNumber", "manufacturingDate")
      
          With CreateObject("MSXML2.XMLHTTP")
              .Open "GET", "https://accessgudid.nlm.nih.gov/devices/10806378034350.json", False
              .send
              Set json = JsonConverter.ParseJson(.responseText)
          End With
      
          With ws
              For i = LBound(results) To UBound(results)
                  .Cells(i + 1, 1) = results(i)
                  .Cells(i + 1, 2).Value = IIf(json(results(i)), "Yes", "No")
              Next
          End With
      End Sub
      

      XML 解析器:

      结果也以 xml 形式出现,只要您适当地处理默认命名空间,您就可以使用 xml 解析器对其进行解析:

      Option Explicit
      Public Sub WriteOutYesNos()
          Dim xmlDoc As Object, ws As Worksheet, results(), i As Long
      
          Set xmlDoc = CreateObject("MSXML2.DOMDocument")
          Set ws = ThisWorkbook.Worksheets("Sheet1")
          results = Array("lotBatch", "serialNumber", "manufacturingDate")
      
          With xmlDoc
              .validateOnParse = True
              .setProperty "SelectionLanguage", "XPath"
              .setProperty "SelectionNamespaces", "xmlns:i='http://www.fda.gov/cdrh/gudid'"
              .async = False
      
              If Not .Load("https://accessgudid.nlm.nih.gov/devices/10806378034350.xml") Then
                  Err.Raise .parseError.ErrorCode, , .parseError.reason
                  Exit Sub
              End If
          End With
      
          With ws
              For i = LBound(results) To UBound(results)
                  .Cells(i + 1, 1) = results(i)
                  .Cells(i + 1, 2).Value = IIf(xmlDoc.SelectSingleNode("//i:" & results(i)).Text, "Yes", "No")
              Next
          End With
      End Sub
      

      【讨论】:

      • 所以我正在实现 XML 解析器方法,但在 .send 行
        Sub XMLV3InnerInformation(URL As String, SearchInput As String, Manuf As String) Dim InnerHTMLDoc As MSHTML.HTMLDocument Dim XMLPage As New MSXML2.XMLHTTP60 XMLPage.Open "GET", URL, False Application.Wait ("0:00:01") XMLPage.send Set InnerHTMLDoc = Nothing InnerHTMLDoc.body.innerHTML = XMLPage.responseText XMLV3GetInnerInformation InnerHTMLDoc, SearchInput, Manuf End Sub 出现错误自动化错误,这是怎么回事?在这里似乎也找不到任何东西。
      • 不,我使用的是 xmlhttp60,我发现 DOMDocument 在等待整个文档加载方面比 xmlhttp60 具有更好的错误处理能力。从好的方面来说,我发现问题发生在我满足网站设置的某个请求阈值的情况下。因此,我再次硬编码 sigh 一个修复程序,我将使用 DOMDocument 方法改进它。当前的修复是每插入 30 条记录,我等待 4 秒然后继续(超过 30 条我太接近该阈值)。我也有点时间紧张,我计算出 IE 需要几天才能完成。
      • 听起来你很高兴这种方法有帮助:-)
      猜你喜欢
      • 2019-11-19
      • 2019-11-14
      • 1970-01-01
      • 2020-10-26
      • 2015-10-21
      • 1970-01-01
      • 1970-01-01
      • 2015-02-10
      • 1970-01-01
      相关资源
      最近更新 更多