【问题标题】:Scraping data from website with dynamic array function in vba使用vba中的动态数组函数从网站抓取数据
【发布时间】:2022-01-14 19:39:56
【问题描述】:

我想了解更多有关从网站抓取数据时如何应用数组函数的信息。我目前正在使用这个 vba 从网站复制数据。该代码可以抓取我想要的数据,但是在将数据复制到目标工作表时,它将所有数据复制到A1 单元格。由于这个 vba 是为我之前的项目开发的并且工作正常,我不确定哪个部分出了问题。

Sub CopyFromHKAB()
    Dim ie As Object, btnmore As Object, tbl As Object
    Dim rr As Object, cc As Object, r As Integer, c As Integer, i As Integer, j As Integer
    
    ThisWorkbook.Sheets("data").UsedRange.Clear
    
    Set ie = CreateObject("internetexplorer.application")
    With ie
        .Visible = True
        .navigate "https://www.hkab.org.hk/DisplayMemberAction.do?sectionid=4&subsectionid=0"
        
        Do
            DoEvents
        Loop While .readyState <> 4 Or .Busy
          
        
        Set tbl = .document.getElementsByClassName("etxtmed")(2)
            
    End With
    
    'get data from table
    r = tbl.Rows.Length - 1
    c = tbl.Rows(0).Cells.Length - 1
    
    ReDim arr(0 To r, 0 To c)
    
    Set rr = tbl.Rows
    For i = 0 To r
        Set cc = rr(i).Cells
        For j = 0 To c
            arr(i, j) = cc(j).innertext
        Next
    
    Next
    
    ie.Quit
  
    Application.ScreenUpdating = False
    ThisWorkbook.Sheets("Sheet1").Cells(1, 1).Resize(r + 1, c + 1) = arr
    
    With ThisWorkbook.Sheets("data")
        .UsedRange.WrapText = False
        .Columns.AutoFit
    End With
    
End Sub

【问题讨论】:

    标签: html excel vba web-scraping


    【解决方案1】:

    鉴于它们是嵌套的,因此您需要选择正确的表,因此将索引更改为 3。否则,您将选择共享父项,因此所有列表实际上都在一个子元素中,因此是您当前的输出。

    那么你需要调整你的代码来跳过第一行。

    注意您实际上并不需要 IE,因为您想要的内容是静态的。您可以使用 XMLHTTP。而且您正在将数据写到与您结束格式不同的工作表中。

    Sub CopyFromHKAB()
        Dim ie As Object, btnmore As Object, tbl As Object
        Dim rr As Object, cc As Object, r As Integer, c As Integer, i As Integer, j As Integer
        
        ThisWorkbook.Sheets("data").UsedRange.Clear
        
        Set ie = CreateObject("internetexplorer.application")
        With ie
            .Visible = True
            .navigate "https://www.hkab.org.hk/DisplayMemberAction.do?sectionid=4&subsectionid=0"
            
            Do
                DoEvents
            Loop While .readyState <> 4 Or .Busy
              
            
            Set tbl = .document.getElementsByClassName("etxtmed")(3)
                    
        End With
        
        'get data from table
        r = tbl.Rows.Length - 1
        c = tbl.Rows(1).Cells.Length - 1
        
        ReDim arr(0 To r, 0 To c)
    
        Set rr = tbl.Rows
    
        For i = 1 To r
        
            Set cc = rr(i).Cells
            For j = 0 To c
                arr(i - 1, j) = cc(j).innertext
            Next
        
        Next
        
        ie.Quit
      
        'Application.ScreenUpdating = False
        ThisWorkbook.Sheets("Sheet1").Cells(1, 1).Resize(r + 1, c + 1) = arr
    
        With ThisWorkbook.Worksheets("data")
            .UsedRange.WrapText = False
            .Columns.AutoFit
        End With
        
    End Sub
    

    我会考虑切换到 XHR 以避免浏览器开销,并使用 querySelectorAll 允许使用 css 选择器列表来仅定位感兴趣的节点

    Option Explicit
    
    Public Sub GetHKABInfo()
        'tools > references > Microsoft HTML Object Library
        Dim html As MSHTML.HTMLDocument, xhr As Object
         
        Set xhr = CreateObject("MSXML2.XMLHTTP")
        Set html = New MSHTML.HTMLDocument
         
        With xhr
            .Open "GET", "https://www.hkab.org.hk/DisplayMemberAction.do?sectionid=4&subsectionid=0", False
            .setRequestHeader "User-Agent", "Safari/537.36"
            .send
            html.body.innerHTML = .responseText
        End With
        
        Dim arr() As Variant, nodes As MSHTML.IHTMLDOMChildrenCollection, i As Long
        
        Set nodes = html.querySelectorAll(".etxtmed .etxtmed td")
        
        ReDim arr(1 To nodes.Length - 1)
        
        For i = LBound(arr) To UBound(arr)
            arr(i) = nodes.Item(i).innertext
        Next
        
        ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).Resize(UBound(arr), 1) = Application.Transpose(arr)
        
    End Sub
    

    【讨论】:

    • 感谢您的回答!我把arr(i - 1, j)改回arr(i , j),否则代码会报Run-time error '9': Subscript out of range error。然后代码运行成功。原来我刮错了桌子。一开始我以为是我用的数组函数错了
    • 奇数。不会给我一个下标超出范围的错误,但也许您使用的是不同的 Excel 版本,并且不知何故我们有不同的计数。
    • 顺便说一句,您提到不需要为此使用 IE,您的意思是 XMLHTTP 方法吗?
    • 是的。根据底部代码版本 :-) 我已对其进行了编辑以使其更清晰。道歉。
    • 刚刚测试了XMLHTTP 方法。它比 IE 方法快得令人难以置信!我对网络抓取很陌生,上一个项目涉及一些按钮点击动作的模拟,所以我选择了 IE 方法。这个练习让我熟悉了 IE 方法,对于XMLHTTP,我是全新的。另外你能解释一下最后一行代码ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).Resize(UBound(arr), 1) = Application.Transpose(arr)。我删除了Application.Transpose,代码只复制了第一行数据。非常感谢!
    猜你喜欢
    • 1970-01-01
    • 2015-01-19
    • 1970-01-01
    • 1970-01-01
    • 2019-01-18
    • 1970-01-01
    • 1970-01-01
    • 2021-01-23
    相关资源
    最近更新 更多