【问题标题】:Is there a way to slow down a Web Scraper so it will pick up the code?有没有办法减慢 Web Scraper 的速度,以便它可以拾取代码?
【发布时间】:2019-07-07 14:52:37
【问题描述】:

我写了一个宏去 WU 来获取历史数据,而且大部分情况下,它可以工作。但是,我认为宏运行速度太快,无法从网站上获取数据。

https://www.wunderground.com/history/daily/us/tx/el-paso/KELP/date/2017-1-3 我想要获取的网站和表格是否可以进行tablesaw-sortable。

我尝试了以下方法:DoEventsApplication.Wait (Now + TimeValue("00:00:01")) 尝试减慢进程。

Sub BrowseToWU()

    Dim IE As New SHDocVw.InternetExplorer
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim RowAddress   As Integer
    Dim WebAddress As String
    Dim DateSheet As Date
    Dim WkDay As Integer
    Dim DateSheetName As String

    'Application.ScreenUpdating = False
    'Application.StatusBar = True
    RowAddress = 2
    IE.Visible = True
    Do Until RowAddress = 60

    WebAddress = Range("A" & RowAddress)
    DateSheet = Right(WebAddress, 8)
    DateSheetName = Right(WebAddress, 8)
    WkDay = Weekday(DateSheet, vbSunday)

    If WkDay < 3 Then
        RowAddress = RowAddress + 1

        ElseIf WkDay > 6 Then
            RowAddress = RowAddress + 1

        Else

        IE.Navigate WebAddress

            Do While IE.ReadyState <> READYSTATE_COMPLETE
            Loop


            Set HTMLDoc = IE.Document
            DoEvents

            Application.Wait (Now + TimeValue("00:00:05"))
            DoEvents

            ProcessHTMLPage HTMLDoc

            DateSheet = Right(WebAddress, 8)
            DoEvents
            Application.Wait (Now + TimeValue("00:00:01"))
            ActiveSheet.Name = DateSheetName

            DoEvents

            RowAddress = RowAddress + 1
            'IE.Quit

            Worksheets("Sheet1").Activate
        End If

    Loop

End Sub
Option Explicit

Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

    Dim HTMLTable As MSHTML.IHTMLElement
    Dim HTMLTables As MSHTML.IHTMLElementCollection
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
    Dim RowNum As Long, ColNum As Integer
    'Dim IE As New SHDocVw.InternetExplorer
    'Dim Ws As Worksheet

    Set HTMLTables = HTMLPage.getElementsByClassName("tablesaw-sortable")
    'DoEvents

    For Each HTMLTable In HTMLTables

        Worksheets.Add
        DoEvents

        Range("A1").Value = HTMLTable.className
        Range("B1").Value = Now

        RowNum = 2

        For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
            'Debug.Print vbTab & HTMLRow.innerText

            ColNum = 1
            For Each HTMLCell In HTMLRow.Children
                Cells(RowNum, ColNum) = HTMLCell.innerText
                ColNum = ColNum + 1


            Next HTMLCell
                RowNum = RowNum + 1
        Next HTMLRow
    Next HTMLTable
    DoEvents

    'IE.Quit

End Sub
  1. 如果满足一周中某一天的条件,该宏应该通过 sheet1 获取历史数据的网址。

  2. IE 将打开,然后跳转到下一个接收数据的模块。

  3. 创建一个新工作表并将数据粘贴到新工作表中。

  4. 工作表重命名为数据的日期。

  5. 网址表再次激活,流程重新开始。

我得到的错误是数据不是取自网站,所以For语句结束,网址表被重命名,出现错误。

【问题讨论】:

  • 哪一行出现错误,错误信息是什么?是否可以将上面的代码减少到仅足以重现问题的行数?
  • 我相信 ProcessHTMLPage HTMLDoc 启动时会发生错误。当它切换到该子程序时,什么也没有被拾取。我不确定是否可以减少代码。
  • 所以没有错误信息?如果出现错误消息,则单击调试时应突出显示代码行(取决于您的错误处理设置)。
  • 从使用适当的页面加载等待开始。 While ie.Busy 或 ie.readyState
  • 这一切都取决于研究所需的数据。可能有一个月的周二至周四或一整年。

标签: excel vba


【解决方案1】:

解决此问题的一种方法是调用页面用于获取该信息的 API。

API 返回 json,您可以使用 json 解析器对其进行解析。我使用jsonconverter.bas。在名为 JsonConverter 的标准模块中安装该链接中的代码后,转到 VBE > 工具 > 参考 > 添加对 Microsoft Scripting Runtime 的引用。


寻找 API:

如果您按 F12 打开开发人员工具并转到 Network 选项卡,然后按 F5 刷新任何感兴趣的 url,您将看到记录的网络流量.您可以在那里找到 API 调用。

请参阅我的回答here,了解如何使用您希望在响应中看到的特定观察值搜索网络流量 - 这会将网络流量列表过滤到包含感兴趣值的那些项目。在选择值时要明智——您希望在其他地方不太可能发生的事情。您还可以将网络流量过滤到仅XHR


API 响应:

API 返回 json。更具体地说,它返回一个包含 2 个键的字典。第二个键“observations”可用于返回字典(用{} 表示)的集合(用[] 表示)。 每个字典代表表格的一行(每日观察)。您可以循环此集合,然后循环内部字典,以访问表行值并通过填充数组来重建表。探索示例 json 响应 here


json结构说明:

点击here放大


代码说明:

代码被分解成许多辅助子和函数,为每个子分配特定的任务, 使代码更易于调试和遵循,并更好地符合面向对象的编程原则。

整个过程是:

  1. 收集Worksheet("Sheet1") 的网址。辅助函数GetAllUrls
  2. 处理这些网址并仅保留与周二至周四相对应的日期。这些保存为格式为"yyyymmdd" 的字符串,因此可以稍后传递给API。这由辅助函数 GetOnlyQualifyingUrlsDatesIncludeThisDate 处理。 IncludeThisDate 进行是否包含的检查; GetOnlyQualifyingUrlsDates 处理结果的循环和格式化。
  3. 通过循环限定 url 日期并将这些日期连接到 API 调用的 url 来发出 xmlhttp 请求,然后发出请求。这是由主子GetTables 执行的。
  4. 用于输出的工作表创建由辅助函数CreateWorksheet 处理。此函数调用另一个辅助函数SheetExists,以确保仅在工作表不存在时才创建工作表,否则使用该名称的现有工作表。
  5. 从第 3 步得到的 json 响应被传递给辅助子 WriteOutResults,它接受 json 变量和输出工作表对象作为参数。它从 json 响应中提取所有信息;本质上是重建表。它将表格和标题添加到适当的工作表中。 它调用辅助函数Epoch2Date,该函数处理json对象中两个unix字段的unix时间戳到日期时间的转换。

待办事项:

  1. API 密钥可能有时间限制。添加一个返回当前有效密钥的辅助函数。
  2. API 接受 url 结构中的开始日期和结束日期参数。如果可能的话,最好为整个范围发出一个请求,或者分块范围,例如月,以减少请求的数量。这也将减少被阻止的可能性。这意味着在写出结果之前需要编写一些额外的代码,以确保只将感兴趣的日期写入工作表。虽然你可以写出所有然后简单地循环所有工作表并删除那些不需要的工作表(如果我们谈论总共 365 个日期,这是完全可行的)。就个人而言,我会从单个请求(如果可能)处理表构建中的包含日期部分,该请求将列出的整个 url 的最小和最大日期作为开始和结束日期参数传递。然后我会在一张纸上写一个平面表格,因为这对于以后的数据分析会容易得多。

VBA:

Option Explicit

Public Sub GetTables()
    'VBE > Tools > References > Microsoft Scripting Runtime
    Dim json As Object, qualifyingUrlsDates(), urls(), url As String
    Dim ws As Worksheet, wsOutput As Worksheet, i As Long, startDate As String, endDate As String

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    urls = GetAllUrls(2, ws, "A")
    qualifyingUrlsDates = GetOnlyQualifyingUrlsDates(urls)

    'API key may be not be valid over time so look at obtaining by prior request

    With CreateObject("MSXML2.XMLHTTP")          'issue xmlhttp request for each valid date (this would be better done using start and enddate to specify entire range _
                                                 of batches e.g. months within total range to cut down on requests
        For i = LBound(qualifyingUrlsDates) To UBound(qualifyingUrlsDates)
            startDate = qualifyingUrlsDates(i)
            endDate = startDate                 ' a little verbose but useful for explaining
            url = "https://api.weather.com/v1/geocode/31.76/-106.49/observations/historical.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&startDate=" & startDate & "&endDate=" & endDate & "&units=e"
            .Open "GET", url, False
            .send
            Set json = JsonConverter.ParseJson(.responseText)("observations")
            Set wsOutput = CreateWorksheet(qualifyingUrlsDates(i))
            WriteOutResults wsOutput, json
        Next
    End With
End Sub

Public Sub WriteOutResults(ByVal wsOutput As Worksheet, ByVal json As Object)
'json is a collection of dictionaries. Each dictionary is a time period reading from the day i.e. one row in output
    Dim results(), item As Object, headers(), r As Long, c As Long, key As Variant
    headers = json.item(1).keys 'get the headers which are the keys of each dictionary
    ReDim results(1 To json.Count, 1 To UBound(headers) + 1)
    For Each item In json
        r = r + 1: c = 0 'increase row in results array to store results for table row
        For Each key In item.keys
            c = c + 1 'increase column number in results array for writing out results
            Select Case key
            Case "valid_time_gmt", "expire_time_gmt" 'convert unix timestamp fields to datetime
                results(r, c) = Epoch2Date(item(key))
            Case Else
                results(r, c) = item(key)
            End Select
        Next
    Next
    With wsOutput
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Public Function GetOnlyQualifyingUrlsDates(ByVal urls As Variant) As Variant
    Dim i As Long, output(), counter As Long
    ReDim output(1 To UBound(urls))

    For i = LBound(urls) To UBound(urls)
        If IncludeThisDate(urls(i)) Then 'check if weekday is to be included
            counter = counter + 1
            output(counter) = Format$(Right$(urls(i), 8), "yyyymmdd") 'if to include then add to output array of urls of interest
        End If
    Next
    ReDim Preserve output(1 To counter)
    GetOnlyQualifyingUrlsDates = output
End Function

Public Function IncludeThisDate(ByVal url As String) As Boolean
    'tue, wed, thurs are valid
    IncludeThisDate = Not IsError(Application.Match(Weekday(Right$(url, 8), vbSunday), Array(3, 4, 5)))
End Function

Public Function SheetExists(ByVal sheetName As String) As Boolean '<==  function by @Rory
    SheetExists = Evaluate("ISREF('" & sheetName & "'!A1)")
End Function

Public Function GetAllUrls(ByVal startRow As Long, ByVal ws As Worksheet, ByVal columnName As String) As Variant
    'transpose used based on premise no more than a couple of years of dates
    'startRow is start row for urls, ws is sheet where urls found, columnName is string representation of column for urls e.g. "A"
    With ws
        GetAllUrls = Application.Transpose(ws.Range("A" & startRow & ":A" & .Cells(.rows.Count, columnName).End(xlUp).Row).Value)
    End With
End Function

Public Function CreateWorksheet(ByVal sheetName As String) As Worksheet
    Dim ws As Worksheet
    If SheetExists(sheetName) Then
        Set ws = ThisWorkbook.Worksheets(sheetName)
        'do something.... clear it? Then add new data to it?
    Else
        Set ws = ThisWorkbook.Worksheets.Add
        ws.Name = sheetName
    End If
    Set CreateWorksheet = ws
End Function

Public Function Epoch2Date(ByVal E As Currency, Optional msFrac) As Date '@ Schmidt http://www.vbforums.com/showthread.php?805245-EPOCH-to-Date-and-vice-versa
    Const Estart As Double = #1/1/1970#
    msFrac = 0
    If E > 10000000000@ Then E = E * 0.001: msFrac = E - Int(E)
    Epoch2Date = Estart + (E - msFrac) / 86400
End Function

【讨论】:

  • 非常感谢您的快速回复。显然我是新手。我需要将 JsonConverter 放入另一个模块吗?
  • 非常感谢您分享您的知识+1
猜你喜欢
  • 2017-05-20
  • 2020-12-04
  • 2013-05-10
  • 1970-01-01
  • 2012-04-12
  • 1970-01-01
  • 2012-03-21
  • 2021-06-27
  • 1970-01-01
相关资源
最近更新 更多