VBA: Getting to a URL location without parsing IE.Document via IE automation

[Question title]: VBA Getting to a URL Location without parsing the IE.Document with IE Automation
[Posted]: 2014-03-30 20:09:17
[Question description]:

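In the code below, we use IE automation to get from here:

Location 1

"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" & Ticker & "&type=10-Q&dateb=&owner=exclude&count=20"

to a location like this:

Location 2

https://www.sec.gov/Archives/edgar/data/10795/000119312514042815/bdx-20131231.xml

Is there a way to go from location 1 to location 2 without IE automation, using something more reliable, secure, and faster?

For completeness, here is the full code we have now; if you run it, you will see that it relies heavily on IE: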

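Option Explicit
' Requires a reference (Tools > References) to Microsoft Internet Controls,
' which provides InternetExplorer and READYSTATE_COMPLETE.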

Sub MadMule2()
    Dim IE As InternetExplorer
    Dim el
    Dim els
    Dim colDocLinks As New Collection
    Dim Ticker As String
    Dim lnk
    Dim intCounter As Integer

    Set IE = New InternetExplorer

    IE.Visible = False

    Ticker = Worksheets("Sheet1").Range("A1").Value

    LoadPage IE, "https://www.sec.gov/cgi-bin/browse-edgar?" & _
                  "action=getcompany&CIK=" & Ticker & "&type=10-Q" & _
                  "&dateb=&owner=exclude&count=20"

    Set els = IE.document.getElementsByTagName("a")
    For Each el In els
        If Trim(el.innerText) = "Documents" Then
            colDocLinks.Add el.href
        End If
    Next el

    intCounter = 1

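    For Each lnk In colDocLinks
        LoadPage IE, CStr(lnk)
        For Each el In IE.document.getElementsByTagName("a")
            ' Match links ending in a digit followed by ".xml", e.g. bdx-20131231.xml
            If el.href Like "*[0-9].xml" Then
                ' Pass the URL string explicitly rather than the element object
                ActiveWorkbook.XmlMaps.Add(el.href, "xbrl").Name = "xbrl Map"
            End If
        Next el
    Next lnk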
End Sub

Sub LoadPage(IE As InternetExplorer, URL As String)
    IE.navigate URL
    Do While IE.Busy Or IE.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop
End Sub

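Addendum

Q: Is there a way to go from location 1 to location 2 without using IE Automation and finding something more reliable, secure and faster?

Can you expand on that?

Here is what I mean:

A: This is a comment from user2140261 on the block of code we received:

You should look at MSXML; it is faster, more secure, and more reliable than IE automation.

Because the code opens Internet Explorer and parses the source page to find the href that leads to the desired web location, we would like to know whether there is a way to reach location 2 without using IE. Can it be done with MSXML, as user2140261 suggests?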

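[Question discussion]:

"Is there a way to go from location 1 to location 2 without using IE Automation and finding something more reliable, secure and faster?" Can you elaborate? Do you have anything specific in mind? What does "going from location 1 to location 2" mean?
I have revised my post. Please let me know whether the explanation is sufficient.
Use Ajax: stackoverflow.com/questions/8044423/json-import-to-excel My answer is here: stackoverflow.com/questions/18835640/…
I just saw it. I will look into it carefully.
The URL is the location on the server where the XML file is stored. An Ajax call is an HTTP request to that URL made in the background; it connects to the server directly rather than going through the client (crawling the DOM), as you do in your example. In the linked example I posted, a JSON object is returned, which is why a JSON parser is used there. If you are working with XML, you will probably need to look for an XML parser in VBA.

Tags: xml vba excel web-scraping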

[Solution 1]:

Here is an example using XHR:

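Option Explicit
' Requires a reference (Tools > References) to Microsoft HTML Object Library,
' which provides the early-bound HTMLDocument type.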
Public Sub GetLinks()
    Dim ticker As String, html As New HTMLDocument, links As Object, i As Long
    ticker = [A1]                                'example is 81251
    Set html = GetHTMLDocument("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" & ticker)
    If html Is Nothing Then Exit Sub
    Set links = html.querySelectorAll("#documentsbutton")
    If links Is Nothing Then Exit Sub
    For i = 0 To links.Length - 1
        Debug.Print GetAbsoluteURL(links(i).getAttribute("href"))
    Next i
End Sub

Public Function GetAbsoluteURL(ByVal relativeURL As String) As String
    If InStr(relativeURL, "about:/") > 0 Then
        GetAbsoluteURL = Replace$(relativeURL, "about:/", "https://www.sec.gov/")
    Else
        GetAbsoluteURL = relativeURL
    End If
End Function

Public Function GetHTMLDocument(ByVal URL As String) As HTMLDocument
    Dim sResponse As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .send
        If .Status <> 200 Then
            Exit Function
        Else
            sResponse = StrConv(.responseBody, vbUnicode)
        End If
    End With
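    Set GetHTMLDocument = New HTMLDocument
    ' Drop anything before the DOCTYPE; if none is found, keep the whole response
    ' (Mid$ raises a run-time error when InStr returns 0)
    If InStr(1, sResponse, "<!DOCTYPE ") > 0 Then sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
    GetHTMLDocument.body.innerHTML = sResponse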
End Function

Example code used: 81251

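CSS selectors and .querySelectorAll

You can use the CSS selector "#documentsbutton" to target the document buttons. Although # denotes an id, and an id should normally be unique, this particular website appears to use the same id for every such element in the table.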

Applying the CSS selector in VBA:

Because multiple elements match, the document's .querySelectorAll method is used to return a nodeList of the items matched by the CSS selector:

html.querySelectorAll("#documentsbutton")

We can loop over the nodeList up to its .Length and access each document button element by index (its position in the nodeList, starting from 0):

For i = 0 To links.Length - 1

If you want the hyperlink information, you can simply extract the href attribute from each matched element:

links(i).getAttribute("href")

This returns a relative path, so we use a small helper function, GetAbsoluteURL, to obtain the absolute path.
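
To make the conversion concrete, here is a small check of GetAbsoluteURL as defined in the code above. It is only a sketch: the paths are made-up placeholders rather than real EDGAR paths, DemoGetAbsoluteURL is a hypothetical name, and the Sub assumes it sits in the same module (or project) as GetAbsoluteURL.

```vba
Public Sub DemoGetAbsoluteURL()
    ' Hypothetical href carrying the "about:/" prefix that GetAbsoluteURL looks for
    Debug.Assert GetAbsoluteURL("about:/some/path.htm") = "https://www.sec.gov/some/path.htm"

    ' A URL without the "about:/" prefix is returned unchanged
    Debug.Assert GetAbsoluteURL("https://www.sec.gov/some/path.htm") = "https://www.sec.gov/some/path.htm"
End Sub
```

If either assertion fails, execution breaks on that line in the VBA editor; if both hold, the Sub finishes silently.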

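Closing remarks:

Although it does not fully adhere to the Single Responsibility Principle, this does show some of the benefits of refactoring the code. The GetHTMLDocument function can then be reused to process the new URLs returned by GetAbsoluteURL.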
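
As a rough illustration of that reuse, the sketch below follows each document link and prints any link ending in a digit plus ".xml", which is the pattern the question's code matches with Like "*[0-9].xml". It depends on GetHTMLDocument and GetAbsoluteURL from the code above. PrintXmlLinks is a hypothetical name, and the sketch assumes that each filing page is served the same way as the listing page (status 200, containing a DOCTYPE) and that its anchors expose their links through .href. None of this has been verified against the live site.

```vba
Public Sub PrintXmlLinks()
    Dim ticker As String, listing As HTMLDocument, filing As HTMLDocument
    Dim buttons As Object, anchors As Object
    Dim i As Long, j As Long, href As String

    ticker = CStr(ActiveSheet.Range("A1").Value)

    ' Location 1: the filing list, using the query string from the question
    Set listing = GetHTMLDocument("https://www.sec.gov/cgi-bin/browse-edgar?" & _
                                  "action=getcompany&CIK=" & ticker & "&type=10-Q" & _
                                  "&dateb=&owner=exclude&count=20")
    If listing Is Nothing Then Exit Sub

    Set buttons = listing.querySelectorAll("#documentsbutton")
    For i = 0 To buttons.Length - 1
        ' Reuse the same helper for each filing page
        Set filing = GetHTMLDocument(GetAbsoluteURL(buttons(i).getAttribute("href")))
        If Not filing Is Nothing Then
            Set anchors = filing.getElementsByTagName("a")
            For j = 0 To anchors.Length - 1
                href = GetAbsoluteURL(anchors(j).href)
                ' Location 2 candidates: same pattern as in the question
                If href Like "*[0-9].xml" Then Debug.Print href
            Next j
        End If
    Next i
End Sub
```

The matching URLs are printed to the Immediate window. Each one could then be passed to ActiveWorkbook.XmlMaps.Add as in the question, with no Internet Explorer instance involved.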

[Discussion]: