【问题标题】:Scrape text from a website using Excel VBA使用 Excel VBA 从网站上抓取文本
【发布时间】:2015-04-23 11:20:33
【问题描述】:

我发现这个article 解释了如何使用 Excel VBA 从网站上抓取某些标签。

下面的代码从它找到的第一个<p>标签中获取内容:

Sub get_title_header()
Dim wb As Object
Dim doc As Object
Dim sURL As String
Dim lastrow As Long
lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row

For i = 2 To lastrow
    Set wb = CreateObject("internetExplorer.Application")
    sURL = Cells(i, 1)

    wb.navigate sURL
    wb.Visible = True

    While wb.Busy
        DoEvents
    Wend

    'HTML document
    Set doc = wb.document

    Cells(i, 2) = doc.title

    On Error GoTo err_clear
    Cells(i, 3) = doc.GetElementsByTagName("p")(0).innerText
    err_clear:
    If Err <> 0 Then
        Err.Clear
        Resume Next
    End If
    wb.Quit
    Range(Cells(i, 1), Cells(i, 3)).Columns.AutoFit
Next i

End Sub

我想让抓取工具获取网页上&lt;p&gt; 标记内的所有内容。所以我猜想缺少某种foreach 功能。

如何收集多个&lt;p&gt;标签的内容?

更新 工作代码!

Sub get_title_header()
Dim wb As Object
Dim doc As Object
Dim sURL As String
Dim lastrow As Long
Dim i As Integer
lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row

For i = 2 To lastrow
    Set wb = CreateObject("internetExplorer.Application")
    sURL = Cells(i, 1)

    wb.navigate sURL
    wb.Visible = True

    While wb.Busy
        DoEvents
    Wend

    'HTML document
    Set doc = wb.document

    Cells(i, 2) = doc.Title

    On Error GoTo err_clear

    Dim el As Object
    For Each el In doc.GetElementsByTagName("p")

        counter = counter + 1
        Cells(i, counter + 2).Value = Cells(counter + 1).Value & el.innerText

    Next el
    counter = 0

    err_clear:
    If Err <> 0 Then
        Err.Clear
        Resume Next
    End If
    wb.Quit
    Range(Cells(i, 1), Cells(i, 10)).Columns.AutoFit
Next i

End Sub

【问题讨论】:

    标签: excel vba web-scraping


    【解决方案1】:

    你快到了! doc.GetElementsByTagName("p") 返回HTMLParagraphElement 对象的集合,您使用doc.GetElementsByTagName("p")(0) 访问了这些对象的第一个条目。正如您所提到的,For Each 循环将让您依次访问每个:

    Sub get_title_header()
    Dim wb As Object
    Dim doc As Object
    Dim sURL As String
    Dim lastrow As Long
    Dim i As Integer
    lastrow = Sheet1.Cells(Rows.Count, "A").End(xlUp).Row
    
    For i = 2 To lastrow
    Set wb = CreateObject("internetExplorer.Application")
    sURL = Cells(i, 1)
    
    wb.navigate sURL
    wb.Visible = True
    
    While wb.Busy
        DoEvents
    Wend
    
    'HTML document
    Set doc = wb.document
    
    Cells(i, 2) = doc.Title
    
    On Error GoTo err_clear
    
    Dim el As Object
    For Each el In doc.GetElementsByTagName("p")
        Cells(i, 3).Value = Cells(i, 3).Value & ", " & el.innerText
    Next el
    
    err_clear:
    If Err <> 0 Then
    Err.Clear
    Resume Next
    End If
    wb.Quit
    Range(Cells(i, 1), Cells(i, 3)).Columns.AutoFit
    Next i
    
    End Sub
    

    【讨论】:

    • 您好,谢谢,尝试了您的代码,但它不会返回任何内容。是否缺少某些内容?再次感谢!
    • @RobbertT 我已经复制了我测试的代码。这是您的代码的修改版本,为简单起见,它没有任何工作表引用。此外,所有输出都会转到即时窗口 (Ctrl+G)。这在我的 PC 上运行良好,根据您的目的不应该很难修改
    • 太棒了,成功了!但是仍然试图弄清楚如何使它与excel表的a列中的url和b列中的p内容一起工作。你可以下载我提到的文章底部的示例,看看我的意思.谢谢
    • @RobbertT 又一次,你快到了。这只是遍历每个 &lt;p&gt; 标记并将文本连接到已经在第 3 列中的字符串
    • 这太酷了,我不知道 Excel 可以做到这一点。太棒了,像魅力一样工作。又一个要求;是否可以将在单独的单元格(列)中找到的每个

      的内容放在同一行上?谢谢!

    【解决方案2】:

    如果您只需要以纯文本形式获取网页内容,则此代码更简洁

    Function WEBSITE_TEXT(Destination As String) As String
    ' Requires a reference to Microsoft XML, v6.0
    ' Draws on the stackoverflow answer at bit.ly/parseXML
    Dim myRequest As XMLHTTP60
    Dim myDomDoc As DOMDocument60
    
        ' Check and clean inputs
        On Error GoTo exitRoute
    
        If Destination = "" Then
          WEBSITE_TEXT = ""
         Exit Function
        End If
    
        ' Read the XML data from the Google Maps API
        Set myRequest = New XMLHTTP60
        myRequest.Open "GET", Destination, False
        myRequest.send
    
        ' Parse HTML content
        Dim html As New HTMLDocument
        Dim text As String
        html.body.innerHTML = myRequest.responseText
    
        ' Return the website content
        text = html.body.innerText
        If Not html Is Nothing Then WEBSITE_TEXT = text
    exitRoute:
        ' Tidy up
        text = ""
        Set myRequest = Nothing
    End Function
    

    【讨论】:

    • 这行代码出现“用户定义类型未定义”错误:Dim html As HTMLDocument
    • 啊:除了引用Microsoft XML, v6.0 库外,还要添加对Microsoft HTML Object Library 的引用
    猜你喜欢
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2018-11-21
    相关资源
    最近更新 更多