【Question title】: Web page navigation reverting back to page 1
【Posted】: 2021-07-03 08:11:07
【Question】:

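I am scraping some Yellow Pages data, and the extraction itself works fine. My problem is page navigation. Going from page 1 to page 2 works, but when the code tries to move on to page 3 it returns to page 1 and extracts the data again. The extraction is fine; the navigation is the problem.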

YellowPage.ca

Here is what I have found and believe to be the cause, but I do not know how to fix it.

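Once the browser is on page 2, the "emptyPageButton" element is replaced by a link that has the same class as the NEXT PAGE link. So instead of moving forward to page 3, the code goes back to page 1. If I ask for 10 pages, it extracts pages 1 and 2 five times each, because it keeps bouncing between the two pages.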

I have made several attempts, but none of them work. I can reach page 2, and then it goes back to page 1.

WITH CLASS: works up to page 2, then returns to page 1

''' Searches Number of Pages entered in Sheet20 range J9

    If pageNumber >= Replace(Worksheets("Sheet20").Range("J9").Value, "", "+") Then Exit Do
       Set nextPageElement = HTML.getElementsByClassName("ypbtn btn-theme pageButton ")(0)
       'Set nextPageElement = HTML.getElementsByClassName("ypbtn btn-theme pageButton ")(1)
       'Set nextPageElement = HTML.getElementsByClassName("ypbtn btn-theme pageButton ")(0).children (0)
       'Set nextPageElement = HTML.getElementsByClassName("ypbtn btn-theme pageButton ")(1).children (0)
       'Set nextPageElement = HTML.getElementsByClassName("ypbtn btn-theme pageButton ")(1).children (1)
       'Set nextPageElement = HTML.getElementsByClassName("view_more_section_noScroll ")(0).getElementsByTagName("a")(1)
          If nextPageElement Is Nothing Then Exit Do
             nextPageElement.Click 'next web page
             Application.Wait Now + TimeValue("00:00:05")
    

WITH QUERY SELECTOR: works up to page 2, then returns to page 1

''' Searches Number of Pages entered in Sheet20 range J9

    If pageNumber >= Replace(Worksheets("Sheet20").Range("J9").Value, "", "+") Then Exit Do
       Set nextPageElement = HTML.querySelector(".view_more_section_noScroll .pageButton")
          If Not nextPageElement Is Nothing Then
             nextPageElement.Click
             Application.Wait Now + TimeValue("00:00:05")
          Else:
             Exit Do
         End If

Snippet from page 1

<div class="view_more_section_noScroll">
  <div class="emptyPageButton"></div>
  <span class="pageCount">
<span class="bold">
1 /
</span>
  <span class="">
37</span>
  </span>
  <a href="/search/si/2/car+dealership/Toronto+ON" data-analytics="{&quot;event_name&quot;:&quot;click - load_more - Serp &quot;,&quot;lk_se_id&quot;:&quot;f32f0ee7-8492-46dd-87da-7b621c162879_Y2FyIGRlYWxlcnNoaXA_VG9yb250byBPTg&quot;,&quot;lk_name&quot;:&quot;next_serp&quot;}"
    class="ypbtn btn-theme pageButton">Next
&gt;&gt;</a>
</div>

Snippet from page 2 onwards

<div class="view_more_section_noScroll">
  <a href="/search/si/1/car+dealership/Toronto+ON" data-analytics="{&quot;event_name&quot;:&quot;click - previous_page - Serp &quot;,&quot;lk_se_id&quot;:&quot;f32f0ee7-8492-46dd-87da-7b621c162879_Y2FyIGRlYWxlcnNoaXA_VG9yb250byBPTg&quot;,&quot;lk_name&quot;:&quot;previous_serp&quot;}"
    class="ypbtn btn-theme pageButton">&lt;&lt; Previous</a>
  <span class="pageCount">
<span class="bold">
2 /
</span>
  <span class="">
37</span>
  </span>
  <a href="/search/si/3/car+dealership/Toronto+ON" data-analytics="{&quot;event_name&quot;:&quot;click - load_more - Serp &quot;,&quot;lk_se_id&quot;:&quot;f32f0ee7-8492-46dd-87da-7b621c162879_Y2FyIGRlYWxlcnNoaXA_VG9yb250byBPTg&quot;,&quot;lk_name&quot;:&quot;next_serp&quot;}"
    class="ypbtn btn-theme pageButton">Next
&gt;&gt;</a>
</div>

Question: can someone suggest the correct class or querySelector for the navigation?


Thanks in advance.

'''########################## Update, 4 August 2021 #####################

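The full code is large, so I have cut most of it to make it easier to read, since the only problem is page navigation. This code should give you an idea of what I am trying to do. At the moment it overwrites previously extracted results because I deleted something from the code by mistake. Please ignore that for now, since page navigation is the only issue.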

Private Sub YellowPagesCa()

Dim HTML As htmlDocument
Dim objIE As Object
Dim result As String 'string variable that will hold our result link
Dim pageNumber As Long ' page no.
Dim nextPageElement As Object 'page element
Dim HtmlText As Variant ' for html data
Dim wsSheet As Worksheet ' WorkSheet
Dim wb As Workbook
Dim sht As Worksheet

        Set wb = ThisWorkbook
            Set wsSheet = wb.Sheets("YellowPages")
             Set sht = ThisWorkbook.Worksheets("YellowPages")
              
'+++++ Internet Explorer ++++++
        Set objIE = New InternetExplorer 'start a new instance of Internet Explorer and assign it to objIE
        objIE.Visible = True
            objIE.navigate "https://www.yellowpages.ca/search/si/1/car+dealer/Toronto+ON"
            
        Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop 'wait here a few seconds while the browser is busy
        
        Set HTML = objIE.document
        Set elements = HTML.getElementsByClassName("listing_right_section")

    For Each element In elements
            DoEvents
''' Element 1
        If element.getElementsByClassName("listing__name--link listing__link jsListingName")(0) Is Nothing Then
            wsSheet.Cells(sht.Cells(sht.Rows.Count, "A").End(xlUp).Row + 1, "A").Value = "-"
        Else
            HtmlText = element.getElementsByClassName("listing__name--link listing__link jsListingName")(0).href
            wsSheet.Cells(sht.Cells(sht.Rows.Count, "A").End(xlUp).Row + 1, "A").Value = HtmlText
        End If
         
    'End If
Next element

    Do

'''###############      PAGE NAVIGATION    ##############

    'Searches Number of Pages entered in
    If pageNumber >= 5 Then Exit Do 'Replace(Worksheets("Sheet20").Range("J9").Value, "", "+") Then Exit Do

    Set nextPageElement = HTML.querySelector(".view_more_section_noScroll .pageButton")
   ' Set nextPageElement = HTML.getElementsByClassName("ypbtn btn-theme pageButton ")(0)
        If Not nextPageElement Is Nothing Then
           nextPageElement.Click
            Application.Wait Now + TimeValue("00:00:05")
        Else:
            Exit Do
        End If

    Do While objIE.Busy = True Or objIE.readyState <> 4
    DoEvents
    Loop
        Set HTML = objIE.document
        pageNumber = pageNumber + 1
  Loop
                
        objIE.Quit ' end and clear browser
            Set objIE = Nothing
            Set HTML = Nothing
            Set nextPageElement = Nothing
            Set HtmlText = Nothing
            Set element = Nothing
        Complete.show
   'End If
  
End Sub

【Question comments】:

    Tags: excel vba web-scraping screen-scraping


    【Solution 1】:

    You can loop while the following is true:

    ie.document.querySelectorAll(".pageCount + a").Length <> 0
    

    Inside that loop, either click the next button or navigate to its href:

    ie.document.querySelector(".pageCount + a").click
    

    ie.Navigate2 ie.document.querySelector(".pageCount + a").href
    

    This terminates when there is no longer a next button.
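
For readers unsure where those lines go, the following is a minimal sketch of how the pieces might fit into one loop. It assumes a reference to "Microsoft Internet Controls" is set. The procedure name, the `MAX_PAGES` limit, and the extraction placeholder are illustrative assumptions, not part of the original answer.

```vba
Option Explicit

' Sketch only: requires a reference to "Microsoft Internet Controls" (SHDocVw).
' LoopWhileNextLinkExists and MAX_PAGES are hypothetical names for illustration.
Public Sub LoopWhileNextLinkExists()

    Const MAX_PAGES As Long = 5          'assumed safety limit on pages visited
    Dim ie As SHDocVw.InternetExplorer
    Dim pagesVisited As Long

    Set ie = New SHDocVw.InternetExplorer

    With ie
        .Visible = True
        .Navigate2 "https://www.yellowpages.ca/search/si/1/car+dealership/Toronto+ON"
        While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend

        Do
            pagesVisited = pagesVisited + 1

            'extract the listings from .document here

            If pagesVisited >= MAX_PAGES Then Exit Do

            'The adjacent-sibling selector only matches the link that comes
            'right after .pageCount (Next), not the Previous link before it.
            If .document.querySelectorAll(".pageCount + a").Length = 0 Then Exit Do

            .Navigate2 .document.querySelector(".pageCount + a").href
            While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend
        Loop

        .Quit
    End With

End Sub
```

Navigating to the href rather than clicking means the existing Busy/readyState wait can be reused, so a fixed `Application.Wait` may not be needed.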


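    Alternatively, extract the page count from the first page and loop up to that count, substituting the current page number in the URL (for example, replacing 1 with 2 to get page 2).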

    Option Explicit
    
    Public Sub PrintSomeInfo()
    
        Dim ie As SHDocVw.InternetExplorer, re As Object
    
        Set ie = New SHDocVw.InternetExplorer
        Set re = CreateObject("VBScript.RegExp")
        
        With re
            .Global = False
            .MultiLine = False
            .Pattern = "(si\/)(\d+)(\/)"
        End With
        
        With ie
        
            .Visible = True
            
            .Navigate2 "https://www.yellowpages.ca/search/si/1/car+dealership/Toronto+ON"
            
            While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend
            
            Dim pageCount As Long, i As Long
            
            pageCount = CLng(.document.querySelector(".pageCount .bold + span").innerText)
            
            'already on page one so just loop from 2 to pageCount
            For i = 2 To pageCount
                 
                .Navigate2 re.Replace(.document.url, "$1" & CStr(i) & "$3")
                
                While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend
                
                'do something with new page
            Next
            
            Stop
           
            .Quit
        End With
    
    End Sub
    

    Regex:

    The regex matches 3 groups in the URL, and the replacement swaps the second group, i.e. the current page number, for the new page number.
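
To try the substitution on its own, without a browser, the same pattern can be wrapped in a small function. This is a sketch; `UrlForPage` and `DemoUrlForPage` are hypothetical names introduced here for illustration.

```vba
Option Explicit

' Returns url with the page number that follows "si/" replaced by newPage.
' UrlForPage is a hypothetical helper, not part of the original answer.
Public Function UrlForPage(ByVal url As String, ByVal newPage As Long) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    With re
        .Global = False          'replace only the first match
        .MultiLine = False
        .Pattern = "(si\/)(\d+)(\/)"
    End With
    'Group 1 ("si/") and group 3 ("/") are kept; group 2 is swapped for newPage.
    UrlForPage = re.Replace(url, "$1" & CStr(newPage) & "$3")
End Function

Public Sub DemoUrlForPage()
    'Prints the rewritten URL to the Immediate window.
    Debug.Print UrlForPage("https://www.yellowpages.ca/search/si/1/car+dealership/Toronto+ON", 2)
End Sub
```

Running `DemoUrlForPage` makes it easy to confirm that only the page segment changes before using the replacement inside the navigation loop.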

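    【Comments】:

    • QHarr, thanks as always for your help. I like your first approach, but with my limited VBA I am struggling to fit it into my existing navigation code. Could you explain a bit more about where and how I would implement your first approach? Thanks.
    • Could you provide enough code to reproduce your problem? Hard-code the URL rather than reading it from a worksheet.
    • Please see the update to my question; the code is there.
    • This is now resolved. I added part of QHarr's answer to my code and it fixed the problem; I can now page through normally.
    • Glad to hear it.
    【Solution 2】: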

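    Thanks to QHarr's answer, I was able to solve the problem by using part of it. I combined my class and QuerySelector code with part of QHarr's QuerySelector answer. I can now navigate through the pages normally.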

    Do
    ' Searches Number of Pages entered in Sheet20 J9
        If pageNumber >= Replace(Worksheets("Sheet20").Range("J9").Value, "", "+") Then Exit Do
            'Set nextPageElement = HTML.querySelector(".view_more_section_noScroll .pageButton")
            Set nextPageElement = HTML.getElementsByClassName("ypbtn btn-theme pageButton")(0) '' using class and NOT QuerySelector here
            If Not nextPageElement Is Nothing Then
             nextPageElement.document.querySelector(".pageCount + a").Click ''NEW PART
                Application.Wait Now + TimeValue("00:00:05")
            Else:
                Exit Do
            End If
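
One way to see why the class lookup alone goes backwards on page 2 is to print what each selector matches on the page that is currently loaded. The sketch below assumes `doc` is the document of the loaded page (for example `objIE.document`); `PrintPagerLinks` is a hypothetical helper name. Going by the page 2 snippet in the question, index 0 of the `.pageButton` matches should be the Previous link, while `.pageCount + a` should match only the Next link.

```vba
Option Explicit

' Diagnostic sketch: prints the links matched by each selector for the page
' currently loaded in doc. PrintPagerLinks is a hypothetical helper name.
Public Sub PrintPagerLinks(ByVal doc As Object)
    Dim buttons As Object, nextLinks As Object, i As Long

    'Every pager link shares the pageButton class, Previous included.
    Set buttons = doc.querySelectorAll(".view_more_section_noScroll .pageButton")
    For i = 0 To buttons.Length - 1
        Debug.Print "pageButton(" & i & "): " & buttons.Item(i).href
    Next i

    'Only an anchor placed immediately after .pageCount matches here.
    Set nextLinks = doc.querySelectorAll(".pageCount + a")
    If nextLinks.Length = 0 Then
        Debug.Print "no next link"
    Else
        Debug.Print "next link: " & nextLinks.Item(0).href
    End If
End Sub
```

Calling `PrintPagerLinks objIE.document` after each page load shows in the Immediate window which link a given selector would click.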
    

    【Comments】:
