How to get all the links leading to the next page?

【Question title】: How to get all the links leading to the next page?
【Posted】: 2018-01-03 20:47:39
【Question description】:

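I've written some VBA code to collect all the links on a web page that lead to the following pages. The highest page number in those links is 255. When I run my script I get 6906 links in total, which means the loop revisits the pager again and again and I keep recording the same links. After filtering out duplicates, I can see that there are 254 unique links. My goal is to avoid hard-coding the highest page number for the iteration. Here is what I'm trying: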

Sub YifyLink()
    Const link = "https://www.yify-torrent.org/search/1080p/"
    Dim http As New XMLHTTP60, html As New HTMLDocument, htm As New HTMLDocument
    Dim x As Long, y As Long, item_link as String

    With http
        .Open "GET", link, False
        .send
        html.body.innerHTML = .responseText
    End With

    For Each post In html.getElementsByClassName("pager")(0).getElementsByTagName("a")
        If InStr(post.innerText, "Last") Then
            x = Split(Split(post.href, "-")(1), "/")(0)
        End If
    Next post
    For y = 0 To x
        item_link = link & "t-" & y & "/"

        With http
            .Open "GET", item_link, False
            .send
            htm.body.innerHTML = .responseText
        End With
        For Each posts In htm.getElementsByClassName("pager")(0).getElementsByTagName("a")
            I = I + 1: Cells(I, 1) = posts.href
        Next posts
    Next y
End Sub

The element containing the links:

<div class="pager"><a href="/search/1080p/" class="current">1</a> <a href="/search/1080p/t-2/">2</a> <a href="/search/1080p/t-3/">3</a> <a href="/search/1080p/t-4/">4</a> <a href="/search/1080p/t-5/">5</a> <a href="/search/1080p/t-6/">6</a> <a href="/search/1080p/t-7/">7</a> <a href="/search/1080p/t-8/">8</a> <a href="/search/1080p/t-9/">9</a> <a href="/search/1080p/t-10/">10</a> <a href="/search/1080p/t-11/">11</a> <a href="/search/1080p/t-12/">12</a> <a href="/search/1080p/t-13/">13</a> <a href="/search/1080p/t-14/">14</a> <a href="/search/1080p/t-15/">15</a> <a href="/search/1080p/t-16/">16</a> <a href="/search/1080p/t-17/">17</a> <a href="/search/1080p/t-18/">18</a> <a href="/search/1080p/t-19/">19</a> <a href="/search/1080p/t-20/">20</a> <a href="/search/1080p/t-21/">21</a> <a href="/search/1080p/t-22/">22</a> <a href="/search/1080p/t-23/">23</a> <a href="/search/1080p/t-2/">Next</a> <a href="/search/1080p/t-255/">Last</a> </div>

The results I get (partial):

about:/search/1080p/t-20/
about:/search/1080p/t-21/
about:/search/1080p/t-22/
about:/search/1080p/t-23/
about:/search/1080p/t-255/

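【Question discussion】:

Why scrape the links off the page (especially since they are not all on the page)? Why not generate them yourself?
Because the page I'm working with may not be the first page. And how can I iterate without hard-coding the last number?
The page you are currently scraping appears to tell you the highest page number that exists (255), so can't you just scrape that number and then loop from 1 to that number to generate all 255 links?
That's what I want to do. However, I can't integrate the idea properly to make it work.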
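The suggestion in the comments above (read the highest page number from the "Last" link, then generate the links yourself) could be sketched roughly as follows. The helper names `PageNumberFromHref` and `BuildPageLinks` are hypothetical and not part of the original post. The sketch assumes the pager hrefs keep the `/t-<number>/` shape shown in the question, and that page 1 is the bare search URL, as the pager markup suggests.

```vba
Option Explicit

' Hypothetical helper: extracts the page number from an href such as
' "about:/search/1080p/t-255/". Returns 0 when no "/t-" segment is found.
' It searches for "/t-" rather than splitting on "-", so a hyphen in the
' host name (e.g. "yify-torrent") does not confuse it.
Function PageNumberFromHref(ByVal href As String) As Long
    Dim pos As Long
    pos = InStrRev(href, "/t-")
    If pos = 0 Then Exit Function
    ' Val reads leading digits and stops at the trailing "/".
    PageNumberFromHref = CLng(Val(Mid$(href, pos + 3)))
End Function

' Hypothetical helper: builds one link per page from 1 to lastPage.
' Assumes page 1 is the base link itself and page n is base & "t-n/".
Function BuildPageLinks(ByVal baseLink As String, ByVal lastPage As Long) As Collection
    Dim links As New Collection
    Dim n As Long
    For n = 1 To lastPage
        If n = 1 Then
            links.Add baseLink
        Else
            links.Add baseLink & "t-" & n & "/"
        End If
    Next n
    Set BuildPageLinks = links
End Function
```

With these in place, the question's first loop would only need to pass the href of the "Last" link to `PageNumberFromHref`, and the second loop (which re-reads the pager on every page and produces the duplicates) could be replaced by a single call to `BuildPageLinks`.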

Tags: vba web-scraping web-crawler

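【Solution 1】:

The idea is to scrape the pages in a loop, find something to compare against, and exit the loop as soon as the check fails.

That check could be, for example, looking up a key in a dictionary, testing whether an element exists, or any other logic specific to your problem.

In your case, the site keeps serving page 255 for every page number after it. That is our clue: we can compare an element belonging to page (n) with the same element on page (n-1).

For example, if the element on page 256 is the same as the one on page 255, exit the loop/sub. Please see the sample code below: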

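Sub yify()
    ' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library".
    Const mlink = "https://www.yify-torrent.org/search/1080p/t-"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, posts As Object
    Dim pageno As Long, rowno As Long

    pageno = 1
    rowno = 1

    Do
        ' Fetch the current page synchronously and load it into the HTML document.
        With http
            .Open "GET", mlink & pageno & "/", False
            .send
            html.body.innerHTML = .responseText
        End With

        Set posts = html.getElementsByClassName("mv")
        ' Exit when the last value written to column A equals the link text of
        ' the 18th "mv" element, i.e. this page repeats the previous one.
        If Cells(rowno, 1) = posts(17).getElementsByTagName("a")(0).innerText Then Exit Do

        ' Write the text of the first <div> of each "mv" element to column A.
        For Each post In posts
            With post.getElementsByTagName("div")
                If .Length Then
                    rowno = rowno + 1
                    Cells(rowno, 1) = .Item(0).innerText
                End If
            End With
        Next post
        Debug.Print "pageno: " & pageno & " completed."
        pageno = pageno + 1
    Loop
End Sub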
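
The answer also mentions checking keys against a dictionary as a possible stop condition. A minimal sketch of that variant is shown here, assuming a late-bound `Scripting.Dictionary` (available on Windows) and a hypothetical helper named `AddNewLinks`, neither of which appears in the original answer. The helper records the links it has not seen before and returns how many were new, so the calling loop can exit once a page contributes nothing new.

```vba
Option Explicit

' Hypothetical helper: adds every link in pageLinks that is not yet a key
' of the dictionary "seen" and returns the number of newly added links.
' A return value of 0 means the page only repeated known links.
Function AddNewLinks(ByVal seen As Object, ByVal pageLinks As Collection) As Long
    Dim link As Variant
    Dim key As String
    For Each link In pageLinks
        key = CStr(link)
        If Not seen.Exists(key) Then
            seen.Add key, True
            AddNewLinks = AddNewLinks + 1
        End If
    Next link
End Function

' Usage sketch with made-up sample data instead of real HTTP requests.
Sub DemoAddNewLinks()
    Dim seen As Object
    Dim page As New Collection

    Set seen = CreateObject("Scripting.Dictionary")
    page.Add "/search/1080p/t-2/"
    page.Add "/search/1080p/t-3/"
    page.Add "/search/1080p/t-2/"   ' duplicate within the same page

    Debug.Print AddNewLinks(seen, page)   ' first pass: two links are new
    Debug.Print AddNewLinks(seen, page)   ' second pass: nothing new, so a loop would exit
End Sub
```

Because the dictionary keeps only unique keys, this also removes the duplicates described in the question without any separate filtering step.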

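【Discussion】:

Thanks, Tehscript, for your solution. It never goes off track. Congrats on your new achievement. It does seem to take longer than I expected, though. I hope to see 10 in place of 1 very soon. By the way, I've got an amazing script that gets you all the links in less than a second. Here it is: "dropbox.com/s/2na6nfvipmsobat/For%20Tehscript.txt?dl=0"
Thanks, SMth80, I'll take a look when I have time.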