How can I fetch only the first 40KB of a page via cURL: answers

【Question title】: How can I fetch only the first 40KB of a page via cURL
【Posted】: 2017-09-16 11:07:19
【Question description】:

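So I don't want to pull the entire page, only the first 40KB of it, just like the Facebook Debugger tool does.

My goal is to grab the social media metadata, i.e. og:image and so on.

Any programming language is fine, PHP or Python.

I already have code that uses file_get_contents/cURL with phpQuery, and I know how to parse the HTML I receive. My question is: "how do I fetch only the first n KB of a page without fetching the whole page?"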

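【Comments】:

Maybe this will help: stackoverflow.com/a/12014561/661872
@LawrenceCherone I already have code that uses file_get_contents/cURL with phpQuery, and I know how to parse the HTML I receive. My question is: "how do I fetch only the first n KB of a page without fetching the whole page?"
This seems to have been answered already here.
The --range curl command-line option looks like a good fit, but the man page does not go into much detail: curl.haxx.se/docs/manpage.html
To be fair, you could consider using curl with CURLOPT_WRITEFUNCTION and abort once 40KB has been read; you could also abort as soon as you hit </head>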

Tags: php python curl

【Solution 1】:

This is not specific to Facebook or any other social media site, but you can get the first 40 KB with Python like this:

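import urllib.request  # Python 3; the original answer used Python 2's urllib2
your_link = "https://example.com/"  # placeholder URL, replace with the page you want
# read() with a size argument returns at most that many bytes (40000 is roughly 40 KB)
start = urllib.request.urlopen(your_link).read(40000)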

【Comments】:

Will this stop loading the page once the first 40 KB has arrived?
@Umair It only reads the first 40KB. So yes, it stops after that.
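
The idea raised in the question comments (stop after 40 KB, or earlier once </head> shows up) can be sketched in Python as well. This is only a minimal sketch: the helper name read_head, its parameters, and the 1 KB chunk size are assumptions made for illustration and are not part of any answer above. It works on any binary file-like object, including the response returned by urllib.request.urlopen.

```python
import io


def read_head(stream, limit=40 * 1024, chunk_size=1024, marker=b"</head>"):
    """Read from a binary stream until `marker` is seen or `limit` bytes are read.

    Hypothetical helper: returns the bytes read so far, cut right after the
    marker when it is found (the match is case-insensitive).
    """
    buf = bytearray()
    while len(buf) < limit:
        # Never ask for more than what is left of the byte budget.
        chunk = stream.read(min(chunk_size, limit - len(buf)))
        if not chunk:  # end of stream
            break
        buf += chunk
        # Search the whole buffer so a marker split across chunks is still found.
        pos = buf.lower().find(marker)
        if pos != -1:
            return bytes(buf[:pos + len(marker)])
    return bytes(buf)


# Possible usage (needs network access, so it is left commented out):
# import urllib.request
# with urllib.request.urlopen("https://example.com/") as resp:
#     head = read_head(resp)
```

Leaving the with block closes the connection, so the rest of the page is not read by this code. Data the server has already sent may still sit in socket buffers.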

【Solution 2】:

This works:

curl -r 0-40000 -o 40k.raw https://www.keycdn.com/support/byte-range-requests/

-r stands for range:

From the curl man page:

-r, --range <range>
          (HTTP FTP SFTP FILE) Retrieve a byte range (i.e a partial document) from a HTTP/1.1, FTP or SFTP server or a local  FILE.  Ranges  can  be
          specified in a number of ways.

          0-499     specifies the first 500 bytes

          500-999   specifies the second 500 bytes

          -500      specifies the last 500 bytes

          9500-     specifies the bytes from offset 9500 and forward

          0-0,-1    specifies the first and last byte only(*)(HTTP)

More information can be found in this article: https://www.keycdn.com/support/byte-range-requests/
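
The same byte-range request can be sketched in Python by sending a Range header. As the man page excerpt shows, range ends are inclusive, so 0-40000 asks for 40001 bytes and 0-40959 asks for exactly 40 KB. The function names build_range_request and fetch_prefix are made up for this sketch. A server is free to ignore Range and answer 200 with the full body, which is why the read is capped as well.

```python
import urllib.request


def build_range_request(url, limit=40 * 1024):
    """Build a GET request asking only for bytes 0..limit-1 (hypothetical helper)."""
    # The end offset of a Range is inclusive, so limit - 1 yields `limit` bytes.
    return urllib.request.Request(url, headers={"Range": "bytes=0-%d" % (limit - 1)})


def fetch_prefix(url, limit=40 * 1024):
    """Return (data, honored): at most `limit` bytes and whether Range was honored."""
    with urllib.request.urlopen(build_range_request(url, limit)) as resp:
        # 206 Partial Content means the server honored the range;
        # 200 means it ignored the header and is sending the whole body.
        honored = resp.status == 206
        # Cap the read either way so a 200 response is not downloaded in full.
        return resp.read(limit), honored
```

Calling fetch_prefix("https://www.keycdn.com/support/byte-range-requests/") would perform the request; whether the result is a 206 depends on the server.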

Just in case, here is a basic example of how to do this with Go:

package main

import (
    "fmt"
    "io"
    "io/ioutil"
    "log"
    "net/http"
)

func main() {
    response, err := http.Get("https://google.com")
    if err != nil {
        log.Fatal(err)
    }
    defer response.Body.Close()
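    // LimitReader stops after 40000 bytes, so ReadAll never reads past that point.
    data, err := ioutil.ReadAll(io.LimitReader(response.Body, 40000))
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("data = %s\n", data)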
}

【Comments】: