Ruby code to get the size of a webpage in bytes (answers)

【Question title】: Ruby code to get size in bytes of a webpage
【Posted】: 2012-07-25 01:58:50
【Question description】:

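I want to calculate the size of a web page in bytes. For example, www.google.com is about 44 KB and facebook.com is about 17 KB. I tried using Nokogiri to measure the length of the HTML, but it gave 8 KB for Google and 32 KB for Facebook. I don't want to use any third-party tool; I want to calculate the size inside my application.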

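【Question discussion】:

If the HTML is, or can be, sent compressed over the 'net, do you want the size of the compressed data or the raw, uncompressed size of the response?
Nokogiri isn't the tool for this. It is only an XML/HTML parser.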
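
To make the compressed versus uncompressed distinction concrete, here is a minimal sketch using only the standard library. The helper names `compressed_size` and `uncompressed_size` are hypothetical, and the sketch assumes the usual Net::HTTP behavior in current Ruby versions: gzip responses are decoded automatically unless the caller sets Accept-Encoding itself.

```ruby
require 'net/http'
require 'uri'

# Size in bytes of the body as it travels over the wire when gzip is requested.
# Setting Accept-Encoding ourselves stops Net::HTTP from decoding the body.
def compressed_size(uri)
  request = Net::HTTP::Get.new(uri)
  request['Accept-Encoding'] = 'gzip'
  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  response.body.bytesize
end

# Size in bytes of the body after Net::HTTP has decoded any gzip/deflate encoding.
def uncompressed_size(uri)
  Net::HTTP.get_response(uri).body.bytesize
end
```

Calling both helpers with something like `URI('http://www.google.com/')` should show two different numbers whenever the server actually compresses the page; if the server ignores the gzip request, both return the same value.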

Tags: ruby ruby-on-rails-3 nokogiri page-size

【Solution 1】:

This code sample should get you on your way. It downloads the site and reads the size of the response body in bytes.

require 'net/http'
require 'uri'

class HttpSample
  def download_google_home
    # Behind a proxy, replace Net::HTTP below with
    # Net::HTTP::Proxy('ipaddress', portnumber), using the proxy's actual IP and port.
    url = URI.parse('http://www.google.com')
    http_response = Net::HTTP.get_response(url)
    # The body is a binary string by default, so bytesize and length agree here.
    puts http_response.body.bytesize # size in bytes
  end
end

s = HttpSample.new
s.download_google_home

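【Discussion】:

There's no need to save anything to a file. You can use http_response.body.length to get the size, in bytes, of the retrieved data.
Do you want the size of all the HTML, frames, subframes, images, CSS and JS the page uses, or only the HTML?
@paperids I tried your code, but it doesn't work for me; it returns nil.
@paperids What do I put in the port number part?
The port number here should be 80, the HTTP port.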

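【Solution 2】:

Using Net::HTTP::Head lets you ask the server for information about a page without having it return the page, which would waste the server's bandwidth and CPU time as well as yours. One of the returned headers should be Content-Length: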

require 'net/http'
request = Net::HTTP.new('google.com', 80)
head = request.request_head('/')

#<Net::HTTPMovedPermanently:0x102157ae0
    @body_exist = false,
    @read = true,
    @socket = nil,
    attr_accessor :body = nil,
    attr_reader :code = "301",
    attr_reader :header = {
                "location" => [
            [0] "http://www.google.com/"
        ],
            "content-type" => [
            [0] "text/html; charset=UTF-8"
        ],
                    "date" => [
            [0] "Thu, 26 Jul 2012 17:46:30 GMT"
        ],
                 "expires" => [
            [0] "Sat, 25 Aug 2012 17:46:30 GMT"
        ],
           "cache-control" => [
            [0] "public, max-age=2592000"
        ],
                  "server" => [
            [0] "gws"
        ],
          "content-length" => [
            [0] "219"
        ],
        "x-xss-protection" => [
            [0] "1; mode=block"
        ],
         "x-frame-options" => [
            [0] "SAMEORIGIN"
        ],
              "connection" => [
            [0] "close"
        ]
    },
    attr_reader :http_version = "1.1",
    attr_reader :message = "Moved Permanently"
>

This is a redirect, telling the browser that it needs to look elsewhere.
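
Following such a redirect by hand could be sketched roughly as below. The helper name `head_following_redirects` and the default limit of five hops are assumptions made for illustration; they are not part of the original answer.

```ruby
require 'net/http'
require 'uri'

# Sends a HEAD request and follows Location headers for up to `limit` hops.
# Returns the first response that is not a redirect with a Location header.
def head_following_redirects(url, limit = 5)
  uri = URI(url)
  limit.times do
    response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.request_head(uri.request_uri)
    end
    location = response['location']
    return response unless response.is_a?(Net::HTTPRedirection) && location

    # Location may be relative, so resolve it against the current URI.
    uri = uri.merge(location)
  end
  raise "too many redirects for #{url}"
end
```

With the example above, `head_following_redirects('http://google.com/')` would be expected to issue the second HEAD request against the URL given in the Location header.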

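Unfortunately, not all HTTP servers return a Content-Length header. The page may be created dynamically, so an accurate size cannot be known until the content has actually been rendered and sent.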

After following the redirect above, another HEAD request results in:

#<Net::HTTPOK:0x10217e8c0
    @body_exist = false,
    @read = true,
    @socket = nil,
    attr_accessor :body = nil,
    attr_reader :code = "200",
    attr_reader :header = {
              "set-cookie" => [
            [ 0] "NID=62=c2jRl25ItoF5YkVgNv3g2woB2A3iIqkY__EYX5BGst--KYmjNbfCeVL0FIUcq6jm6PqH_-YV6QFO_yNjy1BzMms-QJKPRsfcq0px030WVzKTMtMF9dJUJpS0XdV1NLOv; expires=Fri, 25-Jan-2013 17:50:22 GMT; path=/; domain=.google.com; HttpOnly",
            [ 1] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com",
            [ 2] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com",
            [ 3] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com",
            [ 4] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com",
            [ 5] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com",
            [ 6] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com",
            [ 7] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com",
            [ 8] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com",
            [ 9] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com",
            [10] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com",
            [11] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com",
            [12] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com",
            [13] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com",
            [14] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com",
            [15] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com",
            [16] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com",
            [17] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com",
            [18] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com",
            [19] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com",
            [20] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com",
            [21] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com",
            [22] "expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com",
            [23] "path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com",
            [24] "domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com",
            [25] "PREF=ID=51ce2f15ffbc5de1:FF=0:TM=1343325022:LM=1343325022:S=H8-1NoxuEbX7fepF; expires=Sat, 26-Jul-2014 17:50:22 GMT; path=/; domain=.google.com",
            [26] "NID=62=aO6oBKx_v48l5SqQrRDUiNxfOixEE0QnkQIBSZK4u0xS8cHGc7uXTUt6yJhIZTyCe_XWGn6t3-Ov4EvxPE8hAO7I89ao9RR9dLUyYPBB784fR12bJsqbkTaCVaZI7ihT; expires=Fri, 25-Jan-2013 17:50:22 GMT; path=/; domain=.google.com; HttpOnly"
        ],
                    "date" => [
            [0] "Thu, 26 Jul 2012 17:50:22 GMT"
        ],
                 "expires" => [
            [0] "-1"
        ],
           "cache-control" => [
            [0] "private, max-age=0"
        ],
            "content-type" => [
            [0] "text/html; charset=ISO-8859-1"
        ],
                     "p3p" => [
            [0] "CP=\"This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info.\""
        ],
                  "server" => [
            [0] "gws"
        ],
        "x-xss-protection" => [
            [0] "1; mode=block"
        ],
         "x-frame-options" => [
            [0] "SAMEORIGIN"
        ],
              "connection" => [
            [0] "close"
        ]
    },
    attr_reader :http_version = "1.1",
    attr_reader :message = "OK"
>

Notice that no Content-Length header was returned.
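
A small helper can make the "header may be missing" case explicit. This is only a sketch: the name `declared_size` is hypothetical, and ignoring non-2xx responses is a design choice made here because the 219 bytes reported with the 301 response describe the redirect notice rather than the page itself.

```ruby
require 'net/http'

# Returns the size announced in the Content-Length header as an Integer.
# Returns nil when the header is absent or the response is not a 2xx success.
def declared_size(response)
  return nil unless response.is_a?(Net::HTTPSuccess)

  response.content_length
end
```

Passing in the `head` object from the requests above, a nil result signals that the size has to be determined some other way.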

Hitting a site that returns a static page gives me a different response:

request = Net::HTTP.new('tools.ietf.org', 80)
head = request.request_head('/html/rfc2606')

#<Net::HTTPOK:0x100914370
    @body_exist = false,
    @read = true,
    @socket = nil,
    attr_accessor :body = nil,
    attr_reader :code = "200",
    attr_reader :header = {
                    "date" => [
            [0] "Thu, 26 Jul 2012 17:55:23 GMT"
        ],
                  "server" => [
            [0] "Apache/2.2.21 (Debian)"
        ],
        "content-location" => [
            [0] "rfc2606.html"
        ],
                    "vary" => [
            [0] "negotiate"
        ],
                     "tcn" => [
            [0] "choice"
        ],
           "last-modified" => [
            [0] "Sat, 26 May 2012 22:18:00 GMT"
        ],
                    "etag" => [
            [0] "\"d44ff-43da-4c0f7db90d600;4c5bf43471540\""
        ],
           "accept-ranges" => [
            [0] "bytes"
        ],
          "content-length" => [
            [0] "17370"
        ],
              "connection" => [
            [0] "close"
        ],
            "content-type" => [
            [0] "text/html; charset=UTF-8"
        ]
    },
    attr_reader :http_version = "1.1",
    attr_reader :message = "OK"
>

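So, yes, it's possible to tell, but sometimes you can't get the information you need from a HEAD request.

In the past, the way I've dealt with this was to try HEAD first and, if that didn't give me what I needed, to retrieve the page with a normal GET and take the size from that. This approach helps reduce wasted bandwidth.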
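
The HEAD-first, GET-as-fallback approach might look something like the sketch below. The method name `page_size` is an assumption, and the sketch deliberately does not follow redirects, so it should be pointed at the final URL.

```ruby
require 'net/http'
require 'uri'

# Returns the page size in bytes: the Content-Length from a HEAD request when
# the server provides one, otherwise the size of the body fetched with GET.
def page_size(url)
  uri = URI(url)
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: uri.scheme == 'https') do |http|
    head = http.request_head(uri.request_uri)
    if head.is_a?(Net::HTTPSuccess) && head.content_length
      head.content_length
    else
      # No usable header, so download the page and measure the body itself.
      http.request_get(uri.request_uri).body.bytesize
    end
  end
end
```

For a static page such as the tools.ietf.org example, only the HEAD request should be needed; for a dynamically generated page the full body is downloaded once.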
