With httpclient, is there a way to get the character set of the page with a HEAD request?

【Question title】: With httpclient, is there a way to get the character set of the page with a HEAD request?
【Posted】: 2010-07-09 21:37:06
【Question】:

I am using the httpclient library to perform a basic HEAD request. I am curious how I would get the character set that Apache returns, e.g. utf-8, iso-8859-1, etc. Thanks!

  HttpParams httpParams = new BasicHttpParams();
  HttpConnectionParams.setConnectionTimeout(httpParams, 2000);
  HttpConnectionParams.setSoTimeout(httpParams, 2000);

  DefaultHttpClient httpclient = new DefaultHttpClient(httpParams);
  httpclient.getParams().setParameter("http.useragent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)");

  HttpContext localContext = new BasicHttpContext();
  // Issue a HEAD request so that only the status line and headers come back
  HttpHead httpHead = new HttpHead(url);

  HttpResponse response = httpclient.execute(httpHead, localContext);

  this.sparrowResult.statusCode = response.getStatusLine().getStatusCode();

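Update with the working result:

Header contentType = response.getFirstHeader("Content-Type");
// getValue() returns the whole header value, e.g. "text/html; charset=utf-8"
String charset = contentType.getValue();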

【Comments】:

Tags: java httpclient

【Solution 1】:

If you are using HttpClient 4.2:

import java.nio.charset.Charset;
import org.apache.http.entity.ContentType;

ContentType contentType = ContentType.getOrDefault(entity);
Charset charSet = contentType.getCharset();

【Comments】:

【Solution 2】:

If you are using HttpClient 4.1 (the latest at the time of this answer):

import org.apache.http.protocol.HTTP;
import org.apache.http.util.EntityUtils;

String charset = EntityUtils.getContentCharSet(entity);
if (charset == null) {
    charset = HTTP.DEFAULT_CONTENT_CHARSET;
}

【Comments】:

FYI: getContentCharSet is now deprecated.

【Solution 3】:

In HTTP 1.1, the character set is carried in the Content-Type header:

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8

So it should be buried in

HttpResponse.Headers

So this should work:

HttpResponse.Headers.["Content-Type"]

**I have not tested this, but you get the idea.
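
The idea in this answer can be sketched in plain Java: take the raw Content-Type header value and pull out its charset parameter. This is only a minimal sketch using the standard library; the class name `ContentTypeCharset` and the method `fromHeaderValue` are hypothetical names chosen for illustration, not part of HttpClient.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
class ContentTypeCharset {
    // Matches the charset parameter, quoted or unquoted,
    // e.g. "; charset=utf-8" or ";charset=\"utf-8\"".
    private static final Pattern CHARSET_PARAM =
            Pattern.compile(";\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if it is absent or unknown. */
    static Charset fromHeaderValue(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(headerValue);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // The server declared a name this JVM cannot use.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromHeaderValue("text/plain; charset=utf-8")); // prints UTF-8
        System.out.println(fromHeaderValue("text/html"));                 // prints null
    }
}
```

With the code from the question, the input would be `response.getFirstHeader("Content-Type").getValue()`. Note that `getFirstHeader` returns null when the server sends no Content-Type header at all, so that case needs a check before calling `getValue()`.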

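【Comments】:

【Solution 4】:

In some cases the server does not give you the character set in the headers but writes it into the content instead, as with this URL: http://seniv.dlmostil.ru/jacket/p/kupit-sportivnie-bryki-adidas-s-dostavkoy/

When you do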

ContentType contentType = ContentType.getOrDefault(entity); 
Charset charSet = contentType.getCharset();

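then charSet is null.

In that case I read the stream and try to extract the charset from the HTML code with a regular expression. So once you have read the content of the input stream into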

ByteArrayOutputStream out = new ByteArrayOutputStream();

you can do this:

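// Decode with the platform default charset, only to search the markup for a charset declaration
String help = new String(out.toByteArray());
Pattern charSet = Pattern.compile("charset\\s*=\\s*\"?(.*?)[\";\\>]", Pattern.CASE_INSENSITIVE);
Matcher m = charSet.matcher(help);
// Fall back to UTF-8 when the page declares no charset
String encoding = m.find() ? m.group(1).trim() : "UTF-8";
// Also fall back when the declared name is not a charset this JVM lists
if (Charset.availableCharsets().get(encoding) == null) encoding = Charsets.UTF_8.toString();
String html = new String(out.toByteArray(), encoding);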

I hope you get the idea: this is the last resort, for when nothing else works.

【Comments】:
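
A variation on the same last-resort idea, as a rough sketch using only the standard library. It decodes the first few kilobytes as ISO-8859-1 (which never fails, since every byte maps to a character), looks for a charset declaration, and accepts aliases through `Charset.isSupported`. The class name `MetaCharsetSniffer`, its methods, and the 4096-byte limit are assumptions made for illustration, and the approach will not detect pages encoded in UTF-16 or other encodings that are not ASCII-compatible.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the regex-based fallback.
class MetaCharsetSniffer {
    // Captures only characters that are legal in a charset name, starting
    // with a letter or digit, so Charset.isSupported never sees an illegal name.
    private static final Pattern META_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9._:+-]*)",
                    Pattern.CASE_INSENSITIVE);

    /** Guesses the charset declared inside the page body, defaulting to UTF-8. */
    static Charset sniff(byte[] body) {
        // Only inspect the start of the document, where the declaration normally sits.
        int limit = Math.min(body.length, 4096);
        String head = new String(body, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes the whole body with the sniffed charset. */
    static String decode(byte[] body) {
        return new String(body, sniff(body));
    }
}
```

Here `body` would be `out.toByteArray()` from the snippet above. Checking the Content-Type header first and sniffing the body only when the header names no charset keeps this as a fallback rather than the primary path.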