Remove boilerplate content from HTML page: answers

【Question title】: Remove boilerplate content from HTML page
【Posted】: 2015-08-29 06:54:46
【Question description】:

I want to use the jusText implementation found at https://github.com/miso-belica/jusText to get clean content from HTML pages. Basically it works like this:

import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    # Skip navigation, footers and other boilerplate; keep the real content.
    if not paragraph.is_boilerplate:
        print(paragraph.text)

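I have already downloaded the pages I want to parse with this tool (some of them are no longer available online) and extracted their HTML content. Since jusText seems to work only on the output of requests (a Response object), I would like to know whether there is a way to set the content of a Response object to the HTML text I want to parse.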

【Question comments】:

Tags: python request response htmlcleaner

【Solution 1】:

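response.content is an ordinary string, <type 'str'> (this is Python 2; under Python 3 it is bytes), so jusText does not need a Response object at all: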

>>> from requests import get
>>> r = get("http://www.google.com/")
>>> type(r.content)
<type 'str'>

So just pass your own HTML string directly:

justext.justext(my_html_string, justext.get_stoplist("English"))
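
For pages that were saved to disk, the same call can be fed from a file instead of an HTTP response. The following is a minimal sketch, assuming Python 3 and an installed justext package; the helper names read_saved_page, content_paragraphs and clean_text are hypothetical and not part of jusText, and "saved_page.html" in the note below is a placeholder file name.

```python
from pathlib import Path


def read_saved_page(path):
    """Return the raw bytes of an HTML file saved on disk."""
    return Path(path).read_bytes()


def content_paragraphs(paragraphs):
    """Return the text of every paragraph not flagged as boilerplate."""
    return [p.text for p in paragraphs if not p.is_boilerplate]


def clean_text(path, language="English"):
    """Run jusText on a saved page; no HTTP request is involved."""
    # Third-party import kept local so the two helpers above
    # can be used (and tested) without justext installed.
    import justext

    html = read_saved_page(path)
    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return content_paragraphs(paragraphs)
```

Calling clean_text("saved_page.html") would then return a list of the non-boilerplate paragraph texts. Reading the file as bytes mirrors what response.content provides in the question's snippet, which leaves encoding detection to jusText.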

【Comments】: