Python Web Scraping: Get the main content only

【Question title】: Python Web Scraping Get the main content only
【Posted】: 2017-03-12 22:08:01
【Question description】:

import numpy as np
import json 
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.npr.org/sections/thetwo-way/2017/03/06/518805720/turkey-germany-relations-at-new-low-after-erdogan-makes-nazi-comparison"

html = urlopen(url)
bsObj = BeautifulSoup(html, 'lxml')


def keyInfo(div):
  print(div.find("h1").get_text())
  print(div.find("span", {"class":"date"}).get_text())
  print(div.find("a", {"rel":"author"}).get_text().strip())
  print(div.findAll("p")) # Problem here

keyInfo(bsObj)

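The problem is the last line of def keyInfo: it prints a lot of extra material (tags, headings), and I only want the main text of the article. How can I do that?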

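【Question comments】:

Please revisit how to ask a good question on Stack Overflow so that your question is well received by the community. Also, make sure you are familiar with how to put together a minimal reproducible example. Keep in mind that the help offered here is for specific programming problems within what you are trying to solve. As it stands, this is too broad, because you have not provided enough information for readers to help you effectively.
Edited. Is it clear enough now?

Tags: python text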

【Solution 1】:

This code extracts the content more cleanly for this particular website.

def keyInfo(div):
  print(div.find("h1").get_text())
  # The story body lives in article > div#storytext.
  article = div.find("article")
  divText = article.find("div", id="storytext")
  # Remove nested asides and divs, which hold non-article content.
  for a in divText.find_all("aside"):
    a.extract()
  for d in divText.find_all("div"):
    d.extract()
  print(divText.get_text())

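Approach

After inspecting the page structure with Chrome DevTools, I noticed that the story content sits in article > div[id=storytext], but div[id=storytext] also contains some asides and divs that are not article content. Removing those leaves only the article's paragraphs.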
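
The approach can be tried offline on a small sample. The following is a minimal sketch, not the real NPR page: SAMPLE_HTML is hypothetical markup that only mimics the article > div[id=storytext] layout described above, extract_story_text is an illustrative helper name, and the built-in "html.parser" is used in place of lxml so no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real page (assumption).
SAMPLE_HTML = """
<article>
  <h1>Sample headline</h1>
  <div id="storytext">
    <p>First paragraph of the story.</p>
    <aside>Related: another story</aside>
    <div class="caption"><p>Photo caption text</p></div>
    <p>Second paragraph of the story.</p>
  </div>
</article>
"""


def extract_story_text(html):
    """Return the paragraphs left in div#storytext, one per line."""
    soup = BeautifulSoup(html, "html.parser")
    story = soup.find("div", id="storytext")
    if story is None:
        # Page does not follow the assumed layout.
        return ""
    # Detach nested asides and divs (captions, related links) from the tree.
    for tag in story.find_all(["aside", "div"]):
        tag.extract()
    # Whatever <p> elements remain are treated as the story itself.
    paragraphs = [p.get_text(strip=True) for p in story.find_all("p")]
    return "\n".join(paragraphs)


story_text = extract_story_text(SAMPLE_HTML)
print(story_text)
```

Collecting the remaining p elements one by one, rather than calling get_text() on the whole container, also avoids the stray blank lines left behind by the removed elements.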

Looking for something more general?

If you are looking for something more general, you may want to consider a tool like Boilerpipe. Here is a Python wrapper for Boilerpipe: https://github.com/misja/python-boilerpipe

【Comments】:

Works great. I only changed the last line: print(divText.get_text().replace('\n',""))
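
One caveat with the change in the comment above: deleting every newline can fuse the end of one paragraph to the start of the next. A possible alternative, sketched here with a hypothetical helper name (collapse_whitespace) and made-up sample text, is to replace each run of whitespace with a single space:

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace, newlines included, with one space."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up text shaped like get_text() output (assumption).
raw = "First paragraph.\nSecond paragraph.\n\n  Third paragraph.\n"

joined = raw.replace("\n", "")       # sentences run together
collapsed = collapse_whitespace(raw) # sentences stay separated

print(joined)
print(collapsed)
```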