Extract the text from `p` within `div` with BeautifulSoup (answers)

【Question title】: Extract the text from `p` within `div` with BeautifulSoup
【Posted】: 2016-08-12 08:08:25
【Question】:

I am very new to web scraping with Python, and I am having a hard time extracting nested text from HTML (a p inside a div, to be exact). This is what I have so far:

from bs4 import BeautifulSoup
import urllib

url = urllib.urlopen('http://meinparlament.diepresse.com/')
content = url.read()
soup = BeautifulSoup(content, 'lxml')

This works fine:

links=soup.findAll('a',{'title':'zur Antwort'})
for link in links:
    print(link['href'])

This extraction also works:

table = soup.findAll('div',attrs={"class":"content-question"})
for x in table:
    print(x)

This is the output:

<div class="content-question">
<p>[...] Die Verhandlungen über die mögliche Visabefreiung für    
türkische Staatsbürger per Ende Ju...
<a href="http://meinparlament.diepresse.com/frage/10144/" title="zur 
Antwort">mehr »</a>
</p>
</div>

Now I want to extract the text between <p> and </p>. This is the code I used:

table = soup.findAll('div',attrs={"class":"content-question"})
for x in table:
    print(x['p'])

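However, Python raises a KeyError.

【Question comments】:

Tags: python python-3.x web-scraping beautifulsoup

【Solution 1】:

The following code finds each div with the class "content-question" and prints the text of the first p element inside it.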

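from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen

# Download and parse the page
content = urlopen('http://meinparlament.diepresse.com/').read()
soup = BeautifulSoup(content, 'lxml')

# Collect every <div class="content-question"> on the page
table = soup.find_all('div', attrs={"class": "content-question"})
for x in table:
    # find('p') returns the first <p> inside the div; .text is its text content
    print(x.find('p').text)

# Another way to retrieve the divs:
# table = soup.select('div[class="content-question"]')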

Here is the printed text of the first p element in table:

[...] Die Verhandlungen über die mögliche Visabefreiung für türkische Staatsbürger per Ende Juni sind noch nicht abgeschlossen, sodass nicht mit Sicherheit gesagt werden kann, ob es zu diesem Zeitpunkt bereits zu einer Visabefreiung kommt. Auch die genauen Modalitäten einer solchen Visaliberalisierung sind noch nicht ausverhandelt. Prinzipiell ist es jedoch so, dass Visaerleichterungen bzw. -liberalisierungen eine Frage von Reziprozität sind, d.h. dass diese für beide Staaten gelten müssten. [...]
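
On why `x['p']` in the question raises a KeyError: square brackets on a BeautifulSoup tag look up an HTML attribute (such as `class` or `href`), not a child element, and the div has no attribute named `p`. The sketch below shows the difference on a small inline HTML string. The sample markup is a shortened, hypothetical stand-in for the page, and `html.parser` is used only so the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, modeled on the output shown in the question.
html = """
<div class="content-question">
<p>[...] Die Verhandlungen ...
<a href="http://meinparlament.diepresse.com/frage/10144/" title="zur Antwort">mehr</a>
</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", attrs={"class": "content-question"})

# Square brackets read attributes of the tag itself.
print(div["class"])

# There is no attribute called "p", so this lookup fails with KeyError.
try:
    div["p"]
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)

# Child elements are reached with find() or with a CSS selector.
p = div.find("p")
p_css = soup.select_one("div.content-question p")

# get_text() joins all text inside the <p>, including the link text.
text = p.get_text(" ", strip=True)
print(text)
```

Note that `find('p')` returns `None` when a div contains no p element, so real pages may need a check before reading `.text`.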

【Discussion】:

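This solution assumes that the HTML on the page properly wraps every paragraph in a pair of "p" tags. That is often not the case: sometimes an empty p element is used to split the text, and sometimes there is leading text, followed by a paragraph span, followed by trailing text, where the leading or trailing text is not enclosed in its own paragraph span, and so on. The solution above only returns text enclosed by an opening/closing pair of p tags, not the text between them. Is there any way to get everything?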
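
One possible way to get everything, sketched under assumptions: call `get_text()` (or iterate `stripped_strings`) on the enclosing div rather than on its p children, so text outside any p element is included as well. The HTML below is a made-up example of the situation described in this comment, not markup taken from the actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: loose text around a paragraph, plus an empty <p>.
html = """
<div class="content-question">
Leading text outside any paragraph.
<p>First paragraph.</p>
<p></p>
Trailing text outside any paragraph.
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="content-question")

# Looking only at <p> children misses the loose text.
only_p = [p.get_text(strip=True) for p in div.find_all("p")]
print(only_p)

# Asking the div itself returns every text node beneath it.
all_text = div.get_text(" ", strip=True)
print(all_text)

# stripped_strings yields the same pieces one by one, skipping blank ones.
pieces = list(div.stripped_strings)
print(pieces)
```

This flattens the structure, so if the distinction between paragraphs and loose text matters, iterating over `pieces` is probably the more useful form.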
Hello Philip - I am running this code in Atom on MX-Linux, but unfortunately I get no results. Any idea?