How to get the <text> tag from an HTML document using Beautiful Soup

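[Question title]: How to get the <text> tag from an HTML document using Beautiful Soup
[Posted]: 2019-11-07 23:51:06
[Question description]:

How can I get the <text> tag from the HTML document of the Abbott Laboratories 10-K filing using Beautiful Soup?

I want to extract the names of all child tags of the <text></text> tag using the code below: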

from bs4 import BeautifulSoup
import urllib.request
url ='https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm'
htmlpage = urllib.request.urlopen(url)
soup = BeautifulSoup(htmlpage, "html.parser")
all_text = soup.find('text')
all_tags = all_text.contents
all_tags = [x.name for x in all_tags if x.name is not None]
print(all_tags)

But the output I get from the code above is ['html'].

Expected output:
['p','p','p','p','p','p','div','div','font','font', etc......]

[Question comments]:

Tags: python html python-3.x beautifulsoup

[Solution 1]:

You can use a CSS selector (this prints every descendant of the <text> tag):

for child in all_text.select('text *'):
    print(child.name, end=' ')

Prints:

br p font font b p font b br p font b div div ...

Edit: to print only the direct children of the <text> tag, you can use:

from bs4 import BeautifulSoup
import requests

url ='https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm'

htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")

for child in soup.select('text > *'):
    print(child.name, end=' ')

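[Comments]:

@Shijith I updated my code. I am using beautifulsoup4==4.7.1
The problem was with urllib.request.urlopen(url), which I used to fetch the page, and html.parser, which I used to parse it. After switching to requests.get(url) and changing the parser to lxml, it now works. Thanks.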
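
The parser dependence can be reproduced offline. This is a sketch under assumptions: the `sample` string is a tiny stand-in that mimics the <TEXT><HTML>...<BODY> nesting mentioned in this thread, and `direct_children` is a hypothetical helper. With html.parser the nested <html> element stays as the only child of <text>, which matches the ['html'] output in the question. lxml repairs markup differently, so its result is printed rather than assumed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the filing's structure; not the real document.
sample = (
    "<TEXT><HTML><HEAD><TITLE>sample</TITLE></HEAD>"
    "<BODY><P>a</P><DIV>b</DIV></BODY></HTML></TEXT>"
)


def direct_children(markup, parser):
    """Return the names of the direct child tags of <text> for a given parser."""
    soup = BeautifulSoup(markup, parser)
    text_tag = soup.find("text")
    if text_tag is None:
        return []
    # True matches any tag; recursive=False restricts the search to direct children.
    return [child.name for child in text_tag.find_all(True, recursive=False)]


html_parser_children = direct_children(sample, "html.parser")
print("html.parser:", html_parser_children)

try:
    # lxml is an optional third-party parser; its tree may differ from html.parser's.
    print("lxml:", direct_children(sample, "lxml"))
except FeatureNotFound:
    print("lxml is not installed; skipping the comparison")
```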

[Solution 2]:

Replace this part of your code:

all_tags = all_text.contents
all_tags = [x.name for x in all_tags if x.name is not None]
print(all_tags)

with:

all_tags = [x.name for x in all_text.findChildren() if x.name is not None]
print(all_tags)

See the Beautiful Soup documentation on findChildren() for more details.

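[Comments]:

This descends recursively into every child tag of <text>. I only want the names of the direct child tags of <text>, not the names of their children.
@Shijith If you inspect the page source, you will see the DOM structure <TEXT><HTML><HEAD></HEAD><BODY></TEXT>.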
@Shijith You should scrape the children of the body tag.
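
The last comment points at the <body> element. One way to act on it while keeping html.parser is to step down from <text> to <body> and list only its direct children with find_all(True, recursive=False). This is a minimal sketch on a hypothetical fragment (the `sample` markup and the variable names are assumptions); it falls back to <text> itself when no <body> is found.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the structure described in the comments.
sample = (
    "<TEXT><HTML><HEAD><TITLE>sample</TITLE></HEAD>"
    "<BODY><P>one</P><P>two <B>bold</B></P><DIV><FONT>x</FONT></DIV></BODY>"
    "</HTML></TEXT>"
)
soup = BeautifulSoup(sample, "html.parser")

text_tag = soup.find("text")

# Use <body> as the container if it exists, otherwise <text> itself.
container = text_tag.find("body") or text_tag

# recursive=False keeps only direct children, so <b> and <font> are skipped.
child_names = [child.name for child in container.find_all(True, recursive=False)]
print(child_names)
```

On this fragment the result is ['p', 'p', 'div']; the nested <b> and <font> tags are not included.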