How to extract the contents of HTML tags that are plain <p>: answers

[Question title]: How do I extract the contents of HTML tags that are only <p>?
[Posted]: 2020-09-12 19:52:01
[Question description]:

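I am new to web scraping and I am using BeautifulSoup for it, but I only want to extract the contents of bare "p" tags. If a tag carries anything else (a class, a style, and so on), I want to ignore it.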

Example:

<p>what I want to extract</p>

<p class="copy">what I do not want to extract from HTML page</p>

So far, this code is all I have, and it extracts every "p" tag:

from bs4 import BeautifulSoup as BS
import requests

URL = input("Enter url to scrape: ")
content = requests.get(URL)
soup = BS(content.text, 'html.parser')
content_p = soup.find_all('p')
print(content_p)

[Question comments]:

Tags: python html web-scraping beautifulsoup

[Solution 1]:

You can try

soup.find_all(lambda tag: tag.name == 'p' and not tag.attrs)

Reference - https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#The%20basic%20find%20method:%20findAll(name,%20attrs,%20recursive,%20text,%20limit,%20**kwargs)
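
To make the filter easier to try out, here is a minimal self-contained sketch. It assumes bs4 is installed. The sample markup and the helper name is_bare_p are made up for illustration and are not part of the original answer. Comparing the tag name to "p" directly keeps other one-letter tags such as <a> or <b> out of the result.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html = """
<p>what I want to extract</p>
<p class="copy">what I do not want to extract from HTML page</p>
<b>bold text, not a paragraph</b>
<p>another plain paragraph</p>
"""

soup = BeautifulSoup(html, "html.parser")


def is_bare_p(tag):
    """Return True only for <p> tags that carry no attributes at all."""
    return tag.name == "p" and not tag.attrs


# find_all accepts a function and keeps the tags for which it returns True.
bare_paragraphs = soup.find_all(is_bare_p)
texts = [p.get_text(strip=True) for p in bare_paragraphs]
print(texts)
```

The list texts then holds only the text of the two attribute-free paragraphs. The paragraph with class="copy" and the <b> tag are skipped.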

[Comments]:

Thank you very much! Is there a way to remove all the unwanted characters, like '\u2060' and '\xa0'?
You can use something like this - badString = "FooBar
Baz" BeautifulSoup(badString)
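
For the two characters mentioned in the comment above ('\xa0' is a non-breaking space, '\u2060' is the invisible word joiner), one possible approach is to clean the extracted text with plain string translation. This is only a sketch using the standard library. The names CLEANUP_TABLE and clean_text and the sample string are assumptions made for illustration, not something from the thread.

```python
# Map the non-breaking space to a regular space and delete the word joiner.
# Extend this table if other unwanted characters show up in the scraped text.
CLEANUP_TABLE = str.maketrans({"\xa0": " ", "\u2060": None})


def clean_text(text):
    """Return text with '\xa0' replaced by a space and '\u2060' removed."""
    return text.translate(CLEANUP_TABLE)


# Hypothetical sample string containing both characters.
sample = "Foo\xa0Bar\u2060Baz"
cleaned = clean_text(sample)
print(cleaned)  # Foo BarBaz
```

The same function could be applied to each string returned by get_text() once the paragraphs have been selected.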