How can I filter HTML tags in Python? Answers

【Question title】: How can I filter HTML tags in Python?
【Posted】: 2018-07-20 21:13:45
【Question】:

Here is the HTML:

<section class=\"xmt-style-block\" data-id=\"330057\" data-style-type=\"5\" data-tools=\"3434\">
abc cba abc cba
<p style="margin: 0px;padding: 0px;box-sizing: border-box;">
<br/> pp pp</p></section>
<section class=\"xmt-style-block\" data-id=\"330057\" data-style-type=\"5\" data-tools=\"3434\">abc cba abc cba<p style="margin: 0px;padding: 0px;box-sizing: border-box;"><br/> pp pp</p></section>

I want to filter out the "class", "data-id", "data-style-type" and "data-tools" attributes,

so that only <section>abc cba abc cba <p> pp pp</p></section><section>abc cba abc cba <p> pp pp</p></section> remains.

How can I do this in Python? Thanks!

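【Comments】:

HTML is not code. It is a markup language.
What have you tried, and where are you stuck? Find an HTML/XML parser and go from there.
Why was the duplicate of this question deleted?
Sorry. I have updated my question. Could you help again?

Tags: python html filter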

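【Solution 1】:

Taking the data in the question literally, this solution is not elegant, but it does work.

I put your HTML in a file named sample_html.html, then created a Python script named filter_section.py in the same folder. Here is the code for that script: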

# Get the name of the file you want to filter
fname = input("Enter file name: ")
if len(fname) < 1:
    fname = "sample_html.html"

# Open the input file for reading and the new file for writing;
# the with block closes both files automatically
with open(fname) as f_handle, open("new_html.html", "w") as f_new:
    # Loop through each line in the file
    for line in f_handle:
        # If the line doesn't start with what you want to replace, write it unchanged and continue
        if not line.startswith("<section class"):
            f_new.write(line)
            continue
        # You want to replace this line, so split it on whitespace into a list of strings
        strings = line.split()
        # Set the new value for the first string in the list
        strings[0] = "<section>"
        # Delete the strings you don't want (the four attributes being filtered out).
        # This assumes whitespace follows the last attribute, as on the first sample line;
        # otherwise the text attached to the last attribute is removed too.
        del strings[1:5]
        # Join the strings into a new line; split() dropped the newline, so add it back
        new_line = " ".join(strings)
        f_new.write(new_line + "\n")

The output in new_html.html will not contain these attributes on the section elements.
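For the case where the section content sits on the same line as the opening tag (as in the second sample line of the question), a regular-expression variant of the same string-based idea might look like the sketch below. It is not the answer's script: the function name strip_section_attrs_regex and the UNWANTED tuple are hypothetical, and it assumes the attribute values are double-quoted (optionally with the backslash-escaped quotes shown in the question) and contain no ">" character. A real HTML parser is generally more robust than this.

```python
import re

# Attributes named in the question (hypothetical constant, adjust as needed).
UNWANTED = ("class", "data-id", "data-style-type", "data-tools")

# One unwanted attribute plus its leading whitespace, e.g. ' data-id="330057"'.
# The optional backslash tolerates the \" form that appears in the question.
ATTR_RE = re.compile(
    r'\s+(?:%s)=\\?"[^"]*"' % "|".join(re.escape(name) for name in UNWANTED)
)

# An opening <section ...> tag (assumes no '>' inside attribute values).
SECTION_RE = re.compile(r"<section\b[^>]*>")


def strip_section_attrs_regex(html):
    """Remove the unwanted attributes from every opening <section> tag."""
    return SECTION_RE.sub(lambda m: ATTR_RE.sub("", m.group(0)), html)


sample = (
    '<section class="xmt-style-block" data-id="330057" '
    'data-style-type="5" data-tools="3434">abc cba abc cba'
    '<p style="margin: 0px;"><br/> pp pp</p></section>'
)
print(strip_section_attrs_regex(sample))
# → <section>abc cba abc cba<p style="margin: 0px;"><br/> pp pp</p></section>
```

Only the section tags are touched here; the style attribute on the p element is left alone because the question lists just the four section attributes.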

【Comments】:

【Solution 2】:

>>> from bs4 import BeautifulSoup
>>> html = '<section class=\"xmt-style-block\" data-id=\"330057\" data-style-type=\"5\" data-tools=\"3434\">abc cba abc cba</section>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> section = soup.find_all('section')[0]
>>> del section['class'], section['data-id'], section['data-tools'], section['data-style-type']
>>> str(section)
'<section>abc cba abc cba</section>'

You can adjust soup.find_all('section')[0] to look up a section by id instead, or search/loop through the results.
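Looping over every section could be sketched roughly as follows. The function name strip_section_attrs and the UNWANTED tuple are hypothetical, and the sketch uses the standard-library 'html.parser' backend rather than 'lxml' (an assumption made so that no extra parser is needed and the fragment is not wrapped in html/body tags).

```python
from bs4 import BeautifulSoup

# Attributes named in the question (hypothetical constant, adjust as needed).
UNWANTED = ("class", "data-id", "data-style-type", "data-tools")


def strip_section_attrs(html, unwanted=UNWANTED):
    """Return html with the unwanted attributes removed from every <section>."""
    # 'html.parser' is an assumption; the answer above uses 'lxml'.
    soup = BeautifulSoup(html, "html.parser")
    for section in soup.find_all("section"):
        for name in unwanted:
            # Tag.attrs is a plain dict, so pop() skips attributes that are absent.
            section.attrs.pop(name, None)
    return str(soup)


html = (
    '<section class="xmt-style-block" data-id="330057">abc cba</section>'
    '<section data-style-type="5" data-tools="3434">abc cba</section>'
)
print(strip_section_attrs(html))
```

Each section is handled the same way as the single section in the session above, so any number of sections in the input is covered.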

【Comments】:

Sorry. I have updated my question. Could you help again?
@user3751111 You can solve that easily: just loop over soup.find_all('section') and do the same thing each time.