How to sanitize a string with nested HTML tags but keep the <em> tag? (Answers)

【Question title】: How to sanitize a string with nested HTML tags but keep the <em> tag?
【Posted】: 2015-01-23 09:52:23
【Question description】:

I am trying to sanitize Solr search results, because they have html tags inside:

ActionController::Base.helpers.sanitize( result_string )

Sanitizing strings that are not highlighted is easy, for example: I know <ul><li>ruby</li> <li>rails</li></ul>.

But when the results are highlighted, I have additional important tags inside - <em> and </em>:

I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>.

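So when I sanitize a string with nested html and highlighting tags, I get pieces of the string with html tags left in. And that is bad :)

How can I sanitize a highlighted string with <em> tags inside to get the correct result (a string with <em> tags only)?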

I found a way, but it is slow and not pretty:

string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> <<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'

['p', 'ul', 'li', 'ol', 'span', 'b', 'br'].each do |tag| 
  string.gsub!( "<<em>#{tag}</em>>",  '' )
  string.gsub!( "</<em>#{tag}</em>>", '' )
end

string = ActionController::Base.helpers.sanitize string, tags: %w(em)

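How can I optimize it, or is there a better solution? For example, writing some regexp that removes the html tags but keeps <em> and </em>.

Please help, thanks.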
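
One possible way to collapse the loop above into a single pass is to build one regular expression from the tag list, so the string is scanned once instead of twice per tag. This is only a minimal sketch, assuming the same tag list as in the loop; HTML_TAGS, HIGHLIGHTED_TAG and strip_highlighted_tags are hypothetical names, and the result would still be passed to sanitize with tags: %w(em) afterwards, as in the question.

```ruby
# Tag names copied from the loop in the question.
HTML_TAGS = %w[p ul li ol span b br].freeze

# Matches an opening or closing tag whose name Solr wrapped in <em>,
# e.g. <<em>ul</em>> or </<em>li</em>>.
HIGHLIGHTED_TAG = %r{</?<em>(?:#{Regexp.union(HTML_TAGS).source})</em>>}

# Hypothetical helper: removes every highlighted tag in one gsub call
# and returns a new string (the input is not modified).
def strip_highlighted_tags(string)
  string.gsub(HIGHLIGHTED_TAG, '')
end

string = 'I <em>know</em> <<em>ul</em>><<em>li</em>><em>ruby</em></<em>li</em>> ' \
         '<<em>li</em>><em>rails</em></<em>li</em>></<em>ul</em>>'

puts strip_highlighted_tags(string)
# prints: I <em>know</em> <em>ruby</em> <em>rails</em>
```

The standalone <em> tags around know, ruby and rails are untouched because the pattern only matches <em> pairs that sit between the angle brackets of another tag.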

【Comments】:

I may not fully understand your question, but I think you should specify what to sanitize and what to keep through the options value: ActionController::Base.helpers.sanitize( result_string, tags: %w(em) )

Tags: ruby regex gsub html-sanitizing

【Solution 1】:

You can call gsub! to discard all tags and keep only the standalone <em> tags, that is, the ones that are not embedded inside an html tag.

result_string.gsub!(/(<\/?[^e][^m]>)|(<<em>\w*<\/em>>)|(<\/<em>\w*<\/em>>)/, '')

will do the trick.

Explanation:

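# first group (<\/?[^e][^m]>)
# find two-character html tags other than <em> or </em>, such as <ul> or </li>
# (tags of other lengths, e.g. <b> or <span>, are not matched by this group)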

# second group (<<em>\w*<\/em>>)
# find all opening tags that have <em> </em> inside of them like:
# <<em>li</em>>   or <<em>ul</em>>

# third group (<\/<em>\w*<\/em>>)
# find all closing tags that have <em> </em> inside of them:
# </<em>li</em>>   or  </<em>ul</em>>

# and gsub replaces all of this with empty string
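
The same idea can be sketched with a negative lookahead, so that plain tags of any length (and tags with attributes) are removed while <em> and </em> survive. This is a hedged variation rather than the pattern from the answer; WRAPPED_TAG, PLAIN_TAG, TAGS_TO_DROP and drop_non_em_tags are hypothetical names, and the tag-name pattern is a simplification that assumes reasonably well-formed markup.

```ruby
# A tag whose name is wrapped in <em>...</em>, e.g. <<em>ul</em>> or </<em>li</em>>.
WRAPPED_TAG = %r{</?<em>\w+</em>>}

# Any ordinary opening or closing tag except <em> / </em>.
# (?!em\b) rejects the name "em" but still allows names such as "embed".
PLAIN_TAG = %r{</?(?!em\b)[a-z][a-z0-9]*\b[^>]*>}i

# Try the wrapped form first, then the plain form.
TAGS_TO_DROP = Regexp.union(WRAPPED_TAG, PLAIN_TAG)

# Hypothetical helper: returns a copy of the string with only <em> tags left.
def drop_non_em_tags(string)
  string.gsub(TAGS_TO_DROP, '')
end

puts drop_non_em_tags('I know <span class="x">ruby</span> and <em>rails</em>')
# prints: I know ruby and <em>rails</em>
```

A regular expression is not an HTML parser, so passing the result through sanitize with tags: %w(em) afterwards is probably still a sensible safety net.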

【Comments】:

【Solution 2】:

With the additional parameter of sanitize, you can specify which tags are allowed.

In your example, try:

ActionController::Base.helpers.sanitize( result_string, tags: %w(em) )

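It should solve the problem.

【Comments】:

It doesn't help, because when I have <em> inside some html tag, the sanitizing does not work well ('p>' for example).

【Solution 3】:

I think you can use sanitize: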

Custom Use (only the mentioned tags and attributes are allowed, nothing else)
<%= sanitize @article.body, tags: %w(table tr td), attributes: %w(id class style) %>

So something like this should work:

sanitize result_string, tags: %w(em)

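【Comments】:

If I sanitize the string I know <b>ruby</b>, I get the result I know ruby. But I need to get the result I know <em>ruby</em>...
So first I need to remove the <em> tags that are inside html tags, and then remove the html tags but keep the remaining <em> tags...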
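
The two steps described in this comment could be sketched roughly as follows. This is an illustration under assumptions, not code from the thread: keep_only_em is a hypothetical helper name, the sample input is modeled on the highlighted string in the question, and step 2 uses a plain regular expression where sanitize with tags: %w(em) could be used instead.

```ruby
# Hypothetical helper implementing the two steps from the comment above.
def keep_only_em(string)
  # Step 1: remove the <em> pair that sits inside a tag, so that
  # <<em>b</em>> becomes <b> and </<em>b</em>> becomes </b>.
  unwrapped = string.gsub(%r{<(/?)<em>(\w+)</em>>}, '<\1\2>')

  # Step 2: remove every html tag except <em> and </em>.
  unwrapped.gsub(%r{</?(?!em\b)[a-z][a-z0-9]*\b[^>]*>}i, '')
end

puts keep_only_em('I know <<em>b</em>><em>ruby</em></<em>b</em>>')
# prints: I know <em>ruby</em>
```

Unwrapping first means the second step only ever sees ordinary tags, which is why a whitelist-based sanitizer should also be able to handle it.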