【Posted】: 2021-07-21 11:19:28
【Question】:
I am trying to distinguish plain string text from valid HTML markup when performing a regex operation.
My initial implementation:
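require 'nokogiri'

# process_term is defined elsewhere (not shown here).
def html_parser(body, terms:)
  doc = Nokogiri::HTML(body)
  terms.each do |term|
    doc.xpath('//text()').each do |node|
      # Insert a temporary placeholder, put the rewritten text before it,
      # then drop both the original node and the placeholder.
      dummy = node.add_previous_sibling(Nokogiri::XML::Node.new('dummy', doc))
      dummy.add_previous_sibling(Nokogiri::XML::Text.new(node.to_s.gsub(/\b#{term}\b/, process_term(term)), doc))
      node.remove
      dummy.remove
    end
  end
  # Undo the entity-encoding of angle brackets introduced by serialization.
  doc.at_css('body').children.to_html.gsub('&lt;', '<').gsub('&gt;', '>').gsub('&amp;lt;', '<').gsub('&amp;gt;', '>')
end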
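One detail of the substitution step above: `term` is interpolated into the regex unescaped, so a term containing regex metacharacters would change the pattern. A minimal stdlib-only sketch of that step in isolation follows. The `process_term` body here is a hypothetical stand-in (the real one is not shown in the question), and `replace_term` is an illustrative name, not part of the original code.

```ruby
# Hypothetical stand-in for the question's process_term, which is not shown.
def process_term(term)
  "<mark>#{term}</mark>"
end

# Replace whole-word occurrences of term in text.
# Regexp.escape makes metacharacters in term match literally, and the block
# form of gsub avoids backslash sequences in the replacement being
# interpreted as backreferences.
def replace_term(text, term)
  pattern = /\b#{Regexp.escape(term)}\b/
  text.gsub(pattern) { process_term(term) }
end

puts replace_term('hello world, othello', 'hello')
# prints: <mark>hello</mark> world, othello
```

The word boundaries keep `othello` untouched, and with escaping a term such as `a.b` matches only the literal `a.b`, not `axb`.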
html_parser('hello world', terms: ['hello'])
# After the operation, `doc` automatically wraps the string in a `p` tag, which I do not want.
=> '<p>hello world</p>' # the actual value would differ; the main problem is the wrapping `p` tag.
However, this works fine for input that is already valid HTML markup.
string = '<span>hello world</span>'
html_parser(string, terms: ['hello'])
# works fine
【Discussion】:
Tags: ruby-on-rails ruby nokogiri