这是我用来从富文本编辑器清除传入 HTML 的解决方案...它是用 VB.NET 编写的,我没有时间转换为 C#,但它非常简单:
Public Shared Function CleanHtml(ByVal html As String) As String
'' Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
'' Only returns acceptable HTML, and converts line breaks to <br />
'' Acceptable HTML includes HTML-encoded entities.
html = html.Replace("&" & "nbsp;", " ").Trim() ' concat here due to SO formatting
'' Does this have HTML tags?
If html.IndexOf("<") >= 0 Then
'' Make all tags lowercase
html = RegEx.Replace(html, "<[^>]+>", AddressOf LowerTag)
'' Filter out anything except allowed tags
'' Problem: this strips attributes, including href from a
'' http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
Dim AcceptableTags As String = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
html = Regex.Replace(html, WhiteListPattern, "", RegExOptions.Compiled)
'' Make all BR/br tags look the same, and trim them of whitespace before/after
html = RegEx.Replace(html, "\s*<br[^>]*>\s*", "<br />", RegExOptions.Compiled)
End If
'' No CRs
html = html.Replace(controlChars.CR, "")
'' Convert remaining LFs to line breaks
html = html.Replace(controlChars.LF, "<br />")
'' Trim BRs at the end of any string, and spaces on either side
Return RegEx.Replace(html, "(<br />)+$", "", RegExOptions.Compiled).Trim()
End Function
Public Shared Function LowerTag(m As Match) As String
Return m.ToString().ToLower()
End Function
在您的情况下,您需要修改“AcceptableTags”中的“已批准”HTML 标记列表——代码仍将删除所有无用的属性(不幸的是,希望 HREF 和 SRC 等有用的属性这些对你来说并不重要)。
当然,这需要访问服务器。如果您不希望这样,则需要在工具栏上添加某种“清理”按钮,该按钮调用 JavaScript 以弄乱编辑器的当前文本。不幸的是,“粘贴”不是一个可以被捕获以自动清理标记的事件,并且在每次 OnChange 之后进行清理会导致编辑器无法使用(因为更改标记会更改文本光标位置)。