Is there a way to strip all unnecessary MS Word formatting from FCKEditor? (Answers)

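【Question title】: Is there a way to strip all unnecessary MS Word formatting from FCKEditor?
【Posted】: 2010-11-23 21:42:57
【Question description】:

I have installed fckeditor, and when pasting from MS Word it adds a lot of unnecessary formatting. I want to keep certain things, like bold, italics, and so forth. I have searched the web and found solutions that strip everything away, even the things I want to keep, like bold and italics. Is there a way to strip only the unnecessary Word formatting?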

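【Question comments】:

Anyone who has maintained a CMS knows the evil you're talking about. Good luck finding an answer. We just let them paste from Word, and I have a routine that strips the undisplayable characters from the database.

Tags: c# asp.net javascript fckeditor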

【Solution 1】:

In case anyone wants a C# version of the accepted answer:

// Requires: using System.Text.RegularExpressions;
public string CleanHtml(string html)
    {
        //Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
        // Only returns acceptable HTML, and converts line breaks to <br />
        // Acceptable HTML includes HTML-encoded entities.

        html = html.Replace("&" + "nbsp;", " ").Trim(); //concat here due to SO formatting
        // Does this have HTML tags?

        if (html.IndexOf("<") >= 0)
        {
            // Make all tags lowercase
            html = Regex.Replace(html, "<[^>]+>", delegate(Match m){
                return m.ToString().ToLower();
            });
            // Filter out anything except allowed tags
            // Problem: this strips attributes, including href from a
            // http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
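            string AcceptableTags = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote";
            // The \b makes the lookahead match whole tag names only; without it, any tag that merely
            // starts with an allowed name (e.g. <body>, <iframe>, <input>) would slip through.
            string WhiteListPattern = "</?(?(?=(?:" + AcceptableTags + @")\b)notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";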
            html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled);
            // Make all BR/br tags look the same, and trim them of whitespace before/after
            html = Regex.Replace(html, @"\s*<br[^>]*>\s*", "<br />", RegexOptions.Compiled);
        }


        // No CRs
        html = html.Replace("\r", "");
        // Convert remaining LFs to line breaks
        html = html.Replace("\n", "<br />");
        // Trim BRs at the end of any string, and spaces on either side
        return Regex.Replace(html, "(<br />)+$", "", RegexOptions.Compiled).Trim();
    }

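【Comments】:

【Solution 2】:

Here is the solution I use to scrub incoming HTML from rich text editors. It's written in VB.NET and I don't have time to convert it to C#, but it's pretty straightforward: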

 Public Shared Function CleanHtml(ByVal html As String) As String
     '' Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
     '' Only returns acceptable HTML, and converts line breaks to <br />
     '' Acceptable HTML includes HTML-encoded entities.
     html = html.Replace("&" & "nbsp;", " ").Trim() ' concat here due to SO formatting
     '' Does this have HTML tags?
     If html.IndexOf("<") >= 0 Then
         '' Make all tags lowercase
         html = RegEx.Replace(html, "<[^>]+>", AddressOf LowerTag)
         '' Filter out anything except allowed tags
         '' Problem: this strips attributes, including href from a
         '' http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
         Dim AcceptableTags      As String   = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
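         '' The \b makes the lookahead match whole tag names only (otherwise <body>, <iframe>, etc. slip through)
         Dim WhiteListPattern    As String   = "</?(?(?=(?:" & AcceptableTags & ")\b)notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"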
         html = Regex.Replace(html, WhiteListPattern, "", RegExOptions.Compiled)
         '' Make all BR/br tags look the same, and trim them of whitespace before/after
         html = RegEx.Replace(html, "\s*<br[^>]*>\s*", "<br />", RegExOptions.Compiled)
     End If
     '' No CRs
     html = html.Replace(controlChars.CR, "")
     '' Convert remaining LFs to line breaks
     html = html.Replace(controlChars.LF, "<br />")
     '' Trim BRs at the end of any string, and spaces on either side
     Return RegEx.Replace(html, "(<br />)+$", "", RegExOptions.Compiled).Trim()
 End Function

 Public Shared Function LowerTag(m As Match) As String
   Return m.ToString().ToLower()
 End Function

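In your case, you'll want to modify the list of "approved" HTML tags in "AcceptableTags". The code will still strip all the useless attributes (and, unfortunately, the useful ones like HREF and SRC; hopefully those aren't important to you).

Of course, this requires a trip to the server. If you don't want that, you'll need to add some sort of "clean up" button to the toolbar that calls JavaScript to rework the editor's current text. Unfortunately, "paste" is not an event that can be trapped to clean up the markup automatically, and cleaning after every OnChange would make the editor unusable (because changing the markup changes the text cursor position).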
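As a rough illustration of what such a "clean up" button could call, here is a minimal JavaScript sketch. The function name `cleanPastedHtml`, the tag list, and the per-tag attribute list are all assumptions made for this example; none of them are part of FCKeditor. Unlike the server-side code above, it keeps `href` on links and `src`/`alt` on images. Wiring it to the toolbar is editor-specific and is not shown.

```javascript
// Hypothetical helper: the names and lists below are illustrative only.
const ALLOWED_TAGS = new Set([
  "i", "b", "u", "sup", "sub", "ol", "ul", "li", "br",
  "h2", "h3", "h4", "h5", "p", "a", "img", "blockquote",
]);

// Attributes kept per tag; every other attribute is dropped.
const KEPT_ATTRIBUTES = { a: ["href"], img: ["src", "alt"] };

function cleanPastedHtml(html) {
  return html
    // Remove HTML comments (Word's conditional comments) and <style> blocks.
    .replace(/<!--[\s\S]*?-->/g, "")
    .replace(/<style\b[\s\S]*?<\/style\s*>/gi, "")
    // Remove bare conditional markers such as <![if !supportLists]> and <![endif]>.
    .replace(/<!\[[\s\S]*?\]>/g, "")
    // Visit each remaining tag and decide whether to keep it.
    .replace(/<(\/?)([a-z][a-z0-9:]*)\b([^>]*)>/gi, (tag, slash, name, attrs) => {
      const lower = name.toLowerCase();
      if (!ALLOWED_TAGS.has(lower)) return ""; // e.g. <span>, <font>, <o:p>
      if (slash) return `</${lower}>`;
      let kept = "";
      for (const attr of KEPT_ATTRIBUTES[lower] || []) {
        // This sketch only recognizes single- or double-quoted values.
        const m = new RegExp(`\\s${attr}\\s*=\\s*("[^"]*"|'[^']*')`, "i").exec(attrs);
        if (m) kept += ` ${attr}=${m[1]}`;
      }
      return `<${lower}${kept}>`;
    });
}
```

Because the function only transforms a string, it can be run once when the user clicks the button rather than on every change, which avoids the cursor-position problem described above. It is a formatting cleanup, not a security filter: it does not inspect `href` values, so untrusted input would still need to be sanitized on the server.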

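【Comments】:

Wow, this is awesome. But I do need links and basic HTML tags.

【Solution 3】:

I tried the accepted solution, but it did not clean up the tags generated by Word.

But this code worked for me: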

// Requires: using System.Collections.Specialized; using System.Text.RegularExpressions;
static string CleanWordHtml(string html)
{
    StringCollection sc = new StringCollection();
    // get rid of unnecessary tag spans (comments and title)
    sc.Add(@"<!--(\w|\W)+?-->");
    sc.Add(@"<title>(\w|\W)+?</title>");
    // Get rid of classes and styles
    sc.Add(@"\s?class=\w+");
    sc.Add(@"\s+style='[^']+'");
    // Get rid of unnecessary tags
    sc.Add(
        @"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>");
    // Get rid of empty paragraph tags
    sc.Add(@"(<[^>]+>)+&nbsp;(</\w+>)+");
    // remove bizarre v: element attached to <img> tag
    sc.Add(@"\s+v:\w+=""[^""]+""");
    // remove extra lines
    sc.Add(@"(\n\r){2,}");
    // Apply each pattern in order, case-insensitively
    foreach (string s in sc)
    {
        html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
    }
    return html;
}

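【Comments】:

【Solution 4】:

I know this problem all too well. When you copy out of MS Word (or any word processor or rich-text-aware text area) and paste into FCKEditor (the same problem occurs with TinyMCE), the original markup is included in the clipboard contents and gets processed. That markup is not always complementary to the markup embedded in the target of the paste operation.

Short of becoming a contributor to FCKEditor, studying the code, and modifying it, I don't know of a solution. What I usually do is instruct users to perform a two-stage clipboard operation:

Copy from MS Word
Paste into Notepad
Select all
Copy from Notepad
Paste into FCKEditor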

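【Comments】:

【Solution 5】:

But as its name and website imply, fckeditor is a text editor. To me, that means it only displays the characters in a file.

You can't have bold and italic formatting without some extra characters.

Edit: Ah, I see. Looking more closely at the FCKeditor website, it is an HTML editor, not one of the simple text editors I'm used to.

"Paste from Word cleanup with autodetection" is listed as a feature.

【Comments】:

pavium, fckeditor is a rich text editor that abstracts away all the hassle of working with an editable DIV and adds a nice toolbar. Behind the scenes it stores HTML, which means that when someone pastes from Word, Word hands it all manner of HTML evils.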

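【Solution 6】:

For my solution, I use a combination of the C# version of the CleanHtml function and the part that cleans out MS Office tags. It is essentially a code-based version of Glenn's process. I'll see what happens when I push a huge Excel spreadsheet through it.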
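The combination described here amounts to running two passes in sequence: an Office-specific pass first, then a tag whitelist. The sketch below shows that ordering in JavaScript, on the assumption that the same structure carries over to the C# routines above. Both function names are hypothetical stand-ins, and the pattern lists are abridged from Solutions 1 and 3 rather than complete ports.

```javascript
// Hypothetical stand-in for the Word-specific pass (patterns abridged from Solution 3).
function stripWordMarkup(html) {
  const patterns = [
    /<!--[\s\S]+?-->/g,              // comments, including conditional comments
    /<title>[\s\S]+?<\/title>/gi,    // document title
    /\s?class=\w+/gi,                // unquoted class attributes
    /\s+style='[^']+'/gi,            // single-quoted style attributes
    /<(meta|link|\/?o:|\/?style|\/?div|\/?st\d|\/?head|\/?html|\/?body|\/?span|!\[)[^>]*?>/gi,
  ];
  // Apply each pattern in order, removing every match.
  return patterns.reduce((text, re) => text.replace(re, ""), html);
}

// Hypothetical stand-in for the whitelist pass: drop any tag whose name is not listed.
function stripUnlistedTags(html) {
  const allowed = /^(i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|p|a|img|blockquote)$/i;
  return html.replace(/<\/?([a-z][a-z0-9]*)\b[^>]*>/gi,
    (tag, name) => (allowed.test(name) ? tag : ""));
}

// Office-specific cleanup first, then the general whitelist.
function cleanOfficeHtml(html) {
  return stripUnlistedTags(stripWordMarkup(html)).trim();
}
```

Running the Word pass first matters because namespaced tags such as `<o:p>` do not look like ordinary tag names, so a whitelist pattern written for plain names can leave them behind, which matches what Solution 3 reports.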

【Comments】: