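HTML truncator in Java: answers

【Question title】: HTML truncator in Java
【Posted】: 2011-01-30 13:58:25
【Question】:

Is there any utility (or sample source code) that truncates HTML in Java (for previews)? I want to do the truncation on the server, not on the client.

I'm using HTMLUnit to parse the HTML.

Update:
I want to be able to preview the HTML, so the truncator should preserve the HTML structure while stripping elements after the desired output length.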

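【Comments】:

Maybe you could explain what you want a "truncator" to do to your HTML (besides "make it shorter" :)). What specific functionality are you looking for?
Added a comment under UPDATE in the main post.
I still have a hard time understanding/imagining what you mean. Aren't you saying that you just want to display html.substring(0, someMaxLength); and still end up with valid markup?
@BalusC - a Java utility that won't break HTML tags or entities.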

【Solution 1】:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleHtmlTruncator {

    public static String truncateHtmlWords(String text, int max_length) {
        String input = text.trim();
        if (max_length > input.length()) {
            return input;
        }
        if (max_length < 0) {
            return "";
        }
        StringBuilder output = new StringBuilder();
        // A tag, allowing '>' inside quoted attribute values.
        String HTML_TAG_PATTERN = "<(\"[^\"]*\"|'[^']*'|[^'\">])*>";
        // Tokenize into tags (group 1) and runs of text. Matching any run of non-'<'
        // characters keeps punctuation, which a \s*\w*\s* word pattern silently drops.
        Pattern pattern_overall = Pattern.compile("(" + HTML_TAG_PATTERN + ")|[^<]+");
        int characters = 0;
        Matcher all = pattern_overall.matcher(input);
        while (all.find()) {
            String matched = all.group();
            if (all.group(1) != null) {
                // Tags are copied through uncounted so the structure stays intact.
                output.append(matched);
            } else {
                // Text counts against the budget; once it is spent, text is skipped.
                if (characters < max_length) {
                    String word = matched;
                    if (characters + word.length() < max_length) {
                        output.append(word);
                    } else {
                        output.append(word.substring(0,
                                (max_length - characters) > word.length()
                                ? word.length() : (max_length - characters)));
                    }
                    characters += word.length();
                }
            }
        }
        return output.toString();
    }

    public static void main(String[] args) {
        String text = SimpleHtmlTruncator.truncateHtmlWords("<html><body><br/><p>abc</p><p>defghij</p><p>ghi</p></body></html>", 4);
        System.out.println(text);
    }
}

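【Comments】:

Could you explain why and how this answers the question?
In this snippet the content is truncated while the HTML structure is preserved, so it answers the question. The function's second argument is the maximum character length.

【Solution 2】:

I found this blog post: dencat: Truncating HTML in Java

It contains a Java port of the Python Django template function truncate_html_words.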
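
For readers who cannot reach the blog post, here is a minimal sketch of the word-based idea: count words outside tags, cut after the Nth word, and close whatever tags are still open. This is not the blog's code or Django's implementation. The class name HtmlWordTruncator, the method truncateWords, the " ..." suffix and the list of void tags are all assumptions made for illustration, and comments, scripts and malformed markup are not handled. It needs Java 9+ for Set.of.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class HtmlWordTruncator {
    // Elements that never take a closing tag (assumed list, not exhaustive).
    private static final Set<String> VOID_TAGS = Set.of(
            "area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param");

    // Group 1: '/' of a closing tag; group 2: tag name; group 3: '/' of a
    // self-closing tag; group 4: a word (run without whitespace or '<').
    private static final Pattern TOKEN =
            Pattern.compile("<(/)?([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/)?>|([^<\\s]+)");

    static String truncateWords(String html, int maxWords) {
        if (maxWords <= 0) {
            return "";
        }
        Deque<String> open = new ArrayDeque<>(); // innermost open tag first
        Matcher m = TOKEN.matcher(html);
        int words = 0;
        int cut = -1; // index just past the last word that is kept
        while (m.find()) {
            if (m.group(4) != null) {
                if (cut >= 0) {
                    // Another word exists past the limit, so truncate here.
                    StringBuilder out = new StringBuilder(html.substring(0, cut)).append(" ...");
                    for (String tag : open) {
                        out.append("</").append(tag).append('>');
                    }
                    return out.toString();
                }
                if (++words == maxWords) {
                    cut = m.end();
                }
            } else if (cut < 0) {
                // Track tags only up to the cut point.
                String name = m.group(2).toLowerCase();
                if (m.group(1) != null) {
                    open.removeFirstOccurrence(name);
                } else if (m.group(3) == null && !VOID_TAGS.contains(name)) {
                    open.push(name);
                }
            }
        }
        return html; // maxWords words or fewer: nothing to cut
    }
}
```

With this sketch, HtmlWordTruncator.truncateWords("<p>one <b>two three</b> four</p>", 2) returns "<p>one <b>two ...</b></p>", and input with no more than maxWords words comes back unchanged.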

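【Comments】:

【Solution 3】:

I have written another Java version of truncateHTML. This function truncates a string to a given number of characters while preserving whole words and HTML tags.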

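// Place this method inside a class; it needs these imports:
// import java.util.ArrayList;
// import java.util.List;
// import java.util.regex.Matcher;
// import java.util.regex.Pattern;
public static String truncateHTML(String text, int length, String suffix) {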
    // if the plain text is shorter than the maximum length, return the whole text
    if (text.replaceAll("<.*?>", "").length() <= length) {
        return text;
    }
    String result = "";
    boolean trimmed = false;
    if (suffix == null) {
        suffix = "...";
    }

    /*
     * This pattern creates tokens, where each line starts with the tag.
     * For example, "One, <b>Two</b>, Three" produces the following:
     *     One,
     *     <b>Two
     *     </b>, Three
     */
    Pattern tagPattern = Pattern.compile("(<.+?>)?([^<>]*)");

    /*
     * Checks for an empty tag, for example img, br, etc.
     */
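    // \b stops "col" from matching <colgroup> and "frame" from matching <frameset>
    Pattern emptyTagPattern = Pattern.compile("^<\\s*(img|br|input|hr|area|base|basefont|col|frame|isindex|link|meta|param)\\b.*>$", Pattern.CASE_INSENSITIVE);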

    /*
     * Modified the pattern to also include H1-H6 tags
     * Checks for closing tags, allowing leading and ending space inside the brackets
     */
    Pattern closingTagPattern = Pattern.compile("^<\\s*/\\s*([a-zA-Z]+[1-6]?)\\s*>$");

    /*
     * Modified the pattern to also include H1-H6 tags
     * Checks for opening tags, allowing leading and ending space inside the brackets
     */
    Pattern openingTagPattern = Pattern.compile("^<\\s*([a-zA-Z]+[1-6]?).*?>$");

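    /*
     * Finds entities: named (&nbsp;), decimal (&#160;) and hexadecimal (&#xA0;).
     * The hex branch needs its "&#x" prefix; without it, plain text such as "abc;" is counted as an entity.
     */
    Pattern entityPattern = Pattern.compile("(&[0-9a-zA-Z]{2,8};|&#[0-9]{1,7};|&#[xX][0-9a-fA-F]{1,6};)");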

    // tokenize the input: each match is an optional tag followed by the text after it
    Matcher tagMatcher = tagPattern.matcher(text);

    int totalLength = suffix.length();
    List<String> openTags = new ArrayList<String>();

    boolean proposingChop = false;
    while (tagMatcher.find()) {
        String tagText = tagMatcher.group(1);
        String plainText = tagMatcher.group(2);

        if (proposingChop &&
                tagText != null && tagText.length() != 0 &&
                plainText != null && plainText.length() != 0) {
            trimmed = true;
            break;
        }

        // if there is any html-tag in this line, handle it and add it (uncounted) to the output
        if (tagText != null && tagText.length() > 0) {
            boolean foundMatch = false;

            // if it's an "empty element" with or without xhtml-conform closing slash
            Matcher matcher = emptyTagPattern.matcher(tagText);
            if (matcher.find()) {
                foundMatch = true;
                // do nothing
            }

            // closing tag?
            if (!foundMatch) {
                matcher = closingTagPattern.matcher(tagText);
                if (matcher.find()) {
                    foundMatch = true;
                    // delete tag from openTags list
                    String tagName = matcher.group(1);
                    openTags.remove(tagName.toLowerCase());
                }
            }

            // opening tag?
            if (!foundMatch) {
                matcher = openingTagPattern.matcher(tagText);
                if (matcher.find()) {
                    // add tag to the beginning of openTags list
                    String tagName = matcher.group(1);
                    openTags.add(0, tagName.toLowerCase());
                }
            }

            // add html-tag to result
            result += tagText;
        }

        // calculate the length of the plain text part of the line; handle entities (e.g. &nbsp;) as one character
        int contentLength = plainText.replaceAll("&[0-9a-zA-Z]{2,8};|&#[0-9]{1,7};|&#[xX][0-9a-fA-F]{1,6};", " ").length();
        if (totalLength + contentLength > length) {
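            // the number of characters which are left; clamped at zero so that a
            // length shorter than the suffix cannot produce a negative substring index
            int numCharsRemaining = Math.max(0, length - totalLength);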
            int entitiesLength = 0;
            Matcher entityMatcher = entityPattern.matcher(plainText);
            while (entityMatcher.find()) {
                String entity = entityMatcher.group(1);
                if (numCharsRemaining > 0) {
                    numCharsRemaining--;
                    entitiesLength += entity.length();
                } else {
                    // no more characters left
                    break;
                }
            }

            // keep us from chopping words in half
            int proposedChopPosition = numCharsRemaining + entitiesLength;
            int endOfWordPosition = plainText.indexOf(" ", proposedChopPosition-1);
            if (endOfWordPosition == -1) {
                endOfWordPosition = plainText.length();
            }
            int endOfWordOffset = endOfWordPosition - proposedChopPosition;
            if (endOfWordOffset > 6) { // chop the word if it's extra long
                endOfWordOffset = 0;
            }

            proposedChopPosition = numCharsRemaining + entitiesLength + endOfWordOffset;
            if (plainText.length() >= proposedChopPosition) {
                result += plainText.substring(0, proposedChopPosition);
                proposingChop = true;
                if (proposedChopPosition < plainText.length()) {
                    trimmed = true;
                    break; // maximum length is reached, so get off the loop
                }
            } else {
                result += plainText;
            }
        } else {
            result += plainText;
            totalLength += contentLength;
        }
        // if the maximum length is reached, get off the loop
        if(totalLength >= length) {
            trimmed = true;
            break;
        }
    }

    for (String openTag : openTags) {
        result += "</" + openTag + ">";
    }
    if (trimmed) {
        result += suffix;
    }
    return result;
}

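【Comments】:

【Solution 4】:

Here is a PHP function: http://snippets.dzone.com/posts/show/7125

I made a quick-and-dirty Java port of the initial version, but the comments there contain improved follow-up versions that are worth considering (especially the one that handles whole words):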

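// Place this method inside a class; it needs these imports:
// import java.util.ArrayList;
// import java.util.Collections;
// import java.util.List;
// import java.util.regex.Matcher;
// import java.util.regex.Pattern;
// import org.apache.commons.lang3.StringUtils; // org.apache.commons.lang.StringUtils on Commons Lang 2.x
public static String truncateHtml(String s, int l) {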
  Pattern p = Pattern.compile("<[^>]+>([^<]*)");

  int i = 0;
  List<String> tags = new ArrayList<String>();

  Matcher m = p.matcher(s);
  while(m.find()) {
      if (m.start(0) - i >= l) {
          break;
      }

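      String tag = s.substring(m.start(0), m.start(1)); // the tag itself, without the text after it
      String t = StringUtils.split(tag, " \t\n\r\0\u000B>")[0].substring(1);
      if (tag.endsWith("/>")) {
          // self-closing tag such as <br/>: nothing to close later
      } else if (t.charAt(0) != '/') {
          tags.add(t);
      } else if (!tags.isEmpty() && tags.get(tags.size()-1).equals(t.substring(1))) {
          // the isEmpty() guard avoids an exception on a stray closing tag
          tags.remove(tags.size()-1);
      }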
      i += m.start(1) - m.start(0);
  }

  Collections.reverse(tags);
  return s.substring(0, Math.min(s.length(), l+i))
      + ((tags.size() > 0) ? "</"+StringUtils.join(tags, "></")+">" : "")
      + ((s.length() > l) ? "\u2026" : "");

}

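Note: StringUtils.split() and StringUtils.join() require Apache Commons Lang.

【Comments】:

【Solution 5】:

I can offer you a Python script I wrote for this: http://www.ellipsix.net/ext-tmp/summarize.txt. Unfortunately I don't have a Java version, but you can translate it yourself and adapt it to your needs if you like. It isn't very sophisticated, just something I cobbled together for my website, but I've been using it for a little over a year and it generally seems to work well.

If you want something robust, an XML (or SGML) parser will almost certainly do a better job than what I've done.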
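
To illustrate the parser-based route, here is a rough sketch that parses the input into a DOM, spends a character budget on text nodes in document order, and removes every node that comes after the budget runs out. It is not a translation of the Python script. It uses the JDK's built-in XML parser, so it assumes the input is well-formed XHTML with a single root element; real-world HTML would need a lenient HTML parser in front of it. The names DomTruncator, truncate and prune are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; assumes well-formed XHTML input.
final class DomTruncator {
    static String truncate(String xhtml, int maxChars) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            prune(doc.getDocumentElement(), new int[] {maxChars});
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Walks children in document order. remaining[0] is the shared character
    // budget; once it is used up, all later siblings are removed.
    private static void prune(Node node, int[] remaining) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // fetch before a possible removal
            if (remaining[0] <= 0) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue();
                if (text.length() > remaining[0]) {
                    child.setNodeValue(text.substring(0, remaining[0]));
                }
                remaining[0] -= text.length();
            } else {
                prune(child, remaining);
            }
            child = next;
        }
    }
}
```

Because the parser decodes entities before the text is counted and the serializer re-encodes them, an entity such as &amp; counts as one character and cannot be cut in half. Tags cannot be broken either, since only whole nodes are removed.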

【Comments】:

@David - Thanks, I'll take a look.

【Solution 6】:

I think you will need to write your own XML parser to do this. Pull out the body node and add nodes until the binary length [...] tagsoup.

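If you need an XML parser/handler, I recommend XOM.

【Comments】:

I guess that's what I'll have to do. I wanted to see whether there was anything else out there...
I've never heard of anyone needing to do this before, so I guess that's why there is no (at least easily found) solution.
Also, at least with XOM, you can easily check the size of the tree: root.toXML().getBytes().length returns the number of bytes in the string representation of the current XML tree. If you build the tree incrementally, you can check the byte count at each step and revert once it exceeds the desired number of bytes.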
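
The incremental idea from the last comment can be sketched as follows: detach the children of the root, re-add them one at a time, serialize after each step, and undo the last addition once the byte count goes over the limit. This sketch swaps XOM for the JDK's built-in DOM and Transformer so that it runs without extra libraries, and it assumes well-formed XHTML input and UTF-8 byte counting. The names ByteBudgetPreview, preview and serialize are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; uses the JDK DOM in place of XOM.
final class ByteBudgetPreview {
    static String preview(String xhtml, int maxBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Element root = doc.getDocumentElement();

            // Detach every child of the root so they can be re-added one by one.
            List<Node> children = new ArrayList<>();
            while (root.getFirstChild() != null) {
                children.add(root.removeChild(root.getFirstChild()));
            }

            for (Node child : children) {
                root.appendChild(child);
                int size = serialize(doc).getBytes(StandardCharsets.UTF_8).length;
                if (size > maxBytes) {
                    root.removeChild(child); // revert the step that went over budget
                    break;
                }
            }
            return serialize(doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    private static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Serializing the whole tree after every step is quadratic in the number of children, which is probably acceptable for a short preview but worth replacing with a running byte count for large documents. The sketch only works at the level of the root's direct children, so a single oversized child is dropped whole rather than cut.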