How to truncate a string after n words in Java? - Answers

【Question title】: How to truncate a string after n words in Java?
【Posted】: 2013-04-04 00:54:26
【Question description】:

Is there a library with a routine for truncating a string after n words? I'm looking for something that can turn:

truncateAfterWords(3, "hello, this\nis a long sentence");

into

"hello, this\nis"

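I could write it myself, but I thought something like this might already exist in some open source string manipulation library.

Here is the full list of test cases that I would expect any solution to pass: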

import java.util.regex.*;

public class Test {

    private static final TestCase[] TEST_CASES = new TestCase[]{
        new TestCase(5, null, null),
        new TestCase(5, "", ""),
        new TestCase(5, "single", "single"),
        new TestCase(1, "single", "single"),
        new TestCase(0, "single", ""),
        new TestCase(2, "two words", "two words"),
        new TestCase(1, "two words", "two"),
        new TestCase(0, "two words", ""),
        new TestCase(2, "line\nbreak", "line\nbreak"),
        new TestCase(1, "line\nbreak", "line"),
        new TestCase(2, "multiple  spaces", "multiple  spaces"),
        new TestCase(1, "multiple  spaces", "multiple"),
        new TestCase(3, " starts with space", " starts with space"),
        new TestCase(2, " starts with space", " starts with"),
        new TestCase(10, "A full sentence, with puncutation.", "A full sentence, with puncutation."),
        new TestCase(4, "A full sentence, with puncutation.", "A full sentence, with"),
        new TestCase(50, "Testing a very long number of words in the testcase to see if the solution performs well in such a situation.  Some solutions don't do well with lots of input.", "Testing a very long number of words in the testcase to see if the solution performs well in such a situation.  Some solutions don't do well with lots of input."),
    };

    public static void main(String[] args){
        for (TestCase t: TEST_CASES){
            try {
                String r = truncateAfterWords(t.n, t.s);
                if (!t.equals(r)){
                    System.out.println(t.toString(r));
                }
            } catch (Exception x){
                System.out.println(t.toString(x));
            }       
        }   
    }

    public static String truncateAfterWords(int n, String s) {
        // TODO: implementation
        return null;
    }
}


class TestCase {
    public int n;
    public String s;
    public String e;

    public TestCase(int n, String s, String e){
        this.n=n;
        this.s=s;
        this.e=e;
    }

    public String toString(){
        return "truncateAfterWords(" + n + ", " + toJavaString(s) + ")\n  expected: " + toJavaString(e);
    }

    public String toString(String r){
        return this + "\n  actual:   " + toJavaString(r) + "";
    }

    public String toString(Exception x){
        return this + "\n  exception: " + x.getMessage();
    }    

    public boolean equals(String r){
        if (e == null && r == null) return true;
        if (e == null) return false;
        return e.equals(r);
    }   

    public static final String escape(String s){
        if (s == null) return null;
        s = s.replaceAll("\\\\","\\\\\\\\");
        s = s.replaceAll("\n","\\\\n");
        s = s.replaceAll("\r","\\\\r");
        s = s.replaceAll("\"","\\\\\"");
        return s;
    }

    private static String toJavaString(String s){
        if (s == null) return "null";
        return " \"" + escape(s) + "\"";
    }
}

There are solutions for other languages on this site.

【Comments】:

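I don't think such a function exists; it seems quite specialized.
You could use split() to break the string into words on " ", then count them and discard everything past the third. But no, I have never come across anything like this.
I thought about split, but it tends to throw away whatever you split on. I want to preserve the whitespace and line breaks in the string.
Rather than String.split(), I would prefer the Scanner class's next() method.
My answer below also works for your edited input string hello, this\nis a long sentence.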
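
The concern about split() raised in the comments above can be illustrated with a minimal sketch. The helper name truncateBySplit is hypothetical and not from any library; it splits on whitespace and re-joins with single spaces, so the original line break is lost:

```java
import java.util.Arrays;

final class SplitLosesWhitespaceDemo {
    // Hypothetical helper: keeps the first n whitespace-separated words,
    // but re-joins them with single spaces, discarding the original separators.
    static String truncateBySplit(int n, String s) {
        String[] words = s.trim().split("\\s+");
        int count = Math.max(0, Math.min(n, words.length));
        return String.join(" ", Arrays.copyOfRange(words, 0, count));
    }

    public static void main(String[] args) {
        // prints "hello, this is" on one line: the \n has become a space
        System.out.println(truncateBySplit(3, "hello, this\nis a long sentence"));
    }
}
```

The desired result was "hello, this\nis", which is why the question asks for an approach that leaves the separators untouched.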

Tags: java string

【Solution 1】:

You can use a simple regex-based solution:

private String truncateAfterWords(int n, String str) {
   return str.replaceAll("^((?:\\W*\\w+){" + n + "}).*$", "$1");    
}

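Live demo: http://ideone.com/Nsojc7

Update: to address the performance issue raised in your comments:

Use the following approach for faster performance when handling a large number of words: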

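// Zero-width match at the end of each word: a word boundary preceded by a word character
private final static Pattern WB_PATTERN = Pattern.compile("(?<=\\w)\\b");

private String truncateAfterWords(int n, String s) {
   if (s == null) return null;
   if (n <= 0) return "";
   Matcher m = WB_PATTERN.matcher(s);
   // Advance to the end of the n-th word, stopping early if the input runs out
   for (int i=0; i<n && m.find(); i++);
   // The last search ran into the end of the input, so there is nothing to cut off
   if (m.hitEnd())
      return s;
   else
      return s.substring(0, m.end());
}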

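【Comments】:

Unfortunately, the performance of this solution is problematic. Here is a test case that appears to go into an infinite loop: truncateAfterWords(50, "Testing test testing as a test of testing testing more test.")
@StephenOstermiller: Check my update now.
That doesn't compile: start is not defined. I thought you might have meant m.start(), but that throws an exception when the loop terminates because no more matches were found.
I got a version similar to your second solution working and posted it as a solution here: stackoverflow.com/a/16049290/1145388
Oh, sorry, it is very late for me; it should actually be m.end(). Edited again, please check now.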
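
The slowdown reported above is consistent with regex backtracking: in (?:\W*\w+){n}, \W* may match nothing and \w+ may give characters back, so when the input has fewer than n words a single word can be divided among the repetitions in a very large number of ways before the match finally fails. One possible way to keep the first regex's definition of a word while avoiding this is to use possessive quantifiers, sketched below. This is an illustration under that assumption, not the answer author's code, and the class name is hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PossessiveTruncate {
    static String truncateAfterWords(int n, String s) {
        if (s == null) return null;
        if (n <= 0) return "";
        // *+ and ++ are possessive: once matched, characters are never given back,
        // so a failed match is reported without trying alternative ways to split words.
        Pattern p = Pattern.compile("(?:\\W*+\\w++){" + n + "}");
        Matcher m = p.matcher(s);
        // lookingAt() only attempts a match anchored at the start of the input.
        return m.lookingAt() ? m.group() : s;
    }

    public static void main(String[] args) {
        // Fewer than 50 words, so the input is printed unchanged.
        System.out.println(truncateAfterWords(50,
            "Testing test testing as a test of testing testing more test."));
    }
}
```

Like the original regex, this treats every run of \w characters as a word, so "don't" counts as two words.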

【Solution 2】:

I found a way to do it using the java.text.BreakIterator class:

private static String truncateAfterWords(int n, String s) {
    if (s == null) return null;
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(s);
    int pos = 0;
    for (int i = 0; i < n && pos != BreakIterator.DONE && pos < s.length();) {
        if (Character.isLetter(s.codePointAt(pos))) i++;
        pos = wb.next();
    }
    if (pos == BreakIterator.DONE || pos >= s.length()) return s;
    return s.substring(0, pos);
}
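
To see why the loop above checks Character.isLetter before counting, it helps to look at what the word iterator returns: it reports every boundary, so runs of whitespace and punctuation come back as segments too. A minimal sketch follows; the class and method names are hypothetical, and Locale.ROOT is chosen here only so the result does not depend on the default locale:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class WordBoundaryDemo {
    // Returns the text between each pair of consecutive word boundaries.
    static List<String> segments(String s) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.ROOT);
        wb.setText(s);
        List<String> out = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            out.add(s.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // The space between the words is returned as its own segment:
        // "two", " ", "words"
        System.out.println(segments("two words"));
    }
}
```

Because only segments that start with a letter are counted, a word that begins with a digit would not be counted by the solution above.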

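【Comments】:

【Solution 3】:

Here is a version that uses a regular expression in a loop to find the next run of whitespace until it has enough words. It is similar to the BreakIterator solution, but uses a regular expression to iterate over the word breaks.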

// Any number of white space or the end of the input
private final static Pattern SPACES_PATTERN = Pattern.compile("\\s+|\\z");

private static String truncateAfterWords(int n, String s) {
    if (s == null) return null;
    Matcher matcher = SPACES_PATTERN.matcher(s);
    int matchStartIndex = 0, matchEndIndex = 0, wordsFound = 0;
    // Keep matching until enough words are found, 
    // reached the end of the string, 
    // or no more matches
    while (wordsFound<n && matchEndIndex<s.length() && matcher.find(matchEndIndex)){
        // Keep track of both the start and end of each match
        matchStartIndex = matcher.start();
        matchEndIndex = matchStartIndex + matcher.group().length();
        // Only increment words found when not at the beginning of the string
        if (matchStartIndex != 0) wordsFound++;
    }
    // From the beginning of the string to the start of the final match
    return s.substring(0, matchStartIndex);
}
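
The same idea (walk forward to the whitespace that follows the n-th word and cut there) can also be sketched without a regular expression. This is a minimal illustration, assuming words are simply runs of non-whitespace characters; the class name is hypothetical and the code is not taken from any library:

```java
final class ScanTruncate {
    static String truncateAfterWords(int n, String s) {
        if (s == null) return null;
        if (n <= 0) return "";
        int words = 0;
        boolean inWord = false;
        for (int i = 0; i < s.length(); i++) {
            boolean ws = Character.isWhitespace(s.charAt(i));
            if (!ws) {
                inWord = true;
            } else if (inWord) {
                // First whitespace character after a word: that word is complete.
                inWord = false;
                if (++words == n) return s.substring(0, i);
            }
        }
        // Fewer than n complete words were followed by whitespace: nothing to cut.
        return s;
    }

    public static void main(String[] args) {
        // Prints "hello, this" and then "is": the line break inside the result is kept.
        System.out.println(truncateAfterWords(3, "hello, this\nis a long sentence"));
    }
}
```

Leading whitespace is kept and does not count as a word, which matches the " starts with space" cases in the question.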

【Comments】:

【Solution 4】:

Try using regular expressions in Java. The regex for retrieving only n words is: (.*?\s){n}.

Try this code:

String inputStr= "hello, this\nis a long sentence";
Pattern pattern = Pattern.compile("(.*?[\\s]){3}", Pattern.DOTALL); 
Matcher matcher = pattern.matcher(inputStr);
matcher.find(); 
String result = matcher.group(); 
System.out.println(result);

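【Comments】:

Nice idea, but that regex doesn't work for me. This produces no output: Matcher m = Pattern.compile("(.*?\\b){3}").matcher("hello, this is a long sentence");m.find();System.out.println(m.group(0));
Use this code @StephenOstermiller: it works.... String inputStr = "hello, this is a long sentence"; Pattern pattern = Pattern.compile("(.*?[\\s\\n]){3}", Pattern.DOTALL); Matcher matcher = pattern.matcher(inputStr); matcher.find(); String result = matcher.group(); System.out.println(result);
I wrote a full set of test cases and added them to the question. This solution fails several of them and goes into an infinite loop on long input.
Sorry for the delayed reply. Use the regex (.*[\\s\\n]{1,}). The earlier pattern was an example of how to approach the problem, not a complete regex. Thanks.
There was a typo. Use this regex: (.*?[\\s\\n]){1,}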