【Question title】: How to split a paragraph into sentences in Java
【Posted】: 2016-10-09 08:39:15
【Question】:

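I am trying to split a paragraph into sentences. The paragraph can contain a word like F.C.B, and it also includes HTML tags such as anchors and other tags. I tried the approach below, but it does not split the paragraph cleanly into sentences while leaving the HTML tags as they are.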

String.split("(?<!\\.[a-zA-Z])\\.(?![a-zA-Z]\\.)(?![<[^>]*>])");  

Can anyone help with a better regular expression, or any other idea?

【Comments】:

Tags: java html regex string


【Answer 1】:

You can try this (it needs java.util.regex.Pattern and java.util.regex.Matcher imported):

String par = "In 2004, Obama received national attention during his campaign to represent Illinois in the United States Senate with his victory in the March Democratic Party primary, his keynote address at the Democratic National Convention in July, and his election to the Senate in November. He began his presidential campaign in 2007 and, after a close primary campaign against Hillary Clinton in 2008, he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.";
Pattern pattern = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher matcher = pattern.matcher(par);
while (matcher.find()) {
    System.out.println(matcher.group());
}

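Let me know if it works.

【Comments】:

  • @RoYoMin Do you have a problem with the ")"? You can escape it with a special escape sequence, or ignore it.
  • @RoYoMi It should work; it is just that your text contains HTML. You can see the result here: link here. Do you need the HTML in the text for formatting?
【Answer 2】: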

Description

Rather than splitting on a character, it is easier to match and capture each sentence as a substring.

(?:<(?:(?:[a-z]+\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?|\/[a-z]+)>)|(?:(?!<)(?:[^.?!]|[.?!](?=\S)))*)+[.?!]

This regular expression will do the following:

  • Match each sentence
  • Allow substrings like F.C.B
  • Ignore HTML tags, but include them in the match
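
As a minimal sketch of the match-instead-of-split idea, the example below uses a deliberately simplified pattern: a '.', '?' or '!' ends a sentence unless it is immediately followed by a non-whitespace character, which is what keeps F.C.B together. It ignores HTML entirely, and the names SimpleSentenceMatcher and sentences are made up for illustration; this is not the full pattern from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleSentenceMatcher {
    // Simplified assumption: a terminator followed by non-whitespace (as in "F.C.B")
    // belongs to the sentence; any other '.', '?' or '!' ends it. No HTML handling.
    private static final Pattern SENTENCE =
            Pattern.compile("(?:[^.?!]|[.?!](?=\\S))*[.?!]");

    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            // Each match starts with the whitespace left over from the previous sentence.
            result.add(m.group().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("I like F.C.B very much. It works! Really?")) {
            System.out.println(s);
        }
    }
}
```

Because the pattern must end in a terminator, any trailing text without '.', '?' or '!' is not returned by this sketch.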

Note: in a Java string literal you need to escape every \ so that each one is written as \\
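
To make the escaping rule concrete, here is a small sketch. The pattern is a stand-in chosen for illustration (a terminator followed by whitespace or the end of input), and the names EscapeDemo, TERMINATOR and countTerminators are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {
    // Regex as typed into a regex tester:   [.?!](?=\s|$)
    // Same regex as a Java string literal:  every \ is doubled.
    static final String TERMINATOR = "[.?!](?=\\s|$)";

    static int countTerminators(String text) {
        Matcher m = Pattern.compile(TERMINATOR).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The literal "\\s" holds two characters: a backslash and 's'.
        System.out.println("\\s".length());                              // prints 2
        // The dots inside F.C.B are not counted; the three sentence ends are.
        System.out.println(countTerminators("F.C.B is here. Done! Ok?")); // prints 3
    }
}
```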

Example

Live demo

https://regex101.com/r/fJ9zS0/3

Sample text

I am was trying to split paragraph to sentences. The paragraph can have a word like F.C.B also it includes some html tag like anchor and other tags. I was trying to use like below but it was not perfect separating my paragraph to the specific sentence by living the html tag as it is.

In 2004, he <a href="http://test.pic.org/jpeg."> received </a> national attention during his Party primary, his keynote address July, <a onmouseover=" fnRotator('I like droids. '); "> and </a> his election to the Senate in November. He began his presidential campaign in he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.

Sample matches

Java Code Example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1 {
    public static void main(String[] asd) {
        String sourcestring = " ----your source string goes here----- ";
        Pattern re = Pattern.compile("(?:<(?:(?:[a-z]+\\s(?:[^>=]|='[^']*'|=\"[^\"]*\"|=[^'\"\\s]*)*\"\\s?\\/?|\\/[a-z]+)>)|(?:(?!<)(?:[^.?!]|[.?!](?=\\S)))*)+[.?!]", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()) {
            for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
                System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
            }
            mIdx++;
        }
    }
}
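
One detail of the loop above: the pattern contains only non-capturing groups, written (?:...), so m.groupCount() is 0 and the inner loop prints just group 0, the whole match. The sketch below checks this on a small stand-in pattern for a closing tag (not the full pattern), and also shows why Pattern.CASE_INSENSITIVE matters when tag names are written as [a-z]+. The names GroupCountDemo, matchesClosingTag and capturingGroups are invented for illustration.

```java
import java.util.regex.Pattern;

class GroupCountDemo {
    // Stand-in pattern: a closing tag, using a non-capturing group only.
    private static final String CLOSING_TAG = "<(?:/[a-z]+)>";

    static boolean matchesClosingTag(String input, boolean ignoreCase) {
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(CLOSING_TAG, flags).matcher(input).matches();
    }

    static int capturingGroups() {
        // groupCount() reports capturing groups in the pattern; group 0 is not included.
        return Pattern.compile(CLOSING_TAG).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(capturingGroups());                 // prints 0
        System.out.println(matchesClosingTag("</A>", true));   // prints true
        System.out.println(matchesClosingTag("</A>", false));  // prints false
    }
}
```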

Sample output

$matches Array:
(
    [0] => Array
        (
            [0] => I am was trying to split paragraph to sentences.
            [1] =>  The paragraph can have a word like F.C.B also it includes some html tag like anchor and other tags.
            [2] =>  I was trying to use like below but it was not perfect separating my paragraph to the specific sentence by living the html tag as it is.
            [3] => 

In 2004, he <a href="http://test.pic.org/jpeg."> received </a> national attention during his Party primary, his keynote address July, <a onmouseover=" fnRotator('I like droids. '); "> and </a> his election to the Senate in November.
            [4] =>  He began his presidential campaign in he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.
        )
    )
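
The output above keeps each tag inside its sentence even when an attribute value contains sentence punctuation followed by a space, as in the onmouseover example. A minimal sketch of that behaviour follows, under a much cruder assumption than the answer's pattern: a tag is simply '<', anything except '>', then '>'. The names TagAwareSentenceMatcher and sentences are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagAwareSentenceMatcher {
    // Simplified assumption: a tag is '<', anything except '>', then '>'.
    // Attribute values that contain '>' are NOT handled by this sketch.
    private static final Pattern SENTENCE = Pattern.compile(
            "(?:<[^>]*>"        // a whole tag; punctuation inside it is skipped
            + "|[^.?!<]"        // any ordinary character outside a tag
            + "|[.?!](?=\\S)"   // a terminator glued to the next character, e.g. F.C.B
            + ")*[.?!]");       // the terminator that actually ends the sentence

    static List<String> sentences(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = SENTENCE.matcher(html);
        while (m.find()) {
            result.add(m.group().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "He <a title=\"Hi there. Click\"> won </a> big. Then left.";
        for (String s : sentences(html)) {
            System.out.println(s);
        }
    }
}
```

With this input the period after "there" sits inside the tag, so the first sentence ends only at "big."; the tags themselves are kept in the matched text.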

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        [a-z]+                   any character of: 'a' to 'z' (1 or
                                 more times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
        (?:                      group, but do not capture (0 or more
                                 times (matching the most amount
                                 possible)):
----------------------------------------------------------------------
          [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          ='                       '=\''
----------------------------------------------------------------------
          [^']*                    any character except: ''' (0 or
                                   more times (matching the most
                                   amount possible))
----------------------------------------------------------------------
          '                        '\''
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          ="                       '="'
----------------------------------------------------------------------
          [^"]*                    any character except: '"' (0 or
                                   more times (matching the most
                                   amount possible))
----------------------------------------------------------------------
          "                        '"'
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          =                        '='
----------------------------------------------------------------------
          [^'"\s]*                 any character except: ''', '"',
                                   whitespace (\n, \r, \t, \f, and "
                                   ") (0 or more times (matching the
                                   most amount possible))
----------------------------------------------------------------------
        )*                       end of grouping
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
        \s?                      whitespace (\n, \r, \t, \f, and " ")
                                 (optional (matching the most amount
                                 possible))
----------------------------------------------------------------------
        \/?                      '/' (optional (matching the most
                                 amount possible))
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        \/                       '/'
----------------------------------------------------------------------
        [a-z]+                   any character of: 'a' to 'z' (1 or
                                 more times (matching the most amount
                                 possible))
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
      >                        '>'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        <                        '<'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        [^.?!]                   any character except: '.', '?', '!'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        [.?!]                    any character of: '.', '?', '!'
----------------------------------------------------------------------
        (?=                      look ahead to see if there is:
----------------------------------------------------------------------
          \S                       non-whitespace (all but \n, \r,
                                   \t, \f, and " ")
----------------------------------------------------------------------
        )                        end of look-ahead
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
  [.?!]                    any character of: '.', '?', '!'
----------------------------------------------------------------------

【Comments】:
