How to detect the difference between ' as used in an abbreviation and as quotation markers: answers

【Question title】: How to detect the difference between ' as used in an abbreviation and as quotation markers
【Posted】: 2012-05-18 11:07:10
【Question】:

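I'm trying to parse blocks of text and need a way to tell apostrophes in different contexts apart. Possession and abbreviation belong in one group, quotations in the other.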

e.g.

"I'm the cars' owner" -> ["I'm", "the", "cars'", "owner"]

but

"He said 'hello there'" -> ["He", "said", "'hello there'"]

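Detecting whitespace on either side won't help, because things like "'ello" and "cars'" would be parsed as one end of a quotation, exactly like a matching pair of apostrophes. I get the feeling there is no way of doing this short of an outrageously complicated NLP solution, and that I will have to ignore any apostrophe that does not occur mid-word, which would be unfortunate.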

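EDIT:

Since writing this I have realised it is impossible. Any regex-based parser would have to parse:

'ello there my mates' dog

in two different ways, and it could only choose between them by understanding the rest of the sentence. I guess I'm going for the inelegant solution of ignoring the least likely case and hoping it is rare enough to cause only infrequent anomalies.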
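To make the ambiguity concrete, here is a minimal Ruby sketch. The constant `NAIVE_QUOTE` and the method `naive_quotes` are hypothetical names introduced for illustration; they are not part of the question. The pattern treats any pair of apostrophes as a quotation, and it happily "finds" a quote in a sentence that may contain none.

```ruby
# Hypothetical, deliberately naive pattern: any '...' pair counts as a quotation.
NAIVE_QUOTE = /'[^']+'/

# Returns every substring the naive pattern considers a quotation.
def naive_quotes(text)
  text.scan(NAIVE_QUOTE)
end

# A genuine quotation is found, as expected.
p naive_quotes("He said 'hello there'")      # => ["'hello there'"]

# A dropped "h" plus a plural possessive looks exactly the same to the regex.
p naive_quotes("'ello there my mates' dog")  # => ["'ello there my mates'"]
```

Nothing in the characters themselves distinguishes the two cases, which is why the question falls back on ignoring the least likely reading.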

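【Comments】:

The number of contractions is relatively small compared with the number of possessives.
In dialects such as British English, definitely. There are of course other words with a leading contraction, although many are customarily written without the apostrophe; you will still occasionally see 'phone (telephone), 'cello (violoncello), and so on.
The problem is people using punctuation correctly in some cases ('hello, 'phone, etc.) and incorrectly in others (using ' instead of "). If we could stick to one or the other, parsing would be easy.

Tags: ruby regex parsing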

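【Solution 1】:

Hm, I'm afraid this won't be easy. Here is a regex that sort of works; alas, the lookbehind only rules out contractions that follow "I", such as "I'm" and "I've":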

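# s1 and s2 are assumed to be the two example strings from the question
>> s1 = "I'm the cars' owner"
=> "I'm the cars' owner"
>> s2 = "He said 'hello there'"
=> "He said 'hello there'"
>> s1 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> nil
>> s2 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> 0
>> $1
=> "'hello there'"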

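If you play with it a bit more you may be able to exclude some other common contractions; either way, it is better than nothing.

【Comments】:

【Solution 2】:

Some rules to consider:

A quotation will start with an apostrophe that has a whitespace character, or nothing at all, before it.
A quotation will end with an apostrophe that has punctuation or a whitespace character after it.
Some words may look like the end of a quotation, e.g. peoples'.
An apostrophe that delimits a quotation will never have a letter both directly before and directly after it.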
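The rules above can be sketched as a small Ruby classifier. This is only an illustration under stated assumptions: the method name `classify_apostrophes` and the labels `:inside_word`, `:maybe_open`, `:maybe_close` and `:unknown` are invented here, and the end of the string is treated like trailing whitespace. As the third rule warns, `:maybe_close` cannot tell a closing quote from a plural possessive.

```ruby
# Label each apostrophe in +text+ according to the neighbouring characters.
# Returns one symbol per apostrophe, in order of appearance.
def classify_apostrophes(text)
  kinds = []
  text.each_char.with_index do |ch, i|
    next unless ch == "'"

    before = i.zero? ? nil : text[i - 1]
    after  = text[i + 1] # nil when the apostrophe is the last character

    letter_before = !before.nil? && before.match?(/[[:alpha:]]/)
    letter_after  = !after.nil? && after.match?(/[[:alpha:]]/)

    kinds <<
      if letter_before && letter_after
        :inside_word   # rule 4: never a quotation delimiter
      elsif before.nil? || before.match?(/\s/)
        :maybe_open    # rule 1: whitespace or nothing before it
      elsif after.nil? || after.match?(/[\s[:punct:]]/)
        :maybe_close   # rule 2, but see rule 3: could be "peoples'"
      else
        :unknown
      end
  end
  kinds
end

p classify_apostrophes("I'm the cars' owner")
p classify_apostrophes("He said 'hello there'")
```

For the two example sentences from the question, the first yields an in-word apostrophe followed by a possible closer (really a possessive), and the second yields an opener followed by a closer.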

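【Comments】:

【Solution 3】:

Use a very simple two-pass process.

In pass 1 of 2, start with this regular expression to break the text into alternating segments of word and non-word characters.

/(\w+)|(\W+)/gi

Store the matches in a list like this (I'm using AS3-style pseudo-code, since I don't work with Ruby):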

class MatchedWord
{
    var text:String;
    var charIndex:int;
    var isWord:Boolean;
    var isContraction:Boolean = false;
    function MatchedWord( text:String, charIndex:int, isWord:Boolean )
    {
        this.text = text; this.charIndex = charIndex; this.isWord = isWord;
    }
}
var match:Object;
var matched_words:Vector.<MatchedWord> = new Vector.<MatchedWord>();
var words_regex:RegExp = /(\w+)|(\W+)/gi;
//lastIndex is where exec starts looking for matches; it is updated to the end of the last match each time exec is called
words_regex.lastIndex = 0;
while ((match = words_regex.exec( original_text )) != null)
{
    //match[0] is the entire match and match[1] is the first parenthetical group (if it's null, then it's not a word and match[2] would be non-null)
    matched_words.push( new MatchedWord( match[0], match.index, match[1] != null ) );
}
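Since the question is tagged ruby, pass 1 might look roughly like this in Ruby. It is a sketch, not a translation endorsed by the answer's author: `MatchedWord` mirrors the AS3 class above, while the method name `split_words` is an assumption made for this example.

```ruby
# Mirrors the AS3 MatchedWord class: the text, its start offset, and two flags.
MatchedWord = Struct.new(:text, :char_index, :is_word, :is_contraction)

# Pass 1: split +text+ into alternating word / non-word segments.
def split_words(text)
  segments = []
  text.scan(/(\w+)|(\W+)/) do
    m = Regexp.last_match # MatchData for the current match
    # m[1] is set for word segments, m[2] for non-word segments.
    segments << MatchedWord.new(m[0], m.begin(0), !m[1].nil?, false)
  end
  segments
end

split_words("I'm here").each do |w|
  puts "#{w.char_index}: #{w.text.inspect} word=#{w.is_word}"
end
```

`String#scan` with a block leaves the current `MatchData` in `Regexp.last_match`, which is what gives access to the start offset that AS3 exposes as `match.index`.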

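In pass 2 of 2, iterate over the list of matches looking for contractions by checking whether each (trimmed, non-word) match ends with an apostrophe. If it does, check the next adjacent (word) match to see whether it is one of the common contraction endings. Of all the two-part contractions I could think of, there are only 8 common endings.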

d
l
ll
m
re
s
t
ve

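Once you have identified such a pair of matches, (non-word)="'" and (word)="d", you simply include the preceding adjacent (word) match and concatenate the three matches to get your contraction.

With that process understood, one modification you have to make is to expand the list of contraction endings to include contractions that start with an apostrophe, such as "'twas" and "'tis". For those, you do not concatenate the preceding adjacent (word) match at all, and you look more closely at the apostrophe match to see whether it contains other non-word characters before the apostrophe (which is why it matters that it ends with an apostrophe). If the trimmed string EQUALS an apostrophe, merge it with the next match; if it only ENDS with an apostrophe, strip the apostrophe off and merge that with the next match. Likewise, the conditions that include the prior match should first check that the (trimmed, non-word) match ending in an apostrophe EQUALS an apostrophe, so that no extra non-word characters are included by accident.

Another modification you may need is to expand the list of 8 endings to include endings that are whole words, as in "g'day" and "g'night". Again, this is a simple modification that involves a conditional check of the preceding (word) match: if it is "g", include it.

This process should catch the majority of contractions, and it is flexible enough to take new ones as you think of them.

The data structure would look like this.

Condition(Ending, PreCondition)

where PreCondition is

"*", "!", or "<exact string>"

The final list of conditions would look like this: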

new Condition("d","*"); //if apostrophe d is found, include the preceding word string and count as successful contraction match
new Condition("l","*");
new Condition("ll","*");
new Condition("m","*");
new Condition("re","*");
new Condition("s","*");
new Condition("t","*");
new Condition("ve","*");
new Condition("twas","!"); //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
new Condition("tis","!");
new Condition("day","g"); //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
new Condition("night","g");
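As a rough Ruby counterpart to the condition list above, the three kinds of precondition can be expressed as a lookup plus a small predicate. The names `Condition`, `find_condition` and `precondition_met?` are assumptions made for this sketch; only the endings and the "*" / "!" / exact-string convention come from the answer.

```ruby
# One entry per contraction ending, with the rule for the preceding word.
Condition = Struct.new(:ending, :precondition)

CONDITIONS = [
  Condition.new("d", "*"), Condition.new("l", "*"), Condition.new("ll", "*"),
  Condition.new("m", "*"), Condition.new("re", "*"), Condition.new("s", "*"),
  Condition.new("t", "*"), Condition.new("ve", "*"),
  Condition.new("twas", "!"), Condition.new("tis", "!"),
  Condition.new("day", "g"), Condition.new("night", "g")
].freeze

# Returns the Condition for +ending+, or nil if it is not a known ending.
def find_condition(ending)
  CONDITIONS.find { |c| c.ending == ending }
end

# "*" needs any preceding word, "!" ignores it, anything else must match exactly.
def precondition_met?(condition, previous_word)
  case condition.precondition
  when "*" then !previous_word.nil?
  when "!" then true
  else previous_word == condition.precondition
  end
end

p precondition_met?(find_condition("day"), "g")     # g'day
p precondition_met?(find_condition("day"), "to")    # "to" + 'day is not a contraction
p precondition_met?(find_condition("twas"), nil)    # 'twas needs no preceding word
```

Keeping the rules in data rather than in branches is what makes the approach easy to extend: a new contraction is one more `Condition.new` line.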

If you simply process those conditions as I explained, that should cover all of these 86 contractions (and more):

'tis, 'twas, ain't, aren't, can't, ..., everybody's, ..., how'll, how's, I'd, I'll, ..., she'll, ..., they'll, ..., we'll, ..., what's, ..., where's, who'd, who's, ..., won't, ..., you'll, ...

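As a side note, don't forget slang contractions that do not use an apostrophe, such as "gotta" > "got to" and "gonna" > "going to".

Here is the final AS3 code. Overall, you need fewer than 50 lines of code to parse the text into alternating word and non-word groups and to identify and merge contractions. Simple. You could even add a boolean "isContraction" variable to the MatchedWord class and set the flag in the code below whenever a contraction is identified.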

//Automatically merge known contractions
var conditions:Array = [
    ["d","*"], //if apostrophe d is found, include the preceding word string and count as successful contraction match
    ["l","*"],
    ["ll","*"],
    ["m","*"],
    ["re","*"],
    ["s","*"],
    ["t","*"],
    ["ve","*"],
    ["twas","!"], //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
    ["tis","!"],
    ["day","g"], //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
    ["night","g"]
];
for (var i:int = 0; i < matched_words.length - 1; i++) //not a typo: intentionally stopping at the next-to-last index to avoid a bounds check in the loop
{
    var m:MatchedWord = matched_words[i];
    var apostrophe_text:String = StringUtils.trim( m.text ); //check if this ends with an apostrophe first, then deal more closely with it
    if (!m.isWord && StringUtils.endsWith( apostrophe_text, "'" ))
    {
        var m_next:MatchedWord = matched_words[i + 1]; //no bounds check necessary, since loop intentionally stopped at next to last index
        var m_prev:MatchedWord = ((i - 1) >= 0) ? matched_words[i - 1] : null; //bounds check necessary for previous match, since we're starting at beginning, since we may or may not need to look at the prior match depending on the precondition
        for each (var condition:Array in conditions)
        {
            if (StringUtils.trim( m_next.text ) == condition[0])
            {
                var pre_condition:String = condition[1];
                switch (pre_condition)
                {
                    case "*": //success after one final check, include prior match, merge current and next match into prior match and delete current and next match
                        if (m_prev != null && apostrophe_text == "'") //EQUAL apostrophe, not just ENDS with apostrophe
                        {
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
                            matched_words.splice( i, 2 );
                        }
                        break;
                    case "!": //success after one final check, do not include prior match, merge current and next match, and delete next match
                        if (apostrophe_text == "'")
                        {
                            m.text += m_next.text;
                            m.isWord = true; //match now includes word text so flip it to a "word" block for logical consistency
                            m.isContraction = true;
                            matched_words.splice( i + 1, 1 );
                        }
                        else
                        {   //strip apostrophe off end and merge with next item, nothing needs deleted
                            //preserve spaces and match start indexes by manipulating untrimmed strings
                            var apostrophe_end:int = m.text.lastIndexOf( "'" );
                            var apostrophe_ending:String = m.text.substring( apostrophe_end, m.text.length );
                            m.text = m.text.substring( 0, m.text.length - apostrophe_ending.length); //strip apostrophe and any trailing spaces
                            m_next.text = apostrophe_ending + m_next.text;
                            m_next.charIndex = m.charIndex + apostrophe_end;
                            m_next.isContraction = true;
                        }
                        break;
                    default: //conditional success, check prior match meets condition
                        if (m_prev != null && apostrophe_text == "'" && m_prev.text == pre_condition) //EQUAL apostrophe here too, so no extra non-word characters get merged
                        {
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
                            matched_words.splice( i, 2 );
                        }
                        break;
                }
            }
        }
    }
}
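For readers working in Ruby, the two passes can be sketched together as follows. This is an adaptation under stated assumptions, not the answer's code: the names `Segment`, `tokenize` and `merge_contractions` are invented here, endings are compared case-insensitively (so "'Twas" also matches), and a segment only counts if its untrimmed text ends with an apostrophe, which is stricter than the AS3 version.

```ruby
Segment = Struct.new(:text, :char_index, :is_word, :is_contraction)

# [ending, precondition] pairs, same convention as the AS3 list above.
CONDITIONS = [
  %w[d *], %w[l *], %w[ll *], %w[m *], %w[re *], %w[s *], %w[t *], %w[ve *],
  %w[twas !], %w[tis !],
  %w[day g], %w[night g]
].freeze

# Pass 1: alternating word / non-word segments with their start offsets.
def tokenize(text)
  segments = []
  text.scan(/\w+|\W+/) do
    m = Regexp.last_match
    segments << Segment.new(m[0], m.begin(0), m[0].match?(/\A\w/), false)
  end
  segments
end

# Pass 2: merge known contractions in place and return the segment list.
def merge_contractions(segments)
  i = 0
  while i < segments.length - 1
    seg = segments[i]
    nxt = segments[i + 1]
    prev = i.zero? ? nil : segments[i - 1]
    rule = nil
    if !seg.is_word && seg.text.end_with?("'")
      rule = CONDITIONS.find { |ending, _pre| ending == nxt.text.downcase }
    end
    if rule.nil?
      i += 1
      next
    end

    pre = rule[1]
    bare = seg.text == "'" # exactly an apostrophe, nothing else
    if pre == "!"
      if bare
        # 'twas / 'tis: merge the apostrophe and the ending, ignore the prior word.
        seg.text += nxt.text
        seg.is_word = true
        seg.is_contraction = true
        segments.delete_at(i + 1)
      else
        # e.g. " '" + "twas": move only the apostrophe onto the next segment.
        seg.text = seg.text.chop
        nxt.text = "'" + nxt.text
        nxt.char_index -= 1
        nxt.is_contraction = true
      end
      i += 1
    elsif bare && prev && prev.is_word && (pre == "*" || prev.text.downcase == pre)
      # word + ' + ending: fold all three into the preceding word.
      prev.text += seg.text + nxt.text
      prev.is_contraction = true
      segments.slice!(i, 2) # stay at i so chains like I'd've keep merging
    else
      i += 1
    end
  end
  segments
end

p merge_contractions(tokenize("I'm sure 'twas g'day, he said 'hello'")).map(&:text)
```

Because the index is not advanced after a three-way merge, the segment that slides into position `i` is examined next, which lets chained forms such as "I'd've" collapse into a single word. Quotation apostrophes such as those around 'hello' are left untouched for a later step.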

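【Comments】:

I should also add that once you have identified the contractions and filtered them out, it becomes much easier to go back and deal with quotation marks and plural possessives (I say plural because that code will already catch most singular possessives, which end in "'s"). Of course, if you single-quote a string and use a plural possessive inside it, you have a hard problem, because it is syntactically ambiguous, unless... you simply identify an opening quote and then treat every closing apostrophe as a plural possessive until the next definite closing quote or opening quote, or something along those lines.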