Use a very simple two-pass process.
In pass 1 of 2, start with this regular expression to break the text into alternating runs of word and non-word characters.
/(\w+)|(\W+)/gi
Store the matches in a list like this (I'm using AS3-style pseudocode, since I don't use Ruby):
class MatchedWord
{
    var text:String;
    var charIndex:int;
    var isWord:Boolean;
    var isContraction:Boolean = false;

    function MatchedWord( text:String, charIndex:int, isWord:Boolean )
    {
        this.text = text;
        this.charIndex = charIndex;
        this.isWord = isWord;
    }
}

var match:Object;
var matched_words:Vector.<MatchedWord> = new Vector.<MatchedWord>();
var words_regex:RegExp = /(\w+)|(\W+)/gi;

//lastIndex is where exec starts looking for matches; it is updated to the end of the last match each time exec is called
words_regex.lastIndex = 0;
while ((match = words_regex.exec( original_text )) != null)
{
    //match[0] is the entire match and match[1] is the first parenthetical group
    //(if it's null, then it's not a word and match[2] would be non-null)
    matched_words.push( new MatchedWord( match[0], match.index, match[1] != null ) );
}
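For readers who don't work in AS3, pass 1 can be sketched in TypeScript roughly as follows. This is only an illustrative port under my own assumptions: the `tokenize` function name and the use of a plain interface instead of a class are not from the code above.

```typescript
// Pass 1 sketch: split text into alternating word / non-word tokens.
// The MatchedWord shape mirrors the AS3 class above; tokenize() is a hypothetical name.
interface MatchedWord {
  text: string;
  charIndex: number;
  isWord: boolean;
  isContraction: boolean;
}

function tokenize(originalText: string): MatchedWord[] {
  const wordsRegex = /(\w+)|(\W+)/g;
  const matchedWords: MatchedWord[] = [];
  let match: RegExpExecArray | null;
  // exec() advances lastIndex past each match, so the loop walks the whole string.
  while ((match = wordsRegex.exec(originalText)) !== null) {
    matchedWords.push({
      text: match[0],                 // the entire match
      charIndex: match.index,         // where the match starts in the original text
      isWord: match[1] !== undefined, // group 1 is set only for word runs
      isContraction: false,
    });
  }
  return matchedWords;
}

const tokens = tokenize("I'd go");
console.log(tokens.map((t) => t.text)); // texts: "I", "'", "d", " ", "go"
```

Note that `\w` also matches digits and underscores, so "word" here means any run of those characters.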
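In pass 2 of 2, iterate over the list of matches looking for contractions by checking whether each (trimmed, non-word) match ends with an apostrophe. If it does, check the next adjacent (word) match to see whether it is one of the common contraction endings. For all the two-part contractions I could think of, there are only 8 common endings: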
d
l
ll
m
re
s
t
ve
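Once you've identified such a pair of matches, (non-word)="'" and (word)="d", you just include the preceding adjacent (word) match and concatenate the three matches to get your contraction.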
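The basic three-way merge can be sketched in TypeScript like this. It is a simplified illustration, not the final code: `mergeSimpleContractions` is a hypothetical name, it works on plain strings rather than match objects, and it assumes the non-word match must be exactly an apostrophe (no trimming).

```typescript
// Pass 2 sketch (basic case only): word + "'" + known ending => one contraction token.
const ENDINGS = new Set(["d", "l", "ll", "m", "re", "s", "t", "ve"]);

function mergeSimpleContractions(text: string): string[] {
  const tokens: string[] = text.match(/\w+|\W+/g) ?? [];
  const out: string[] = [];
  let i = 0;
  while (i < tokens.length) {
    const prev: string | undefined = out.length > 0 ? out[out.length - 1] : undefined;
    const next: string | undefined = tokens[i + 1];
    if (
      tokens[i] === "'" &&
      prev !== undefined && /^\w/.test(prev) && // previous token is a word (or an already merged contraction)
      next !== undefined && ENDINGS.has(next)
    ) {
      out[out.length - 1] = prev + "'" + next; // concatenate the three matches
      i += 2;                                  // skip the apostrophe and the ending
    } else {
      out.push(tokens[i]);
      i += 1;
    }
  }
  return out;
}

console.log(mergeSimpleContractions("I'd say who'd've known"));
```

Because the merged token is written back to the end of the output list, a second apostrophe right after it (as in "who'd've") is merged onto the same token.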
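With that process in mind, one modification you'll have to make is to expand the list of contraction endings to include contractions that start with an apostrophe, such as "'twas" and "'tis". For those, you simply don't concatenate the preceding adjacent (word) match, and you look a little more closely at the apostrophe match to see whether it contains other non-word characters before the apostrophe (this is why it matters that the match only has to END with an apostrophe). If the trimmed string EQUALS an apostrophe, merge it with the next match. If it only ENDS with an apostrophe, strip the apostrophe off and merge it with the next match. Likewise, the conditions that include the prior match should first check that the (trimmed, non-word) match ending with an apostrophe EQUALS an apostrophe, so that extra non-word characters are not accidentally included.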
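Here is a minimal TypeScript sketch of just the leading-apostrophe case. The function name `mergeLeadingApostrophe` is made up for illustration, and two things are my own simplifications: the lookup is case-insensitive (so "'Twas" is caught too), and the apostrophe is always split off from any preceding non-word characters, which folds the "equals" and "ends with" branches into one.

```typescript
// Sketch: contractions that START with an apostrophe ('tis, 'twas).
const LEADING = new Set(["tis", "twas"]);

function mergeLeadingApostrophe(text: string): string[] {
  const tokens: string[] = text.match(/\w+|\W+/g) ?? [];
  const out: string[] = [];
  for (let i = 0; i < tokens.length; i++) {
    const tok: string = tokens[i];
    const next: string | undefined = tokens[i + 1];
    if (tok.endsWith("'") && next !== undefined && LEADING.has(next.toLowerCase())) {
      const before = tok.slice(0, -1);        // non-word characters ahead of the apostrophe
      if (before.length > 0) out.push(before); // keep them as their own non-word token
      out.push("'" + next);                    // apostrophe + word, no preceding word included
      i++;                                     // the next token has been consumed
    } else {
      out.push(tok);
    }
  }
  return out;
}

console.log(mergeLeadingApostrophe("Well, 'tis late"));
```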
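Another modification you may need to make is to expand the list of 8 endings to include endings that are whole words, such as "g'day" and "g'night". Again, it's a simple modification involving a conditional check of the preceding (word) match. If it's "g", include it.
This process should capture the majority of contractions, and it is flexible enough to include new ones that you can think of.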
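The data structure would look like this.
Condition(Ending, PreCondition)
where PreCondition is
"*", "!", or "<exact string>"
("*" means include whatever word precedes the apostrophe, "!" means exclude the preceding word, and an exact string means include the preceding word only if it equals that string.)
The final list of conditions would look like this: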
new Condition("d","*"); //if apostrophe d is found, include the preceding word string and count as successful contraction match
new Condition("l","*");
new Condition("ll","*");
new Condition("m","*");
new Condition("re","*");
new Condition("s","*");
new Condition("t","*");
new Condition("ve","*");
new Condition("twas","!"); //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
new Condition("tis","!");
new Condition("day","g"); //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
new Condition("night","g");
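To show how the three kinds of precondition might be evaluated, here is a small TypeScript sketch. The `Condition` interface and the `resolve` helper are hypothetical names of my own, and the sketch assumes the apostrophe check has already passed; it only decides what to do with the preceding word.

```typescript
// Sketch: evaluating Condition(Ending, PreCondition) for a word that follows an apostrophe.
interface Condition {
  ending: string;
  preCondition: string; // "*", "!", or an exact preceding word
}

const CONDITIONS: Condition[] = [
  { ending: "d", preCondition: "*" },
  { ending: "l", preCondition: "*" },
  { ending: "ll", preCondition: "*" },
  { ending: "m", preCondition: "*" },
  { ending: "re", preCondition: "*" },
  { ending: "s", preCondition: "*" },
  { ending: "t", preCondition: "*" },
  { ending: "ve", preCondition: "*" },
  { ending: "twas", preCondition: "!" },
  { ending: "tis", preCondition: "!" },
  { ending: "day", preCondition: "g" },
  { ending: "night", preCondition: "g" },
];

// Returns the merged contraction text, or null when no condition is satisfied.
function resolve(prevWord: string | null, ending: string): string | null {
  for (const c of CONDITIONS) {
    if (c.ending !== ending) continue;
    if (c.preCondition === "!") return "'" + ending; // never include the preceding word
    if (prevWord === null) return null;              // "*" and exact-string both need a preceding word
    if (c.preCondition === "*" || c.preCondition === prevWord) {
      return prevWord + "'" + ending;
    }
    return null;                                     // exact-string precondition not met
  }
  return null;                                       // unknown ending
}

console.log(resolve("who", "d"), resolve(null, "tis"), resolve("g", "day"));
```

Adding a new contraction is then just a matter of appending another entry to the list.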
If you just process those conditions as I've explained, that should cover all of these 86 contractions (and more):
'tis 'twas ain't aren't can't ...
... g'day ...
How'll How's I'd I'll ...
might've ... needn't ...
she'll ...
they'll ... we'll ...
what's ... when's ...
where's who'd who's ...
... won't ... you'll ...
As a side note, don't forget about slang contractions that don't use apostrophes, such as "gotta" > "got to" and "gonna" > "going to".
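Since these have no apostrophe to key on, one simple option is a plain lookup table applied to word matches. The sketch below is only an assumption about how you might do it (the `SLANG` table and `expandSlang` name are hypothetical), and it contains just the two examples mentioned above.

```typescript
// Sketch: expanding apostrophe-less slang contractions with a lookup table.
const SLANG = new Map<string, string>([
  ["gotta", "got to"],
  ["gonna", "going to"],
]);

function expandSlang(word: string): string {
  // Case-insensitive lookup; unknown words are returned unchanged.
  return SLANG.get(word.toLowerCase()) ?? word;
}

console.log(expandSlang("gonna")); // "going to"
```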
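Here's the final AS3 code. Overall, you're looking at less than 50 lines of code to parse the text into alternating groups of word and non-word characters and to identify and merge contractions. Simple. The Boolean "isContraction" variable on the MatchedWord class is set in the code below whenever a contraction is identified.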
//Automatically merge known contractions
var conditions:Array = [
    ["d","*"], //if apostrophe d is found, include the preceding word string and count as successful contraction match
    ["l","*"],
    ["ll","*"],
    ["m","*"],
    ["re","*"],
    ["s","*"],
    ["t","*"],
    ["ve","*"],
    ["twas","!"], //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
    ["tis","!"],
    ["day","g"], //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
    ["night","g"]
];

//not a typo, intentionally stopping at next to last index to avoid a bounds check in the loop
for (var i:int = 0; i < matched_words.length - 1; i++)
{
    var m:MatchedWord = matched_words[i];
    var apostrophe_text:String = StringUtils.trim( m.text ); //check if this ends with an apostrophe first, then deal more closely with it
    if (!m.isWord && StringUtils.endsWith( apostrophe_text, "'" ))
    {
        var m_next:MatchedWord = matched_words[i + 1]; //no bounds check necessary, since loop intentionally stopped at next to last index
        //bounds check necessary for previous match, since we start at the beginning and may or may not need the prior match depending on the precondition
        var m_prev:MatchedWord = (i > 0) ? matched_words[i - 1] : null;
        for each (var condition:Array in conditions)
        {
            if (StringUtils.trim( m_next.text ) == condition[0])
            {
                var pre_condition:String = condition[1];
                switch (pre_condition)
                {
                    case "*": //success after one final check, include prior match, merge current and next match into prior match and delete current and next match
                        if (m_prev != null && apostrophe_text == "'") //EQUAL apostrophe, not just ENDS with apostrophe
                        {
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
                            matched_words.splice( i, 2 );
                            i--; //two entries were removed, so re-examine this index next (catches double contractions like "I'd've")
                        }
                        break;
                    case "!": //success after one final check, do not include prior match, merge current and next match, and delete next match
                        if (apostrophe_text == "'")
                        {
                            m.text += m_next.text;
                            m.isWord = true; //match now includes word text so flip it to a "word" block for logical consistency
                            m.isContraction = true;
                            matched_words.splice( i + 1, 1 );
                        }
                        else
                        {   //strip apostrophe off end and merge with next item, nothing needs to be deleted
                            //preserve spaces and match start indexes by manipulating untrimmed strings
                            var apostrophe_end:int = m.text.lastIndexOf( "'" );
                            var apostrophe_ending:String = m.text.substring( apostrophe_end, m.text.length );
                            m.text = m.text.substring( 0, m.text.length - apostrophe_ending.length ); //strip apostrophe and any trailing spaces
                            m_next.text = apostrophe_ending + m_next.text;
                            m_next.charIndex = m.charIndex + apostrophe_end;
                            m_next.isContraction = true;
                        }
                        break;
                    default: //conditional success, check prior match meets condition
                        if (m_prev != null && apostrophe_text == "'" && m_prev.text == pre_condition)
                        {
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
                            matched_words.splice( i, 2 );
                            i--; //same index adjustment as the "*" case
                        }
                        break;
                }
                break; //endings are unique, so stop at the first matching condition
            }
        }
    }
}