[Natural Language Processing with Python] 9.1 Grammatical Features

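To gain more flexibility, we change how we treat grammatical categories such as S, NP and V: instead of atomic labels, we decompose them into dictionary-like structures from which a range of values can be extracted as features.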

9.1 Grammatical Features

We start with a simple example, using dictionaries to store features and their values.

>>> kim = {'CAT': 'NP', 'ORTH': 'Kim', 'REF': 'k'}
>>> chase = {'CAT': 'V', 'ORTH': 'chased', 'REL': 'chase'}

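CAT is the grammatical category and ORTH is the spelling (orthography). REF gives the referent of kim, and REL gives the relation expressed by chase. In the context of rule-based grammars, such pairings of features and values are known as feature structures.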

We can add further features as needed:

>>> chase['AGT'] = 'sbj'
>>> chase['PAT'] = 'obj'

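AGT is the agent role, filled here by the subject; PAT is the patient role, filled here by the object. The strings 'sbj' and 'obj' are placeholders until a concrete sentence is processed.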

For example, suppose we want to process the sentence "Kim chased Lee".

>>> sent = "Kim chased Lee"
>>> tokens = sent.split()
>>> lee = {'CAT': 'NP', 'ORTH': 'Lee', 'REF': 'l'}
>>> def lex2fs(word):
...     # look up the feature structure whose spelling matches the word
...     for fs in [kim, lee, chase]:
...         if fs['ORTH'] == word:
...             return fs
...
>>> subj, verb, obj = lex2fs(tokens[0]), lex2fs(tokens[1]), lex2fs(tokens[2])
>>> verb['AGT'] = subj['REF']  # agent of 'chase' is Kim
>>> verb['PAT'] = obj['REF']   # patient of 'chase' is Lee
>>> for k in ['ORTH', 'REL', 'AGT', 'PAT']:  # check the feature structure of 'chase'
...     print("%-5s => %s" % (k, verb[k]))
...
ORTH  => chased
REL   => chase
AGT   => k
PAT   => l

The same approach works for other verbs, and more features can be added. For example:

>>> surprise = {'CAT': 'V', 'ORTH': 'surprised', 'REL': 'surprise',
...             'SRC': 'sbj', 'EXP': 'obj'}
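The role-filling step shown earlier for chase can be written once and reused for any verb. The following is only a minimal sketch: it assumes every verb entry marks its roles with the placeholder strings 'sbj' and 'obj', and the helper name fill_roles is hypothetical, not something NLTK provides.

```python
def fill_roles(verb, subj, obj):
    """Return a copy of verb with 'sbj'/'obj' placeholders replaced by referents."""
    referents = {'sbj': subj['REF'], 'obj': obj['REF']}
    # Any value that is not a placeholder is copied through unchanged.
    return {feature: referents.get(value, value) for feature, value in verb.items()}

kim = {'CAT': 'NP', 'ORTH': 'Kim', 'REF': 'k'}
lee = {'CAT': 'NP', 'ORTH': 'Lee', 'REF': 'l'}
surprise = {'CAT': 'V', 'ORTH': 'surprised', 'REL': 'surprise',
            'SRC': 'sbj', 'EXP': 'obj'}

filled = fill_roles(surprise, kim, lee)
print(filled['SRC'], filled['EXP'])  # k l
```

Because the function returns a copy, the lexical entry for the verb keeps its placeholders and can be reused for the next sentence.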

Syntactic Agreement

The morphological properties of the verb co-vary with the properties of the subject noun phrase. This co-variance is called agreement.

For example (an asterisk marks an ungrammatical string):

a. the dog runs
b. *the dog run
a. the dogs run
b. *the dogs runs

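We could handle this by refining the grammar, as in the example below, but note that this approach quickly becomes cumbersome.

The grammar before the change: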

(7) S -> NP VP
    NP -> Det N
    VP -> V
    Det -> 'this'
    N -> 'dog'
    V -> 'runs'

The grammar after the change:

(8) S -> NP_SG VP_SG
    S -> NP_PL VP_PL
    NP_SG -> Det_SG N_SG
    NP_PL -> Det_PL N_PL
    VP_SG -> V_SG
    VP_PL -> V_PL
    Det_SG -> 'this'
    Det_PL -> 'these'
    N_SG -> 'dog'
    N_PL -> 'dogs'
    V_SG -> 'runs'
    V_PL -> 'run'

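Covering number alone has already doubled the size of the grammar. To avoid this explosion in the number of productions, we can use attributes and constraints.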
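To get a feel for how fast the category-splitting approach grows, here is a rough sketch that expands rule templates once per combination of feature values. The function name expand and the extra PER (person) feature with three values are assumptions made for illustration only.

```python
from itertools import product

def expand(template_rules, features):
    """Expand each rule template once per combination of feature values."""
    names = sorted(features)
    expanded = []
    for combo in product(*(features[name] for name in names)):
        # e.g. ('sg',) -> 'SG', ('sg', '3') -> 'SG_3'
        suffix = '_'.join(value.upper() for value in combo)
        for lhs, rhs in template_rules:
            expanded.append((f'{lhs}_{suffix}', [f'{sym}_{suffix}' for sym in rhs]))
    return expanded

templates = [('NP', ['Det', 'N']), ('VP', ['V'])]
number_only = expand(templates, {'NUM': ['sg', 'pl']})
number_and_person = expand(templates, {'NUM': ['sg', 'pl'], 'PER': ['1', '2', '3']})
print(len(number_only))        # 4
print(len(number_and_person))  # 12
```

Each new feature multiplies the number of phrase-level rules by the number of values it can take, which is why keeping features as constraints on a single rule scales better.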

Using Attributes and Constraints

Det[NUM=sg] -> 'this'
Det[NUM=pl] -> 'these'
N[NUM=sg] -> 'dog'
N[NUM=pl] -> 'dogs'
V[NUM=sg] -> 'runs'
V[NUM=pl] -> 'run'

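We can improve the phrase-level rules with the variable ?n. It ranges over the values of NUM, and within a single production every occurrence of ?n must take the same value: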

S -> NP[NUM=?n] VP[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
VP[NUM=?n] -> V[NUM=?n]
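The effect of sharing ?n across a production can be imitated in a few lines of plain Python. This is a simplified sketch, not how NLTK's chart parser works: it only recognizes Det N V strings, and the names bind, accepts and LEXICON are made up for the illustration.

```python
LEXICON = {
    'this': ('Det', 'sg'), 'these': ('Det', 'pl'),
    'dog': ('N', 'sg'), 'dogs': ('N', 'pl'),
    'runs': ('V', 'sg'), 'run': ('V', 'pl'),
}

def bind(bindings, variable, value):
    """Bind variable to value; return None if it clashes with an earlier binding."""
    if bindings.get(variable, value) != value:
        return None
    return {**bindings, variable: value}

def accepts(sentence):
    """Accept Det N V strings whose NUM values all bind ?n consistently."""
    words = sentence.split()
    if any(word not in LEXICON for word in words):
        return False
    if [LEXICON[word][0] for word in words] != ['Det', 'N', 'V']:
        return False
    bindings = {}
    for word in words:
        bindings = bind(bindings, '?n', LEXICON[word][1])
        if bindings is None:
            return False
    return True

print(accepts('this dog runs'))   # True
print(accepts('these dogs run'))  # True
print(accepts('this dogs run'))   # False
```

The first word fixes the value of ?n, and every later word either confirms that value or causes the string to be rejected.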

Some words, however, are indifferent to number. There are two ways to express this, and the second is clearly simpler than the first.

First option:

Det[NUM=sg] -> 'the' | 'some' | 'several'
Det[NUM=pl] -> 'the' | 'some' | 'several'

Second option:

Det[NUM=?n] -> 'the' | 'some' | 'several'
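One way to picture an underspecified determiner is as an entry whose NUM value is simply absent. In the sketch below, None stands for "no constraint"; the table DET_NUM and the function det_agrees are hypothetical names used only for this illustration.

```python
# None means the determiner places no constraint on NUM.
DET_NUM = {'this': 'sg', 'these': 'pl', 'the': None, 'some': None, 'several': None}

def det_agrees(det, noun_num):
    """True if the determiner is unspecified for NUM or matches the noun's NUM."""
    det_num = DET_NUM[det]
    return det_num is None or det_num == noun_num

print(det_agrees('the', 'sg'), det_agrees('the', 'pl'))      # True True
print(det_agrees('these', 'sg'), det_agrees('these', 'pl'))  # False True
```

A single entry for 'the' covers both cases, which is the saving the second option offers over listing the word once per number value.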

The grammar below illustrates most of the ideas introduced so far in this chapter:

>>> import nltk
>>> nltk.data.show_cfg('grammars/book_grammars/feat0.fcfg')
% start S
# ###################
# Grammar Productions
# ###################
# S expansion productions
S -> NP[NUM=?n] VP[NUM=?n]
# NP expansion productions
NP[NUM=?n] -> N[NUM=?n]
NP[NUM=?n] -> PropN[NUM=?n]
NP[NUM=?n] -> Det[NUM=?n] N[NUM=?n]
NP[NUM=pl] -> N[NUM=pl]
# VP expansion productions
VP[TENSE=?t, NUM=?n] -> IV[TENSE=?t, NUM=?n]
VP[TENSE=?t, NUM=?n] -> TV[TENSE=?t, NUM=?n] NP
# ###################
# Lexical Productions
# ###################
Det[NUM=sg] -> 'this' | 'every'
Det[NUM=pl] -> 'these' | 'all'
Det -> 'the' | 'some' | 'several'
PropN[NUM=sg] -> 'Kim' | 'Jody'
N[NUM=sg] -> 'dog' | 'girl' | 'car' | 'child'
N[NUM=pl] -> 'dogs' | 'girls' | 'cars' | 'children'
IV[TENSE=pres, NUM=sg] -> 'disappears' | 'walks'
TV[TENSE=pres, NUM=sg] -> 'sees' | 'likes'
IV[TENSE=pres, NUM=pl] -> 'disappear' | 'walk'
TV[TENSE=pres, NUM=pl] -> 'see' | 'like'
IV[TENSE=past] -> 'disappeared' | 'walked'
TV[TENSE=past] -> 'saw' | 'liked'

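The following code shows how to parse a sentence.

If the grammar cannot analyze the input, trees will be empty; otherwise it will contain one or more parse trees, depending on whether the input is syntactically ambiguous.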

>>> tokens = 'Kim likes children'.split()
>>> from nltk import load_parser
>>> cp = load_parser('grammars/book_grammars/feat0.fcfg', trace=2)
>>> trees = list(cp.parse(tokens))  # NLTK 3; older releases used cp.nbest_parse(tokens)
|.Kim .like.chil.|
|[----]    .    .| PropN[NUM='sg'] -> 'Kim' *
|[----]    .    .| NP[NUM='sg'] -> PropN[NUM='sg'] *
|[---->    .    .| S[] -> NP[NUM=?n] * VP[NUM=?n] {?n: 'sg'}
|.    [----]    .| TV[NUM='sg', TENSE='pres'] -> 'likes' *
|.    [---->    .| VP[NUM=?n, TENSE=?t] -> TV[NUM=?n, TENSE=?t] * NP[] {?n: 'sg', ?t: 'pres'}
|.    .    [----]| N[NUM='pl'] -> 'children' *
|.    .    [----]| NP[NUM='pl'] -> N[NUM='pl'] *
|.    .    [---->| S[] -> NP[NUM=?n] * VP[NUM=?n] {?n: 'pl'}
|.    [---------]| VP[NUM='sg', TENSE='pres'] -> TV[NUM='sg', TENSE='pres'] NP[] *
|[==============]| S[] -> NP[NUM='sg'] VP[NUM='sg'] *

Finally, we can inspect the resulting parse trees:

>>> for tree in trees: print(tree)
(S[]
  (NP[NUM='sg'] (PropN[NUM='sg'] Kim))
  (VP[NUM='sg', TENSE='pres']
    (TV[NUM='sg', TENSE='pres'] likes)
    (NP[NUM='pl'] (N[NUM='pl'] children))))

Terminology

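Simple values such as sg and pl are usually called atomic. A special case of atomic values is boolean values, which just specify whether a property is true or false.

For example, the feature AUX marks whether a verb is an auxiliary.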

V[TENSE=pres, AUX=+] -> 'can'

Sometimes we can group agreement features together as distinct parts of one category, represented as the value of AGR.

Attribute value matrix (AVM):

[POS = N           ]
[                  ]
[AGR = [PER = 3   ]]
[      [NUM = pl  ]]
[      [GND = fem ]]
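An AVM like this maps naturally onto nested dictionaries, where the value of AGR is itself a feature structure. NLTK has its own FeatStruct class for this purpose; the plain-dict version below is only a sketch, and the helper name get_path is hypothetical.

```python
# The AVM above, written as a nested dictionary.
noun = {'POS': 'N', 'AGR': {'PER': 3, 'NUM': 'pl', 'GND': 'fem'}}

def get_path(fs, path):
    """Follow a sequence of feature names through nested feature structures."""
    for feature in path:
        fs = fs[feature]
    return fs

print(get_path(noun, ['POS']))         # N
print(get_path(noun, ['AGR', 'NUM']))  # pl
```

A sequence of feature names such as AGR then NUM is a feature path: it picks out a value inside an embedded structure.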

Once features can take complex values, the grammar can be restated so that agreement is checked on AGR as a whole:

S -> NP[AGR=?n] VP[AGR=?n]
NP[AGR=?n] -> PropN[AGR=?n]
VP[TENSE=?t, AGR=?n] -> Cop[TENSE=?t, AGR=?n] Adj
Cop[TENSE=pres, AGR=[NUM=sg, PER=3]] -> 'is'
PropN[AGR=[NUM=sg, PER=3]] -> 'Kim'
Adj -> 'happy'
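When ?n stands for a whole AGR structure, checking agreement means merging two nested structures and failing on any clash. The sketch below does this recursively over plain dictionaries. It is a simplified illustration rather than NLTK's unification algorithm, and the entry for 'are' is a hypothetical addition that does not appear in the grammar above.

```python
def unify(fs1, fs2):
    """Recursively merge two nested feature dicts; return None on any clash."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature not in result:
            result[feature] = value
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            merged = unify(result[feature], value)
            if merged is None:
                return None
            result[feature] = merged
        elif result[feature] != value:
            return None
    return result

kim = {'AGR': {'NUM': 'sg', 'PER': 3}}
cop_is = {'TENSE': 'pres', 'AGR': {'NUM': 'sg', 'PER': 3}}
cop_are = {'TENSE': 'pres', 'AGR': {'NUM': 'pl'}}  # hypothetical entry for 'are'

print(unify(kim, cop_is) is not None)  # True: 'Kim is happy' agrees
print(unify(kim, cop_are))             # None: NUM values clash
```

Because the merge descends into AGR, a clash on any single agreement feature is enough to rule the combination out, while features present on only one side are carried over.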