[Question title]: How to get the infinitive form of the verb using "stanza"?
[Posted]: 2020-12-08 15:17:06
[Question description]:

How can I find the infinitive verbs in a sentence using stanza?

Example:

import stanza

doc = "I need you to find the verbes in this sentence"
en_nlp = stanza.Pipeline('en', processors='tokenize,lemma,mwt,pos,depparse', verbose=False, use_gpu=False)
processed = en_nlp(doc)

# one line per word: id, text, universal POS tag, head id, head text, dependency relation
print(*[f"id: {word.id}\t word: {word.text}\t POS: {word.pos}\t head id: {word.head}\t head: {sent.words[word.head-1].text if word.head > 0 else 'root'} \t deprel: {word.deprel}" for sent in processed.sentences for word in sent.words], sep='\n')

Output:

id: 1    word: I     POS: PRON   head id: 2  head: need      deprel: nsubj
id: 2    word: need  POS: VERB   head id: 0  head: root      deprel: root
id: 3    word: you   POS: PRON   head id: 2  head: need      deprel: obj
id: 4    word: to    POS: PART   head id: 5  head: find      deprel: mark
id: 5    word: find  POS: VERB   head id: 2  head: need      deprel: xcomp
id: 6    word: the   POS: DET    head id: 7  head: verbes    deprel: det
id: 7    word: verbes    POS: NOUN   head id: 5  head: find      deprel: obj
id: 8    word: in    POS: ADP    head id: 10     head: sentence      deprel: case
id: 9    word: this  POS: DET    head id: 10     head: sentence      deprel: det
id: 10   word: sentence  POS: NOUN   head id: 5  head: find      deprel: obl

However, for this line:

id: 5 word: find POS: VERB head id: 2 head: need deprel: xcomp

I need to be able to tell that it is an infinitive verb.

[Question comments]:

    Tags: python stanford-nlp part-of-speech stanza


    [Answer 1]:

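    I had the same question, wasn't keen on hacking into the tokenizer, and ended up tweaking the stanza sentence.words instead.

    word.feats indicates the infinitive verb form here (VerbForm=Inf), as shown for id 7 below. I have not tested how reliable it is.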

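    import stanza
    # assumption: nlp is built like the pipeline in the question (the original snippet did not show it)
    nlp = stanza.Pipeline('en', processors='tokenize,lemma,mwt,pos,depparse', verbose=False, use_gpu=False)
    test_resp = "He was a little scared to knock on the door"
    res = nlp(test_resp)
    res.sentences[0].words[4:8]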
    

    gives this:

    [{
       "id": 5,
       "text": "scared",
       "lemma": "scared",
       "upos": "ADJ",
       "xpos": "JJ",
       "feats": "Degree=Pos",
       "head": 0,
       "deprel": "root",
       "misc": "start_char=16|end_char=22"
     },
     {
       "id": 6,
       "text": "to",
       "lemma": "to",
       "upos": "PART",
       "xpos": "TO",
       "head": 7,
       "deprel": "mark",
       "misc": "start_char=23|end_char=25"
     },
     {
       "id": 7,
       "text": "knock",
       "lemma": "knock",
       "upos": "VERB",
       "xpos": "VB",
       "feats": "VerbForm=Inf",
       "head": 5,
       "deprel": "advcl",
       "misc": "start_char=26|end_char=31"
     },
     {
       "id": 8,
       "text": "on",
       "lemma": "on",
       "upos": "ADP",
       "xpos": "IN",
       "head": 10,
       "deprel": "case",
       "misc": "start_char=32|end_char=34"
     }]
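
    The feats check described above can be sketched as a small filter. This is a minimal illustration, not a tested recipe: the helper names `is_infinitive` and `infinitive_verbs` are made up for this example, the choice to accept both VERB and AUX tags is an assumption, and the words are stand-in objects that only mimic the `text`, `upos`, and `feats` attributes seen in the output above.

```python
from types import SimpleNamespace


def is_infinitive(word):
    """Return True if the word's feats string contains VerbForm=Inf."""
    feats = getattr(word, "feats", None)
    if not feats:
        # words without morphological features have no feats value
        return False
    # feats is a "|"-separated list of Key=Value pairs
    return "VerbForm=Inf" in feats.split("|")


def infinitive_verbs(words, upos_tags=("VERB", "AUX")):
    """Collect the text of every word tagged as a verb in infinitive form."""
    return [w.text for w in words if w.upos in upos_tags and is_infinitive(w)]


# Stand-ins copied from the output above (hypothetical objects, not stanza Words)
words = [
    SimpleNamespace(text="scared", upos="ADJ", feats="Degree=Pos"),
    SimpleNamespace(text="to", upos="PART", feats=None),
    SimpleNamespace(text="knock", upos="VERB", feats="VerbForm=Inf"),
    SimpleNamespace(text="on", upos="ADP", feats=None),
]

print(infinitive_verbs(words))  # ['knock']
```

    With a real parse, the same function would presumably be called on `sentence.words` for each sentence in the document; how reliable the VerbForm feature is across sentences remains untested here as well.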
    

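    For my purposes it was more useful to treat the string "to verb" as a single lexical item, updating word.text to "to_verb" and the verb's character span to match. This leaves the verb's word.lemma and word.upos (VERB) unchanged, but the head and word position indexes of the verb and of the words after it have to be decremented to account for the removed "to".

    The deepcopy protects the original parse for the sake of the illustration; it is probably best avoided where possible.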

    import re
    import sys
    from copy import deepcopy

    import stanza

    def patch_inf_verb(processed):
        """Hack the parse to treat 'to VERB' as one word.

        Written against stanza 1.1.1, where word.misc holds the character span.
        """

        # work on a copy so the original parse is left untouched
        results = deepcopy(processed)

        # regex that captures the text and numerals in word.misc,
        # e.g., 'start_char=11|end_char=13'
        misc_vals_re = re.compile(r"(start_char=)(\d+)(\|end_char=)(?P<end>\d+)")

        for result in results.sentences:
            for wdx, word in enumerate(result.words):

                # peek back for "to"
                if wdx > 0 and word.pos == "VERB":
                    one_back = result.words[wdx - 1]
                    if one_back.text.lower() == "to" and one_back.head == word.id:

                        word.text = "to_" + word.text
                        # word.upos = "VERB_INF"  # update upos tag or leave as is

                        # parse the character span strings of the verb and of "to"
                        verb_span = misc_vals_re.match(word.misc)
                        to_span = misc_vals_re.match(one_back.misc)
                        assert verb_span is not None and to_span is not None
                        vals = verb_span.groups()

                        # move the verb's start_char back to the start of "to"
                        word.misc = f"{vals[0]}{to_span.group(2)}{vals[2]}{vals[3]}"
                        assert misc_vals_re.match(word.misc) is not None

                        # decrement the indexes for verb position and beyond,
                        # the character spans don't change
                        for other in result.words:
                            if other.id > wdx:
                                other.id -= 1
                            if other.head > wdx:
                                other.head -= 1

                        # remove the "to" token; the loop then skips the word right
                        # after the verb, which is harmless since it does not follow "to"
                        del result.words[wdx - 1]
        return results
    
    def format_results(results):
        """results in table format"""
        results_str = '\n'.join(
            [
                "\t".join(
                        [
                            f"{key:5s}: {val}" 
                            for key, val in word.to_dict().items() 
                            if key not in ["lemma", "feats"]
                        ]
                    )
                    for sent in results.sentences 
                    for word in sent.words
                ]
            )
        return results_str
    
    

    The OP's example:

    print("python", sys.version)
    print("stanza version:", stanza.__version__)
    
    doc = "I need you to find the verbes in this sentence"
    en_nlp = stanza.Pipeline('en', processors='tokenize,lemma,mwt,pos,depparse', verbose=False, use_gpu=False)
    processed = en_nlp(doc)
    
    print('OP stanza before\n', format_results(processed))
    
    patched_to_verb = patch_inf_verb(processed)
    print("after patch_inf_verb\n", format_results(patched_to_verb))
    

    python 3.7.7 (default, Mar 26 2020, 15:48:22) 
    [GCC 7.3.0]
    stanza version: 1.1.1
    OP stanza before
     id   : 1   text : I    upos : PRON xpos : PRP  head : 2    deprel: nsubj   misc : start_char=0|end_char=1
    id   : 2    text : need upos : VERB xpos : VBP  head : 0    deprel: root    misc : start_char=2|end_char=6
    id   : 3    text : you  upos : PRON xpos : PRP  head : 2    deprel: obj misc : start_char=7|end_char=10
    id   : 4    text : to   upos : PART xpos : TO   head : 5    deprel: mark    misc : start_char=11|end_char=13
    id   : 5    text : find upos : VERB xpos : VB   head : 2    deprel: xcomp   misc : start_char=14|end_char=18
    id   : 6    text : the  upos : DET  xpos : DT   head : 7    deprel: det misc : start_char=19|end_char=22
    id   : 7    text : verbes   upos : NOUN xpos : NNS  head : 5    deprel: obj misc : start_char=23|end_char=29
    id   : 8    text : in   upos : ADP  xpos : IN   head : 10   deprel: case    misc : start_char=30|end_char=32
    id   : 9    text : this upos : DET  xpos : DT   head : 10   deprel: det misc : start_char=33|end_char=37
    id   : 10   text : sentence upos : NOUN xpos : NN   head : 5    deprel: obl misc : start_char=38|end_char=46
    after patch_inf_verb
     id   : 1   text : I    upos : PRON xpos : PRP  head : 2    deprel: nsubj   misc : start_char=0|end_char=1
    id   : 2    text : need upos : VERB xpos : VBP  head : 0    deprel: root    misc : start_char=2|end_char=6
    id   : 3    text : you  upos : PRON xpos : PRP  head : 2    deprel: obj misc : start_char=7|end_char=10
    id   : 4    text : to_find  upos : VERB xpos : VB   head : 2    deprel: xcomp   misc : start_char=11|end_char=18
    id   : 5    text : the  upos : DET  xpos : DT   head : 6    deprel: det misc : start_char=19|end_char=22
    id   : 6    text : verbes   upos : NOUN xpos : NNS  head : 4    deprel: obj misc : start_char=23|end_char=29
    id   : 7    text : in   upos : ADP  xpos : IN   head : 9    deprel: case    misc : start_char=30|end_char=32
    id   : 8    text : this upos : DET  xpos : DT   head : 9    deprel: det misc : start_char=33|end_char=37
    id   : 9    text : sentence upos : NOUN xpos : NN   head : 4    deprel: obl misc : start_char=38|end_char=46
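
    If the parse should stay unmodified, the same "to" + verb pattern can also be read off the dependency relations. The following is only a sketch under stated assumptions: `find_to_infinitives` is a hypothetical helper, the words are stand-in objects carrying the `id`, `text`, `upos`, `head`, and `deprel` values from the "before" table above, and the rule (a "to" whose deprel is mark and whose head is a VERB) is inferred from that one example rather than tested broadly.

```python
from types import SimpleNamespace


def find_to_infinitives(words):
    """Return (to_word, verb_word) pairs where "to" marks a verb."""
    by_id = {w.id: w for w in words}
    pairs = []
    for w in words:
        if w.text.lower() == "to" and w.deprel == "mark":
            # head ids are 1-based; 0 means root and has no entry
            head = by_id.get(w.head)
            if head is not None and head.upos == "VERB":
                pairs.append((w, head))
    return pairs


# Stand-ins for the first five words of the OP sentence (hypothetical objects)
rows = [
    (1, "I", "PRON", 2, "nsubj"),
    (2, "need", "VERB", 0, "root"),
    (3, "you", "PRON", 2, "obj"),
    (4, "to", "PART", 5, "mark"),
    (5, "find", "VERB", 2, "xcomp"),
]
words = [
    SimpleNamespace(id=i, text=t, upos=u, head=h, deprel=d)
    for i, t, u, h, d in rows
]

pairs = find_to_infinitives(words)
print([(to.text, verb.text) for to, verb in pairs])  # [('to', 'find')]
```

    Unlike patch_inf_verb, this lookup does not require "to" to sit directly before the verb, and it leaves ids, heads, and character spans as the parser produced them.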
    

    [Comments]:
