How to extract chunks from BIO chunked sentences? - Python answers

【Question title】: How to extract chunks from BIO chunked sentences? - python
【Posted】: 2015-11-26 18:58:19
【Question】:

Given an input sentence that has BIO chunk tags:

[('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]

I need to extract the relevant phrases. For example, to extract 'NP', I need the tuple fragments tagged B-NP and I-NP.

[out]:

[('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]

(Note: the numbers in the extracted tuples are the token indices.)
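To make the tag format concrete: each tag is either a bare 'O' (outside any chunk) or a boundary marker ('B' for the first token of a chunk, 'I' for a token inside one) joined to a chunk type with a hyphen. Here is a minimal sketch of splitting a tag into those two parts; `split_bio` is a hypothetical helper name, not part of the question's code.

```python
def split_bio(tag):
    """Split a BIO tag such as 'B-NP' into (boundary, chunk_type)."""
    boundary, sep, chunk_type = tag.partition('-')
    if not sep:
        # A bare 'O' tag carries no chunk type.
        return tag, None
    return boundary, chunk_type


print(split_bio('B-NP'))  # ('B', 'NP')
print(split_bio('I-NP'))  # ('I', 'NP')
print(split_bio('O'))     # ('O', None)
```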

I tried to extract them with the following code:

def extract_chunks(tagged_sent, chunk_type):
    current_chunk = []
    current_chunk_position = []
    for idx, word_pos in enumerate(tagged_sent):
        word, pos = word_pos
        if '-'+chunk_type in pos: # Append the word to the current_chunk.
            current_chunk.append((word))
            current_chunk_position.append((idx))
        else:
            if current_chunk: # Flush the full chunk when out of an NP.
                _chunk_str = ' '.join(current_chunk) 
                _chunk_pos_str = '-'.join(map(str, current_chunk_position))
                yield _chunk_str, _chunk_pos_str 
                current_chunk = []
                current_chunk_position = []
    if current_chunk: # Flush the last chunk.
        yield ' '.join(current_chunk), '-'.join(map(str, current_chunk_position))


tagged_sent = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
print (list(extract_chunks(tagged_sent, chunk_type='NP')))

But when I have adjacent chunks of the same type:

tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'),  ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]

print (list(extract_chunks(tagged_sent, chunk_type='NP')))

it outputs this:

[('The Mitsubishi Electric Company Managing Director', '0-1-2-3-4-5'), ('ramen', '7')]

instead of the desired:

[('The Mitsubishi Electric Company', '0-1-2-3'), ('Managing Director', '4-5'), ('ramen', '7')]

How can I fix this in the code above?

Apart from fixing the code above, is there a better way to extract the desired chunks of a specific chunk_type?
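For reference, the merge happens because the test `'-'+chunk_type in pos` is true for both B-NP and I-NP, so the loop never notices that a B- tag opens a new chunk. Below is a minimal sketch of one possible fix that keeps the generator shape of the code above. The name `extract_chunks_fixed` is made up for illustration, and treating a stray I- tag with no open chunk as the start of a new chunk is an assumption, not something the question specifies.

```python
def extract_chunks_fixed(tagged_sent, chunk_type):
    begin, inside = 'B-' + chunk_type, 'I-' + chunk_type
    words, positions = [], []
    for idx, (word, tag) in enumerate(tagged_sent):
        continues = tag == inside and bool(words)
        if words and not continues:
            # Anything other than a continuing I- tag closes the open chunk,
            # including a B- tag of the same type.
            yield ' '.join(words), '-'.join(map(str, positions))
            words, positions = [], []
        if tag in (begin, inside):
            words.append(word)
            positions.append(idx)
    if words:
        # Flush the last chunk.
        yield ' '.join(words), '-'.join(map(str, positions))


tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'),
               ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'),
               ('ate', 'B-VP'), ('ramen', 'B-NP')]
print(list(extract_chunks_fixed(tagged_sent, 'NP')))
```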

【Comments】:

Tags: python list nlp text-parsing text-chunking

【Solution 1】:

Try this. It extracts chunks of any given type along with the indices of their words.

def extract_chunks(tagged_sent, chunk_type='NP'):
    out_sen = []
    for idx, word_pos in enumerate(tagged_sent):
        word, bio = word_pos
        # Split e.g. 'B-NP' into boundary 'B' and tag 'NP'; a bare 'O' has no boundary.
        boundary, tag = bio.split("-", 1) if "-" in bio else ('', 'O')
        if tag != chunk_type:
            continue
        if boundary == "I" and out_sen:
            # Continue the most recent chunk of this type.
            out_sen[-1][0] += " " + word
            out_sen[-1][-1] += "-" + str(idx)
        else:
            # 'B' (or an 'I' with no chunk to continue) starts a new chunk.
            out_sen.append([word, str(idx)])
    return out_sen

Demo:

>>> tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'),  ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]
>>> output_sent = extract_chunks(tagged_sent)
>>> print(list(map(tuple, output_sent)))
[('The Mitsubishi Electric Company', '0-1-2-3'), ('Managing Director', '4-5'), ('ramen', '7')]
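If chunks of every type are wanted in a single pass, the same B/I logic can collect them into a dictionary keyed by chunk type. This is only a sketch along the lines of the function above; `extract_all_chunks` is a hypothetical name, and it assumes an I- tag continues a chunk only when the immediately preceding token belongs to a chunk of the same type.

```python
from collections import defaultdict


def extract_all_chunks(tagged_sent):
    chunks = defaultdict(list)
    open_type = None  # type of the chunk the previous token belonged to
    for idx, (word, tag) in enumerate(tagged_sent):
        boundary, _, chunk_type = tag.partition('-')
        if boundary == 'I' and chunk_type == open_type:
            # Continue the chunk that the previous token was part of.
            chunks[chunk_type][-1][0] += ' ' + word
            chunks[chunk_type][-1][1] += '-' + str(idx)
        elif boundary in ('B', 'I') and chunk_type:
            # 'B' (or a stray 'I') starts a new chunk of this type.
            chunks[chunk_type].append([word, str(idx)])
            open_type = chunk_type
        else:
            # 'O' or an unrecognised tag closes any open chunk.
            open_type = None
    return {t: [tuple(c) for c in cs] for t, cs in chunks.items()}


tagged_sent = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'),
               ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'),
               ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
print(extract_all_chunks(tagged_sent))
```

Looking up `result['NP']` then gives the same list the single-type function returns for 'NP'.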

【Discussion】:

【Solution 2】:

def extract_chunks(tagged_sent, chunk_type):
    grp1, grp2, chunk_type = [], [], "-" + chunk_type
    for ind, (s, tp) in enumerate(tagged_sent):
        if tp.endswith(chunk_type):
            if not tp.startswith("B"):
                grp2.append(str(ind))
                grp1.append(s)
            else:
                if grp1:
                    yield " ".join(grp1), "-".join(grp2)
                grp1, grp2 = [s], [str(ind)]
    if grp1:  # flush the last chunk; skip it when nothing matched
        yield " ".join(grp1), "-".join(grp2)

Output:

In [2]: l = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'),
   ...:                ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]

In [3]: list(extract_chunks(l, "NP"))
Out[3]: 
[('The Mitsubishi Electric Company', '0-1-2-3'),
 ('Managing Director', '4-5'),
 ('ramen', '7')]

In [4]: l = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]

In [5]: list(extract_chunks(l, "NP"))
Out[5]: [('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]

【Discussion】:

【Solution 3】:

I would do it like this:

import re

def extract_chunks(tagged_sent, chunk_type):
    # compiles the expression we want to match
    regex = re.compile(chunk_type)

    # filters matched (word, tag) items into a dictionary keyed by token index
    first_step = {index_: tag for index_, tag in enumerate(tagged_sent) if regex.findall(tag[1])}

    # builds list of lists following output format
    second_step = []
    for key_ in sorted(first_step.keys()):
        word, bio = first_step[key_]
        # extends the previous chunk only if this token is adjacent to it
        # and does not open a new chunk with a 'B-' tag
        if second_step and not bio.startswith('B-') and int(second_step[-1][1].split('-')[-1]) == key_ - 1:
            second_step[-1][0] += ' {0}'.format(word)
            second_step[-1][1] += '-{0}'.format(key_)
        else:
            second_step.append([word, str(key_)])

    # builds output in final format
    return [tuple(item) for item in second_step]

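For better performance, you can adapt it to use a generator rather than building the entire output in memory and then restructuring it as I do here (I was in a hurry, so the code is far from optimal).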
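As a rough illustration of the generator idea, here is one possible lazy variant built on `itertools.groupby`. It is a sketch, not the answer's own code: `iter_chunks` and `keyed_tokens` are made-up names, and it compares tags exactly against 'B-' + chunk_type and 'I-' + chunk_type instead of using a regex.

```python
from itertools import groupby


def iter_chunks(tagged_sent, chunk_type):
    begin, inside = 'B-' + chunk_type, 'I-' + chunk_type

    def keyed_tokens():
        chunk_id = 0
        for idx, (word, tag) in enumerate(tagged_sent):
            if tag != inside:
                # Every token that is not a continuing I- tag opens a new group.
                chunk_id += 1
            # Key: (group number, whether the token has the wanted chunk type).
            yield (chunk_id, tag in (begin, inside)), idx, word

    for (_, wanted), group in groupby(keyed_tokens(), key=lambda item: item[0]):
        if wanted:
            tokens = list(group)
            yield (' '.join(word for _, _, word in tokens),
                   '-'.join(str(idx) for _, idx, _ in tokens))


tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'),
               ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'),
               ('ate', 'B-VP'), ('ramen', 'B-NP')]
for chunk in iter_chunks(tagged_sent, 'NP'):
    print(chunk)
```

Only one chunk is held in memory at a time, so this also works when `tagged_sent` is itself a generator.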

Hope this helps!

【Discussion】: