How to group overlapping ranges of substrings? Answers

【Question title】: How to group overlapping ranges of substrings?
【Posted】: 2022-12-24 22:38:22
【Question description】:

I have a list of dictionaries in the following format:

ldict = [
{'start_offset': 0, 'end_offset': 10, 'string_type': 'verb'},
{'start_offset': 5, 'end_offset': 15, 'string_type': 'noun'},
{'start_offset': 20, 'end_offset': 30, 'string_type': 'noun'},
{'start_offset': 42, 'end_offset': 51, 'string_type': 'adj'},
{'start_offset': 45, 'end_offset': 52, 'string_type': 'noun'}
]

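start_offset and end_offset are the start and end positions of a substring within a string.

My goal is to merge overlapping substrings into a single row. The merged start_offset is the lowest start position and the merged end_offset is the highest end position.

Example output: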

ldict = [
{'start_offset': 0, 'end_offset': 15, 'string_type': ['verb', 'noun']},
{'start_offset': 20, 'end_offset': 30, 'string_type': ['noun']},
{'start_offset': 42, 'end_offset': 52, 'string_type': ['adj', 'noun']}
]

My attempt:

import pandas as pd

final = []
for row in ldict:
    i1 = pd.Interval(row['start_offset'], row['end_offset'])
    semi_fin_list = []
    # Compare this row against every row, including itself.
    for one_row in ldict:
        i2 = pd.Interval(one_row['start_offset'], one_row['end_offset'])
        if i1.overlaps(i2):
            semi_fin_list.append(one_row)
    final.append(semi_fin_list)

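With the attempt above I can find the overlaps for a single row, but I'm stuck on the next step: how to sort and combine the rows so that each merged row appears only once.

How can I achieve this? My attempt isn't complete yet, because I still get duplicates.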

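【Question comments】:

Iterate over the list, compare the offsets in the dictionaries, and group the overlaps.
I'm stuck. I did try a nested for loop that compares each row with all the others, but I get duplicate rows and don't know how to sort them out.
@nifeco, please add your code to the question.
@martineau I'm only asking for help; you don't need to be rude. I didn't add my code because I felt it was wrong and there might be a better approach I don't know about.
@OlvinRoght Writing up the code I tried takes time, because I'm working on a remote desktop where I can't copy and paste.

Tags: python python-3.x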

【Solution 1】:

You can sort by start_offset before merging:

ldict = [
    {'start_offset': 0, 'end_offset': 10, 'string_type': 'verb'},
    {'start_offset': 5, 'end_offset': 15, 'string_type': 'noun'},
    {'start_offset': 20, 'end_offset': 30, 'string_type': 'noun'},
    {'start_offset': 42, 'end_offset': 51, 'string_type': 'adj'},
    {'start_offset': 45, 'end_offset': 52, 'string_type': 'noun'},
]
sorted_ldict = sorted(ldict, key=lambda d: d['start_offset'])
merged_ldict = [
    {
        'start_offset': sorted_ldict[0]['start_offset'],
        'end_offset': sorted_ldict[0]['end_offset'],
        'string_type': [sorted_ldict[0]['string_type']],
    }
]
for d in sorted_ldict[1:]:
    if d['start_offset'] > merged_ldict[-1]['end_offset']:
        merged_ldict.append(
            {
                'start_offset': d['start_offset'],
                'end_offset': d['end_offset'],
                'string_type': [d['string_type']],
            }
        )
    else:
        # Overlap: extend the last merged range if this one ends later.
        merged_ldict[-1]['end_offset'] = max(
            merged_ldict[-1]['end_offset'], d['end_offset']
        )
        if d['string_type'] not in merged_ldict[-1]['string_type']:
            merged_ldict[-1]['string_type'].append(d['string_type'])
print(merged_ldict)

Output:

[
     {'start_offset': 0, 'end_offset': 15, 'string_type': ['verb', 'noun']}, 
     {'start_offset': 20, 'end_offset': 30, 'string_type': ['noun']}, 
     {'start_offset': 42, 'end_offset': 52, 'string_type': ['adj', 'noun']}
]

Note: consider using something like a dataclass instead of raw dictionaries.
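As a rough illustration of that note, here is a minimal sketch of the same sort-then-merge idea built on a dataclass. The names `Span`, `overlaps`, `absorb` and `merge_spans` are made up for this example and do not come from the answer; ranges that merely touch are treated as overlapping, mirroring the `>` comparison used above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A (possibly merged) offset range plus the string types it covers."""
    start_offset: int
    end_offset: int
    string_types: list = field(default_factory=list)

    def overlaps(self, other: "Span") -> bool:
        # Touching ranges (start == end) count as overlapping here.
        return (other.start_offset <= self.end_offset
                and self.start_offset <= other.end_offset)

    def absorb(self, other: "Span") -> None:
        # Widen this span to cover `other` and collect its types once each.
        self.start_offset = min(self.start_offset, other.start_offset)
        self.end_offset = max(self.end_offset, other.end_offset)
        for t in other.string_types:
            if t not in self.string_types:
                self.string_types.append(t)


def merge_spans(rows):
    """Sort the input dicts by start_offset and merge overlapping ones."""
    spans = sorted(
        (Span(r['start_offset'], r['end_offset'], [r['string_type']])
         for r in rows),
        key=lambda s: s.start_offset,
    )
    merged = []
    for span in spans:
        if merged and merged[-1].overlaps(span):
            merged[-1].absorb(span)
        else:
            merged.append(span)
    return merged
```

Calling `merge_spans(ldict)` on the list from the question should give three `Span` objects covering 0 to 15, 20 to 30 and 42 to 52.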

【Comments】:

【Solution 2】:

ldict = [
    {'start_offset': 0, 'end_offset': 10, 'string_type': 'verb'},
    {'start_offset': 5, 'end_offset': 15, 'string_type': 'noun'},
    {'start_offset': 20, 'end_offset': 30, 'string_type': 'noun'},
    {'start_offset': 42, 'end_offset': 51, 'string_type': 'adj'},
    {'start_offset': 45, 'end_offset': 52, 'string_type': 'noun'}
]

string_type = []
new_ldict = []
i = 0
while i < len(ldict):
    start_offset = ldict[i]['start_offset']
    end_offset = ldict[i]['end_offset']
    string_type = [ldict[i]['string_type']]
    # Extend the current group while the next range starts inside it.
    # Assumes ldict is already sorted by start_offset.
    while i + 1 < len(ldict) and ldict[i + 1]['start_offset'] <= end_offset:
        # max() prevents a nested range from shrinking the merged end.
        end_offset = max(end_offset, ldict[i + 1]['end_offset'])
        string_type.append(ldict[i + 1]['string_type'])
        i += 1

    new_ldict.append({'start_offset': start_offset, 'end_offset': end_offset, 'string_type': string_type})
    i += 1
print(new_ldict)

Output:

[{'start_offset': 0, 'end_offset': 15, 'string_type': ['verb', 'noun']}, {'start_offset': 20, 'end_offset': 30, 'string_type': ['noun']}, {'start_offset': 42, 'end_offset': 52, 'string_type': ['adj', 'noun']}]

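【Comments】:

【Solution 3】:

You only need to iterate over ldict and compare the previous item's 'end_offset' with the current item's 'start_offset'. Assuming your ldict is sorted by 'start_offset', you can use the following code: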

res = []
for d in ldict:
    if not res or d['start_offset'] > last['end_offset']:
        last = {**d, 'string_type': [d['string_type']]}
        res.append(last)
    else:
        # max() keeps the merged end correct when d lies entirely inside last.
        last['end_offset'] = max(last['end_offset'], d['end_offset'])
        last['string_type'].append(d['string_type'])

If your ldict is not sorted, sort it first:

from operator import itemgetter

...

ldict = sorted(ldict, key=itemgetter('start_offset'))

Output:

[
    {'start_offset': 0, 'end_offset': 15, 'string_type': ['verb', 'noun']},
    {'start_offset': 20, 'end_offset': 30, 'string_type': ['noun']},
    {'start_offset': 42, 'end_offset': 52, 'string_type': ['adj', 'noun']}
]
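
One edge case the sample data does not exercise is a range that sits entirely inside an earlier one. The sketch below wraps the sort and the loop in a single function (the name `merge_ranges` and the `sample` data are invented for this illustration) and uses max() so that a nested range cannot shrink the merged end.

```python
from operator import itemgetter


def merge_ranges(rows):
    """Merge dicts whose start_offset/end_offset ranges overlap."""
    res = []
    for d in sorted(rows, key=itemgetter('start_offset')):
        if not res or d['start_offset'] > res[-1]['end_offset']:
            # No overlap with the last group: start a new one.
            res.append({**d, 'string_type': [d['string_type']]})
        else:
            last = res[-1]
            # max() keeps the end from shrinking when d is nested in last.
            last['end_offset'] = max(last['end_offset'], d['end_offset'])
            last['string_type'].append(d['string_type'])
    return res


# Hypothetical input: unsorted, with 5-10 nested inside 0-30.
sample = [
    {'start_offset': 40, 'end_offset': 45, 'string_type': 'adj'},
    {'start_offset': 0, 'end_offset': 30, 'string_type': 'verb'},
    {'start_offset': 5, 'end_offset': 10, 'string_type': 'noun'},
    {'start_offset': 25, 'end_offset': 35, 'string_type': 'noun'},
]
print(merge_ranges(sample))
```

With a plain assignment instead of max(), the nested 5-10 range would pull the merged end back to 10, and the 25-35 range would then be split off into its own group.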

【Comments】: