How to include selected substrings? Answers

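【Question title】: How to include selected substrings?
【Posted】: 2018-10-09 02:59:40
【Question description】:

I am searching for a target text inside a large string. My code finds the text in the string and displays the 40 characters before it and the 40 characters after it. Instead, I would like to display the 2 sentences before and the 2 sentences after the target text. My code: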

import re

sentence = "In addition, participation in life situations can be somewhat impaired because of communicative disabilities associated with the disorder and parents’ lack of resources for overcoming this aspect of the disability (i.e. communication devices). The attitudes of service providers are also important. The Australian Rett syndrome research program is based on a biopsychosocial model which integrates aspects of both medical and social models of disability and functioning. The investigation of environmental factors such as equipment and support available to individuals and families and the social capital of the communities in which they live is likely to be integral to understanding the burden of this disorder. The program will use the ICF framework to identify those factors determined to be most beneficial and cost effective in optimising health, function and quality of life for the affected child and her family."

sub = "biopsychosocial model"

def find_all_substrings(string, sub):
    starts = [match.start() for match in re.finditer(re.escape(sub), string.lower())]
    return starts 

substrings = find_all_substrings(sentence, sub)
for pos in substrings: print(sentence[pos-40:pos+40])

How can I display 2 sentences before and 2 sentences after the target text?

【Comments】:

Tags: python string

【Solution 1】:

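You can first split the text into sentences, then find every sentence (and its index) that contains the substring you are looking for. Then slice the list of sentences around each sentence that was found.

Here is an example (using nltk.tokenize.sent_tokenize):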

from nltk.tokenize import sent_tokenize

text = "In addition, participation in life situations can be somewhat impaired because of communicative disabilities associated with the disorder and parents’ lack of resources for overcoming this aspect of the disability (i.e. communication devices). The attitudes of service providers are also important. The Australian Rett syndrome research program is based on a biopsychosocial model which integrates aspects of both medical and social models of disability and functioning. The investigation of environmental factors such as equipment and support available to individuals and families and the social capital of the communities in which they live is likely to be integral to understanding the burden of this disorder. The program will use the ICF framework to identify those factors determined to be most beneficial and cost effective in optimising health, function and quality of life for the affected child and her family."
sentences = sent_tokenize(text)

sub = "biopsychosocial model"
matching_indices = [i for i, sentence in enumerate(sentences) if sub in sentence]

n_sent_padding = 1
displayed_sentences = [
    # max(..., 0) keeps a match near the start from producing a negative slice index
    ' '.join(sentences[max(i - n_sent_padding, 0):i + n_sent_padding + 1])
    for i in matching_indices
]

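This finds the index of every sentence that contains the substring (stored in matching_indices). displayed_sentences then holds each matching sentence together with the sentences before and after it (how many is set by n_sent_padding).

displayed_sentences is then: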

['The attitudes of service providers are also important. The Australian Rett syndrome research program is based on a biopsychosocial model which integrates aspects of both medical and social models of disability and functioning. The investigation of environmental factors such as equipment and support available to individuals and families and the social capital of the communities in which they live is likely to be integral to understanding the burden of this disorder.']
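
The question asked for two sentences on each side, and the same idea can be wrapped in a small function. The sketch below is only an illustration, not part of the original answer: the names split_sentences and sentences_around are hypothetical, and a naive regex splitter stands in for sent_tokenize so that the example runs without nltk or its data files. It also matches case-insensitively and clamps the start of the slice at 0.

```python
import re

def split_sentences(text):
    # Naive stand-in for nltk's sent_tokenize (an assumption for this sketch):
    # break after '.', '!' or '?' when followed by whitespace and a capital letter.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

def sentences_around(text, sub, n_padding=2):
    """Return each sentence containing `sub` with n_padding sentences on each side."""
    sentences = split_sentences(text)
    results = []
    for i, sentence in enumerate(sentences):
        if sub.lower() in sentence.lower():
            start = max(i - n_padding, 0)  # avoid a negative slice index
            results.append(' '.join(sentences[start:i + n_padding + 1]))
    return results

sample = "One. Two. Three has the target. Four. Five. Six."
print(sentences_around(sample, "target"))
# ['One. Two. Three has the target. Four. Five.']
```

Swapping split_sentences for sent_tokenize should give the same behaviour with better sentence boundaries, at the cost of the nltk dependency.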

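Note how nltk splits sentences: it sometimes does so in odd ways (for example, splitting at the period in "Mr."). There is a separate post on how to tweak the sentence tokenizer.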
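
To make the abbreviation problem concrete, here is a minimal sketch of one way to handle it. It does not use nltk's tokenizer settings: it is a plain-regex illustration, and the ABBREVIATIONS set and the split_sentences helper are hypothetical names chosen for this example. The idea is to split naively first and then glue back together any piece that ended in a known abbreviation.

```python
import re

# Hypothetical abbreviation list for illustration; extend it for your own text.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof."}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    # Naive split: after '.', '!' or '?' followed by whitespace and a capital letter.
    pieces = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    sentences = []
    for piece in pieces:
        # If the previous piece ended with a known abbreviation, the split
        # was a false positive, so merge this piece back onto it.
        if sentences and sentences[-1].split()[-1].lower() in abbreviations:
            sentences[-1] += ' ' + piece
        else:
            sentences.append(piece)
    return sentences

print(split_sentences("Mr. Smith met Dr. Jones. They talked."))
# ['Mr. Smith met Dr. Jones.', 'They talked.']
```

A fixed list like this only covers the abbreviations you anticipate, which is the trade-off against a trained tokenizer.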

【Comments】:

Your answer is much simpler than what I had in mind...