Sentence selection around particular words: answers

【Question title】: Sentence selection surrounded to a particular words
【Posted】: 2021-01-26 11:19:11
【Question】:

Suppose I have a paragraph:

Str_wrds ="Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs."

and the following Test_wrds:

Test_wrds = ['Power curve', 'data-driven','wind turbines']

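Whenever one of the Test_wrds is found in the paragraph, I want to select that sentence along with one sentence before and one after, and collect them as a separate string. For example, the Test_wrds entry Power curve appears in the first sentence, but the second sentence we select also contains Power curve, so the output would be something like this: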

Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.

Similarly, I want to slice out the sentences for data-driven and wind turbines and save them in separate strings.

How can I do this in a simple way using Python?

So far, I have only found code that removes the whole sentence whenever any of the Test_wrds appears in it.

def remove_sentence(Str_wrds, Test_wrds):
    # Keep only the sentences that contain none of the test words.
    return ".".join(sentence for sentence in Str_wrds.split(".")
                    if not any(word in sentence for word in Test_wrds))

But I don't understand how to use it for my problem.

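Question update: basically, whenever one of the Test_wrds appears in the paragraph, I want to slice out that sentence together with one sentence before and one after it, and save them in a single string. So for the three Test_wrds, for example, I expect to get three strings, each covering the sentences around the corresponding word. I am attaching a PDF showing the kind of output I am looking for.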

【Comments】:

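Hi, I don't understand what you mean by this part. Could you rephrase it? Thanks. "Whenever one of the Test_wrds is found in the paragraph, I want to select that sentence along with one sentence before and one after, and collect them as a separate string. For example, the Test_wrds entry Power curve appears in the first sentence, but the second sentence we select also contains Power curve, so the output would be something like this"
"Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty." Your output should be this.
Did you mean to tag spacy rather than scapy?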

Tags: python text nlp nltk scapy

【Solution 1】:

You can define a function like this:

def find_sentences( word, text ):
    sentences = text.split('.')
    findings = []
    for i in range(len(sentences)):
        if word.lower() in sentences[i].lower():
            if i==0:
                findings.append( sentences[i+1]+'.' )
            elif i==len(sentences)-1:
                findings.append( sentences[i-1]+'.' )
            else:
                findings.append( sentences[i-1]+'.' + sentences[i+1]+'.' )
    return findings

It can be called like this:

findings = find_sentences( 'Power curve', Str_wrds )

With some pretty printing

for finding in findings:
    print( finding +'\n')

we get the result

However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.

Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. Data-driven model accuracy is significantly affected by uncertainty.

The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.

The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines..

I hope this is what you were looking for :)
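The function above returns only the neighbouring sentences, and splitting on '.' leaves an empty trailing element, which is where the doubled period in the last result comes from. A minimal sketch of a variant that also keeps the matching sentence itself is shown here. The name find_with_context, the window parameter, and the regex-based sentence split are illustrative assumptions, not part of the original answer, and the split will misfire on abbreviations such as "e.g.".

```python
import re

def find_with_context(word, text, window=1):
    """Return one string per matching sentence: the match plus `window` neighbours on each side."""
    # Naive sentence split (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    findings = []
    for i, sentence in enumerate(sentences):
        # Case-insensitive substring match, as in the answer above.
        if word.lower() in sentence.lower():
            start = max(i - window, 0)  # clamp at the first sentence
            findings.append(' '.join(sentences[start:i + window + 1]))
    return findings

sample = "Alpha is first. Beta has the key. Gamma follows. Delta ends."
print(find_with_context('key', sample))
```

Slicing with a clamped start index handles the first and last sentence without separate branches, because a Python slice that runs past the end of the list is simply truncated.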

【Comments】:

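I am also looking for the sentence that contains the test word. So it goes like this: 1 sentence before > the main sentence containing the word > 1 sentence after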

【Solution 2】:

When you say,

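Whenever one of the Test_wrds is found in the paragraph, I want to select that sentence along with one sentence before and one after, and collect them as a separate string.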

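I assume you mean that every sentence containing one of the words in Test_wrds should be selected, along with the sentence before it and the sentence after it.

Function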

def remove_sentence(Str_wrds: str, Test_wrds):
    # store all selected sentences
    all_selected_sentences = {}
    # initialize empty dictionary
    for k in Test_wrds:
        # one element for each occurrence
        all_selected_sentences[k] = [''] * Str_wrds.lower().count(k.lower())

    # list of sentences
    sentences = Str_wrds.split(".")

    word_counter = {}.fromkeys(Test_wrds,0)

    for i, sentence in enumerate(sentences):
        for j, word in enumerate(Test_wrds):
            # case insensitive
            if word.lower() in sentence.lower():
                if i == 0:  # first sentence
                    chosen_sentences = sentences[0:2]
                elif i == len(sentences) - 1:  # last sentence
                    chosen_sentences = sentences[-2:]
                else:
                    chosen_sentences = sentences[i - 1:i + 2]

                # get which occurrence of the word is it
                k = word_counter[word]

                all_selected_sentences[word][k] += '.'.join(
                    [s for s in chosen_sentences
                        if s not in all_selected_sentences[word][k]]) + "."

                word_counter[word] += 1  # increment the word counter

    return all_selected_sentences

Running this

answer = remove_sentence(Str_wrds, Test_wrds)
print(answer)

with the values given for Str_wrds and Test_wrds returns this output:

{
    'Power curve': [
        'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height.',
        'Power curve, supplied by turbine manufacturers, are extensively used in condition monitoring, energy estimation, and improving operational efficiency. However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty.',
        ' The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated. This study proposes two approaches, namely, pointwise CIs and simultaneous CIs, to measure the uncertainty associated with an SVM-based power curve model. A radial basis function is taken as the kernel function to improve the accuracy of the SVM models.',
        ' The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
    ],
    'data-driven': [
        ' However, there is substantial uncertainty linked to power curve measurements as they usually take place only at hub height. Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods.',
        ' Data-driven model accuracy is significantly affected by uncertainty. Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression.',
        ' Therefore, an accurate estimation of uncertainty gives the confidence to wind farm operators for improving performance/condition monitoring and energy forecasting activities that are based on data-driven methods. The support vector machine (SVM) is a data-driven, machine learning approach, widely used in solving problems related to classification and regression. The uncertainty associated with models is quantified using confidence intervals (CIs), which are themselves estimated.'
    ],
    'wind turbines': [
        ' A radial basis function is taken as the kernel function to improve the accuracy of the SVM models. The proposed techniques are then verified by extensive 10 min average supervisory control and data acquisition (SCADA) data, obtained from pitch-controlled wind turbines. The results suggest that both proposed techniques are effective in measuring SVM power curve uncertainty, out of which, pointwise CIs are found to be the most accurate because they produce relatively smaller CIs.'
    ]
}

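Notes:

The function returns a dict of lists.
Each key is one of the words in Test_wrds, and each list element corresponds to one occurrence of that word.
For example, since "Power curve" occurs 4 times in the whole text (matched case-insensitively), the value for 'Power curve' in the output is a list of 4 elements.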
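If a single string per word is preferred over one list element per occurrence, the overlapping windows can be merged by collecting sentence indices in a set. This is only a sketch under stated assumptions: the name context_per_word, the window parameter, and the regex-based sentence split are hypothetical and not part of the function above.

```python
import re

def context_per_word(text, words, window=1):
    """Map each word to a single string: its matching sentences plus their neighbours."""
    # Naive sentence split (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    result = {}
    for word in words:
        keep = set()  # indices of sentences to include for this word
        for i, sentence in enumerate(sentences):
            if word.lower() in sentence.lower():
                lo = max(i - window, 0)
                hi = min(i + window, len(sentences) - 1)
                keep.update(range(lo, hi + 1))
        # Sorted indices preserve the original order; the set removes duplicates.
        result[word] = ' '.join(sentences[i] for i in sorted(keep))
    return result

sample = "Alpha is first. Beta has the key. Gamma follows. Delta ends. Epsilon has a key too."
print(context_per_word(sample, ['key', 'alpha']))
```

Calling context_per_word(Str_wrds, Test_wrds) would give a dict with one string per entry of Test_wrds. Because the windows are merged, a sentence shared by two occurrences appears only once in that string.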

【Comments】:

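Include not only the sentences before and after, but also the sentence that contains the Test_wrds entry. So it goes like this: 1 sentence before > the main sentence containing the word > 1 sentence after
Can this be simplified with less code? Maybe using a list comprehension or another library
I ran your code in Spyder and got three "answer" lists, which basically contain repeated sentences.
I updated my question with more clarification, please take a look. In short, I want three strings for the three Test_wrds. So the first string would contain the sentence before, the sentence after, and the main sentence for 'Power curve'. I don't want all of the Test_wrds used at the same time; they should be used one at a time. Hope this makes sense
Your updated code is close to what I am looking for, but the results are shown in a single string. What I am looking for is a separate string for each of the Test_wrds. I mean 'power_curve' = answer 1, 'wind turbines' = answer 2, and so on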