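【Question title】: Remove specific sentences from string
【Posted】: 2020-09-18 12:05:16
【Question】:

I have a string in the following format (lines containing 3 or more consecutive spaces, and the lines that fall between such lines, are part of tabular data):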

Some Sentence
Some sentence


Balance at January 1,                                $421            $51
Additions based on tax positions related to the

current year                                                    4        34         9

Additions based on acquisitions                           -       -       2
Additions based on tax positions related to prior

years                                                    21       13     374
Reductions for tax positions of prior years                (54)     (43)      -

Some paragraph
Some paragraph

Balance at January 1,                                $421            $51
Additions based on tax positions related to the

current year                                                    4        34         9

Additions based on acquisitions                           -       -       2
Additions based on tax positions related to prior

years                                                    21       13     374
Reductions for tax positions of prior years                (54)     (43)      -

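I need to remove every line containing 3 or more consecutive spaces from the string, while keeping the actual paragraph content.

Below is my approach. It does not give me the exact result, and I also do not like having to use range(5):

import re

# result holds the input string shown above
for i in range(5):
    result = re.sub('[\\n-].* {3,}.*\\n', '', result)
print(result)

Output of my logic: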

Some Sentence
Some sentence


Additions based on tax positions related to the
Additions based on tax positions related to prior



Some paragraph
Some paragraph

Additions based on tax positions related to the
Additions based on tax positions related to prior


Expected output:

Some Sentence
Some sentence


Some paragraph
Some paragraph



Is there any way to also remove the lines that sit between the lines with 3 or more spaces?

【Comments】:

    Tags: python python-3.x


    【Solution 1】:
    sentences = """
    Some Sentence
    Some sentence
    
    
    Additions based on tax positions related to the
    Additions based on tax positions related to prior
    
    
    
    Some paragraph
    Some paragraph
    
    Additions based on tax positions related to the
    Additions based on tax positions related to prior
    """
    
    # Split the text into individual lines
    splitted_sentences = sentences.split('\n')
    
    # Keep only lines with fewer than 3 whitespace-separated words
    only_short_sentences = [line for line in splitted_sentences if len(line.split()) < 3]
    short_sentences_str = '\n'.join(only_short_sentences)
    print(short_sentences_str)
    

    Output:

    Some Sentence
    Some sentence
    
    
    
    
    
    Some paragraph
    Some paragraph
    

    If you want to discard empty lines, change the list comprehension to this version:

    only_short_sentences = [line for line in splitted_sentences if len(line.split()) < 3 and line]
    

    Is this the expected result?

    Edited

    Input:

    sentences = """
    Some Sentence
    Some sentence
    
    
    Balance at January 1,                                $421            $51
    Additions based on tax positions related to the
    
    current year                                                    4        34         9
    
    Additions based on acquisitions                           -       -       2
    Additions based on tax positions related to prior
    
    years                                                    21       13     374
    Reductions for tax positions of prior years                (54)     (43)      -
    
    Some paragraph
    Some paragraph
    
    Balance at January 1,                                $421            $51
    Additions based on tax positions related to the
    
    current year                                                    4        34         9
    
    Additions based on acquisitions                           -       -       2
    Additions based on tax positions related to prior
    
    years                                                    21       13     374
    Reductions for tax positions of prior years                (54)     (43)      -
    """
    

    Output:

    Some Sentence
    Some sentence
    
    
    
    
    
    
    Some paragraph
    Some paragraph
    

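    【Comments】:

    • No @Yossi Levi, please refer to the expected result in the question. Line breaks should also be preserved in the output.
    • Fixed that part. Is it as expected now? I edited the answer to include a version that keeps empty lines as well as one that discards them.
    • I tried the solution after replacing "Some sentence" with real sentences and "Some paragraph" with real paragraphs, but it did not work.
    • I edited the answer so that you can see the output for real sentences. Take the example you provided in the post and try it. It works perfectly; note that all the other sentences have more than 3 words (more precisely, they have more than 2 spaces).
    • Hi @Yossi Levi, I agree that your code works for the given input, but could you try it with the string here: regex101.com/r/hOrp78/1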
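The word-count filter above breaks down once real paragraphs contain three or more words per line. A minimal sketch of a different heuristic follows, keyed on the rule stated in the question (3 or more consecutive spaces marks a table row). It is not taken from either answer: the function name `drop_table_lines` and the rule for detecting wrapped labels are assumptions made for illustration. A non-blank line is treated as a wrapped table label only when the line directly above it is a table row and the next non-blank line is also a table row.

```python
import re

# Assumption: a "table row" is any line containing 3+ consecutive spaces
# once leading and trailing whitespace has been stripped.
TABLE_ROW = re.compile(r" {3,}")


def drop_table_lines(text):
    """Remove table rows and wrapped labels sitting between them (heuristic)."""
    lines = text.split("\n")
    is_row = [bool(TABLE_ROW.search(line.strip())) for line in lines]

    kept = []
    for i, line in enumerate(lines):
        if is_row[i]:
            continue  # the line itself is a table row
        if line.strip() and i > 0 and is_row[i - 1]:
            # Find the next non-blank line, if there is one.
            nxt = next(
                (j for j in range(i + 1, len(lines)) if lines[j].strip()),
                None,
            )
            if nxt is not None and is_row[nxt]:
                continue  # wrapped label between two table rows
        kept.append(line)  # blank lines and ordinary text are preserved
    return "\n".join(kept)


sample = """Some Sentence
Some sentence

Balance at January 1,                                $421            $51
Additions based on tax positions related to the

current year                                                    4        34         9

Additions based on acquisitions                           -       -       2

Some paragraph
Some paragraph
"""

print(drop_table_lines(sample))
```

Blank lines are kept as they are, since the comments ask for line breaks to be preserved. The heuristic can misfire on a paragraph line that directly follows a table row with no blank line in between, so it should be checked against the real input before being relied on.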
    【Solution 2】:

    There is a simple regular expression for this (I have put your input into a file, "test.txt"):

    grep -v " .* .* " test.txt
    

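    As you can see, it is just three spaces with ".*" between them, where ".*" stands for "any character, repeated an unknown number of times (possibly zero)".
    Oh, before I forget: "-v" means "do not show the matching lines in the results".

    You apparently know the Python re library, so you probably know how to embed this regular expression in your Python source code.

    Good luck

    【Comments】:

    • Hi @Dominique, I know that the Python alternative to grep is re.findall, but I do not know how to incorporate -v from the answer. Can you help?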
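One possible way to mirror `grep -v` in Python is to test each line with `re.search` and keep only the lines that do not match. The sketch below is an illustration rather than part of the original answer; the helper name `grep_v` and the short sample string are made up for the example.

```python
import re


def grep_v(pattern, text):
    """Return the lines of `text` that do NOT match `pattern`, like grep -v."""
    regex = re.compile(pattern)
    return "\n".join(
        line for line in text.splitlines() if not regex.search(line)
    )


text = "Some Sentence\nBalance at January 1,      $421     $51\nSome paragraph"

# Same pattern as in the grep command: three spaces with ".*" between them.
print(grep_v(r" .* .* ", text))
```

Note that `" .* .* "` matches any line containing three spaces anywhere, so an ordinary four-word sentence would be dropped as well. If only runs of consecutive spaces should count, as in the question, `grep_v(r" {3,}", text)` may be the closer fit.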