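How to remove numbers that may occur at the end of words in a text

【Question title】: How can I remove numbers that may occur at the end of words in a text
【Posted】: 2019-05-29 06:39:46
【Question】:

I have text data that I want to clean using regular expressions. However, some words in the text are immediately followed by numbers that I want to remove.

For example, one line of the text is: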

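Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons learnt from the RUPES project12 Payment for environmental service and it potential and example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams policy and programmes17 Chapter Creating incentive for Tri An watershed protection20 Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32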

The first word in the text above should be 'Preface' rather than 'Preface2', and so on.

line = re.sub(r"[A-Za-z]+(\d+)", "", line)

However, this removes the words as well, as seen here:

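Pes Lessons learnt from the RUPES Payment for environmental service and it potential and example in Chapter Integrating payment for ecosystem service into Vietnams policy and Chapter Creating incentive for Tri An watershed Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong district of Hoa Binh province Chapter 5 Local revenue sharing Nha Trang Bay Marine Protected Area Synthesis and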

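How can I capture only the numbers that immediately follow a word?

【Question comments】:

Tags: python regex regex-group

【Solution 1】:

You can capture the text part and replace the whole match with the captured part. Simply write: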

re.sub(r"([A-Za-z]+)\d+", r"\1", line)

【Comments】:

Can you explain what r"\1" does?
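
To illustrate the point raised in this comment: in the replacement string, `\1` is a backreference that stands for whatever the first parenthesised group matched. The following is a minimal sketch on a shortened sample string chosen for illustration, not the full text from the question:

```python
import re

sample = "Preface2 Contributors4 Park 24"  # shortened, illustrative input

# The letters are captured as group 1; the digits are matched but not captured.
# Replacing the whole match with r"\1" puts back only the letters.
cleaned = re.sub(r"([A-Za-z]+)\d+", r"\1", sample)
print(cleaned)  # Preface Contributors Park 24

# With an empty replacement the whole match, letters included, is dropped,
# which is what happened in the question.
dropped = re.sub(r"[A-Za-z]+(\d+)", "", sample)
print(repr(dropped))  # '  Park 24'
```

The standalone "24" survives in both cases because it is separated from "Park" by a space, so the pattern never matches it.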

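【Solution 2】:

You can try a lookbehind assertion to check for a word character before the digits. Use a word boundary (\b) to force the regex to match only digits at the end of a word: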

re.sub(r'(?<=\w)\d+\b', '', line)

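Hope this helps.

Edit: Sorry about the glitch mentioned in the comments, where digits that do not follow a word were matched as well. That is because (sorry again) \w matches alphanumeric characters, not just alphabetic ones. Depending on what you want to remove, you can use the positive version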

re.sub(r'(?<=[a-zA-Z])\d+\b', '', line)

which only checks for English alphabetic characters before the digits (you can add characters to the [a-zA-Z] list), or the negative version

re.sub(r'(?<![\d\s])\d+\b', '', line)

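which matches digits preceded by anything that is not \d (a digit) or \s (whitespace). Note, however, that this also matches digits that follow punctuation.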
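
The two lookbehind variants can be compared side by side. This is a small sketch on a made-up sample string. It also checks one related detail: Python's built-in `re` module only accepts fixed-width lookbehind, so a variable-width one such as `(?<=\w+)` is rejected with `re.error`.

```python
import re

sample = "Preface2 Park 24 (12) protection20"  # illustrative input

# Positive lookbehind: the digits must directly follow an ASCII letter.
positive = re.sub(r"(?<=[a-zA-Z])\d+\b", "", sample)
print(positive)  # Preface Park 24 (12) protection

# Negative lookbehind: the digits must not follow a digit or whitespace,
# so digits after punctuation such as "(" are removed too.
negative = re.sub(r"(?<![\d\s])\d+\b", "", sample)
print(negative)  # Preface Park 24 () protection

# The built-in re module requires fixed-width lookbehind patterns.
try:
    re.compile(r"(?<=\w+)\d+\b")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)  # False
```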

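【Comments】:

This works in most cases. However, it also removes numbers that are not attached to a word, i.e. that are separated from it by a space.
Sorry, I edited my answer and shortened the old part so that it is still readable.

【Solution 3】:

Try this: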

# Use ONE of the following lines; they are alternatives, not successive steps.
line = re.sub(r"([A-Za-z]+)(\d+)", "\\2", line)  # keep just the number
line = re.sub(r"([A-Za-z]+)(\d+)", "\\1", line)  # keep just the word
line = re.sub(r"([A-Za-z]+)(\d+)", r"\2", line)  # same as the first one
line = re.sub(r"([A-Za-z]+)(\d+)", r"\1", line)  # same as the second one

\\1 refers to the captured word and \\2 to the captured number. See: How to use python regex to replace using captured group?
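
The two spellings of the replacement denote the same string: in a normal literal the backslash has to be doubled, while a raw literal keeps it as written. A quick sketch to check this, with a sample word chosen for illustration. The `\g<1>` form at the end is the `re` module's syntax for an unambiguous group reference.

```python
import re

# Both literals are a backslash followed by the digit 1.
same_string = "\\1" == r"\1"
print(same_string)  # True

pattern = r"([A-Za-z]+)(\d+)"
word_only = re.sub(pattern, r"\1", "Preface2")
number_only = re.sub(pattern, r"\2", "Preface2")
print(word_only)    # Preface
print(number_only)  # 2

# \g<1> is useful when the group reference is followed by a literal digit,
# because r"\10" would be read as a reference to group 10.
tagged = re.sub(pattern, r"\g<1>0", "Preface2")
print(tagged)  # Preface0
```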

【Comments】:

【Solution 4】:

Below I propose a code sample that can solve your problem.

Here is the snippet:

import re

# I will write a function that takes the test data as input and returns the
# desired result as stated in your question.

def transform(data):
    """Strip the trailing digits from words that end with a number."""
    # First, let's construct a pattern matching the words we're looking for.
    pattern1 = r"([A-Za-z]+\d+)"

    # Let's construct another pattern matching the trailing digits, which are
    # removed from each of those words in the final output.
    pattern2 = r"\d+$"

    # Find all matching words.
    matches = re.findall(pattern1, data)

    # Construct the list of replacements, one for each matched word.
    replacements = []
    for match in matches:
        replacements.append(re.sub(pattern2, '', match))

    # Intermediate variable holding (word, replacement) tuples for use with
    # the string method 'replace'.
    changers = zip(matches, replacements)

    # Now replace every matched word in turn. str.replace returns a new
    # string, so the result has to be assigned back.
    output = data
    for changer in changers:
        output = output.replace(*changer)

    # The work is done, we can return the result.
    return output

For testing purposes, we run the above function on your test data:

data = """
Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons     
learnt from the RUPES project12 Payment for environmental service and it potential and 
example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams 
policy and programmes17 Chapter Creating incentive for Tri An watershed protection20 
Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter 
Building payment mechanism for carbon sequestration in forestry a pilot project in Cao 
Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang 
Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
"""

result = transform(data)

print(result)

The result looks like this:

Preface Contributors Abrreviations Acknowledgements Pes terminology Lessons learnt from 
the RUPES project Payment for environmental service and it potential and example in 
Vietnam Chapter Integrating payment for ecosystem service into Vietnams policy and 
programmes Chapter Creating incentive for Tri An watershed protection Chapter 
Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building 
payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong 
district of Hoa Binh province Vietnam Chapter 5 Local revenue sharing Nha Trang Bay 
Marine Protected Area Vietnam Synthesis and Recommendations References
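
One caveat with the replace-based approach is worth checking: `str.replace` substitutes every occurrence of the substring, including occurrences inside longer tokens. The sketch below condenses the idea into a hypothetical helper, `strip_by_replace` (a name made up for this comparison), and sets it against a single `re.sub` pass on a contrived input. It does not affect the test data above, where no matched word occurs inside another.

```python
import re

def strip_by_replace(data):
    """Condensed version of the find-then-str.replace approach."""
    output = data
    for match in re.findall(r"[A-Za-z]+\d+", data):
        output = output.replace(match, re.sub(r"\d+$", "", match))
    return output

def strip_by_sub(data):
    """Single-pass alternative: keep the letters, drop the digits."""
    return re.sub(r"([A-Za-z]+)\d+", r"\1", data)

contrived = "a1 ba12"  # "a1" also occurs inside "ba12"
print(strip_by_replace(contrived))  # a ba2
print(strip_by_sub(contrived))      # a ba
```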

【Comments】:

【Solution 5】:

You can also use a character range of digits:

re.sub(r"[0-9]", "", line)
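
Note that this pattern removes every digit in the line, not only those attached to a word, so standalone numbers disappear as well. A short sketch on an illustrative sample:

```python
import re

sample = "Preface2 Park 24 Chapter 5"  # illustrative input

# Every digit is deleted, wherever it appears.
stripped = re.sub(r"[0-9]", "", sample)
print(repr(stripped))  # 'Preface Park  Chapter '
```

The leftover double space and trailing space show where "24" and "5" used to be.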

【Comments】: