【Question Title】: How can I remove numbers that may occur at the end of words in a text
【Posted】: 2019-05-29 06:39:46
【Question Description】:

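I have text data that I want to clean with regular expressions. However, some words in the text are immediately followed by numbers that I want to remove.

For example, one line of the text is: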

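Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons learnt from the RUPES project12 Payment for environmental service and it potential and example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams policy and programmes17 Chapter Creating incentive for Tri An watershed protection20 Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32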

The first word in the text above should be 'preface' rather than 'preface2', and so on.

line = re.sub(r"[A-Za-z]+(\d+)", "", line)

However, this removes the words as well, as seen here:

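Pes Lessons learnt from the RUPES Payment for environmental service and it potential and example in Chapter Integrating payment for ecosystem service into Vietnams policy and Chapter Creating incentive for Tri An watershed Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong district of Hoa Binh province Chapter 5 Local revenue sharing Nha Trang Bay Marine Protected Area Synthesis and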

How can I capture only the numbers that immediately follow a word?

【Question Discussion】:

    Tags: python regex regex-group


    【Solution 1】:

    You can capture the text part and replace the whole match with the captured part. Simply write:

    re.sub(r"([A-Za-z]+)\d+", r"\1", line)
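A minimal sketch of why this works, assuming a shortened, hypothetical sample line: `re.sub` replaces the entire match, so the question's pattern deletes the word together with its digits, while capturing the letters and writing them back with `r"\1"` drops only the digits.

```python
import re

# Hypothetical sample, shortened from the text in the question.
sample = "Preface2 Contributors4 Pes terminology10 Chapter 5 Park 24"

# With the question's pattern the whole match is "Preface2";
# only the digits sit in group 1, so replacing with "" deletes the word too.
m = re.search(r"[A-Za-z]+(\d+)", sample)
whole_match, digits = m.group(0), m.group(1)

# Here group 1 holds the letters and the digits are matched but not captured.
# r"\1" in the replacement writes group 1 back, so only the digits disappear.
cleaned = re.sub(r"([A-Za-z]+)\d+", r"\1", sample)
print(cleaned)
```

Standalone numbers such as the ones in "Chapter 5" and "Park 24" are left alone because the pattern requires the digits to follow a letter directly.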
    

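    【Discussion】:

    • Can you explain what r"\1" does?
    【Solution 2】:

    You can try a lookbehind assertion to check that a word character precedes the digits, combined with a word boundary (\b) so that the regex only matches digits at the end of a word. Python's re module requires a lookbehind to be fixed-width, so it checks a single character:

    re.sub(r'(?<=\w)\d+\b', '', line)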
    

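    Hope this helps.

    Edit: Sorry about the glitch mentioned in the comments, where digits that do not follow a word are matched as well. That happens because (sorry again) \w matches alphanumeric characters, not just letters. Depending on what you want to remove, you can use the positive version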

    re.sub(r'(?<=[a-zA-Z])\d+\b', '', line)
    

    which only checks for an English letter before the digits (you can add characters to the [a-zA-Z] list), or the negative version

    re.sub(r'(?<![\d\s])\d+\b', '', line)
    

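    which matches when whatever precedes the digits is neither \d (a digit) nor \s (whitespace). This also matches digits that follow punctuation, though.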
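The differences between these lookbehind variants can be checked with a small comparison. This is only a sketch: the sample string and the dictionary keys are made up for illustration, and the last check simply confirms that Python's `re` module rejects a variable-width lookbehind such as `(?<=\w+)`.

```python
import re

# Hypothetical sample with attached digits, standalone numbers and punctuation.
sample = "Preface2 Park 24 Chapter 5 (see p.12) Vietnam26"

patterns = {
    "word_char": r"(?<=\w)\d+\b",         # any alphanumeric character before the digits
    "letter": r"(?<=[a-zA-Z])\d+\b",      # an English letter before the digits
    "not_digit_space": r"(?<![\d\s])\d+\b",  # anything except a digit or whitespace
}
results = {name: re.sub(p, "", sample) for name, p in patterns.items()}

for name, text in results.items():
    print(f"{name:16} {text}")

# Lookbehinds must be fixed-width in the re module, so \w+ is rejected.
try:
    re.compile(r"(?<=\w+)\d+\b")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
```

In this sample the `\w` version eats the last digit of "24" and "12" (a digit counts as a word character), the letter version touches only digits glued to a word, and the negative version also strips the "12" after the period.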

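    【Discussion】:

    • This works in most cases. However, it also removes numbers that are not attached to a word, i.e. that are separated by whitespace.
    • Sorry, I edited my answer and shortened the old part so that it is still readable.
    【Solution 3】:

    Try this: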

    line = re.sub(r"([A-Za-z]+)(\d+)", "\\2", line) #just keep the number    
    line = re.sub(r"([A-Za-z]+)(\d+)", "\\1", line) #just keep the word
    line = re.sub(r"([A-Za-z]+)(\d+)", r"\2", line) #same as first one    
    line = re.sub(r"([A-Za-z]+)(\d+)", r"\1", line) #same as second one
    

    \\1 refers to the captured word and \\2 to the captured number. See: How to use python regex to replace using captured group?
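As a quick check of the equivalence between the escaped and raw-string forms, here is a small sketch on a made-up sample. It also shows the `\g<1>` spelling of a group reference, and what happens if the backslash is neither doubled nor in a raw string.

```python
import re

# Hypothetical sample line.
sample = "Vietnam16 Chapter 5"
pattern = r"([A-Za-z]+)(\d+)"

keep_word_escaped = re.sub(pattern, "\\1", sample)   # doubled backslash
keep_word_raw = re.sub(pattern, r"\1", sample)       # raw string, same replacement
keep_word_g = re.sub(pattern, r"\g<1>", sample)      # explicit group reference
keep_number = re.sub(pattern, r"\2", sample)         # keep the digits instead

# Without r"" or a doubled backslash, "\1" is the single character chr(1),
# so the match is replaced by that control character, not by group 1.
not_a_backreference = re.sub(pattern, "\1", sample)

print(keep_word_raw)
print(keep_number)
```

The `\g<1>` form is handy when the group reference is followed by a literal digit in the replacement, where `\1` followed by a digit would be read as a different group number.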

    【Discussion】:

      【Solution 4】:

      Below is a code sample that should solve your problem.

      Here is the snippet:

      import re
      
      # A function that takes the test data as input and returns the
      # desired result as stated in your question.
      
      def transform(data):
          """Strip trailing digits from words that end with a number."""
          # First, construct a pattern matching the words we're looking for.
          pattern1 = r"([A-Za-z]+\d+)"
      
          # Construct another pattern that matches the trailing digits to be
          # removed from each matched word.
          pattern2 = r"\d+$"
      
          # Find all matching words.
          matches = re.findall(pattern1, data)
      
          # Build the replacement for each word by stripping its trailing digits.
          replacements = []
          for match in matches:
              replacements.append(re.sub(pattern2, '', match))
      
          # Pair each word with its replacement for use with the string
          # method 'replace'.
          changers = zip(matches, replacements)
      
          # Apply each replacement in turn. str.replace returns a new string,
          # so the result must be assigned back.
          output = data
          for changer in changers:
              output = output.replace(*changer)
      
          # The work is done, we can return the result.
          return output
      

      For testing purposes, we run the function above on your test data:

      data = """
      Preface2 Contributors4 Abrreviations5 Acknowledgements8 Pes terminology10 Lessons     
      learnt from the RUPES project12 Payment for environmental service and it potential and 
      example in Vietnam16 Chapter Integrating payment for ecosystem service into Vietnams 
      policy and programmes17 Chapter Creating incentive for Tri An watershed protection20 
      Chapter Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter 
      Building payment mechanism for carbon sequestration in forestry a pilot project in Cao 
      Phong district of Hoa Binh province Vietnam26 Chapter 5 Local revenue sharing Nha Trang 
      Bay Marine Protected Area Vietnam28 Synthesis and Recommendations30 References32
      """
      
      result = transform(data)
      
      print(result)
      

      The result looks like this:

      Preface Contributors Abrreviations Acknowledgements Pes terminology Lessons learnt from 
      the RUPES project Payment for environmental service and it potential and example in 
      Vietnam Chapter Integrating payment for ecosystem service into Vietnams policy and 
      programmes Chapter Creating incentive for Tri An watershed protection Chapter 
      Sustainable financing for landscape beauty in Bach Ma National Park 24 Chapter Building 
      payment mechanism for carbon sequestration in forestry a pilot project in Cao Phong 
      district of Hoa Binh province Vietnam Chapter 5 Local revenue sharing Nha Trang Bay 
      Marine Protected Area Vietnam Synthesis and Recommendations References
      

      【Discussion】:

        【Solution 5】:

        You can also use a character range that matches digits:

        re.sub(r"[0-9]", "", line)
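One caveat worth checking before using this: a bare digit class removes every digit in the text, including standalone numbers. A minimal sketch on a made-up sample, compared with the capture-group approach from Solution 1:

```python
import re

# Hypothetical sample line.
sample = "Vietnam26 Chapter 5 Local revenue sharing"

# Removes every digit, so the standalone "5" disappears as well
# and leaves a double space behind.
all_digits_removed = re.sub(r"[0-9]", "", sample)

# Removes only digits that directly follow letters.
only_trailing_removed = re.sub(r"([A-Za-z]+)\d+", r"\1", sample)

print(repr(all_digits_removed))
print(repr(only_trailing_removed))
```

If chapter numbers such as "Chapter 5" need to survive, the second form is probably the safer choice.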
        

        【Discussion】:
