【Question title】: Add a single space and comma between words that are connected using regex
【Posted】: 2020-07-25 16:32:15
【Question】:

I have a nested list, list_3, that looks like this:

[['Company OverviewCompany: HowSector: SoftwareYear Founded: 2010One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more togetherUniversity Affiliation(s): Duke$ Raised: $240,000Investors: Friends & familyTraction to Date: 10% of monthly active users (MAU) are also active weekly'], [['Company OverviewCompany: GrubSector: SoftwareYear Founded: 2018One Sentence Pitch: Find food you likeUniversity Affiliation(s): Stanford$ Raised: $340,000Investors: Friends & familyTraction to Date: 40% of monthly active users (MAU) are also active weekly']]]

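I want to use a regex to add a comma followed by a space between each pair of joined words (i.e. HowSector:, SoftwareYear, 2010One). So far I have tried writing a re.sub call that selects the characters with no whitespace between them and replaces them, but I ran into a problem: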


for i, list in enumerate(list_3):
    list_3[i] = [re.sub('r\s\s+', ', ', word) for word in list]
    list_33.append(list_3[i])
print(list_33)

Error:

return _compile(pattern, flags).sub(repl, string, count)

TypeError: expected string or bytes-like object

The output I want is:

[['Company Overview, Company: How, Sector: Software, Year Founded: 2010, One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together University, Affiliation(s): Duke, $ Raised: $240,000, Investors: Friends & family, Traction to Date: 10% of monthly active users (MAU) are also active weekly'],[...]]

Any ideas on how I can do this with a regex?

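【Comments】:

  • How do you plan to distinguish OverviewCompany from softwarefeedback? If the answer is "uppercase", then the regex you tried is nowhere near working.
  • @DeepSpace Yes, I want a regex that finds places where a word starting with a capital letter follows another word with no space in between. This is the example I am looking at, geeksforgeeks.org/…, but I can't figure out how to do it.
  • stackoverflow.com/questions/15343163/…, this is a similar question, but in Java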
  • ', '.join(re.split(r'(?<=[a-z])(?=[A-Z])',each_string_in_nested_list))

Tags: python regex nested-lists substitution word-spacing


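【Solution 1】:

The main problem is that your nested list does not have a fixed depth. Sometimes it has 2 levels and sometimes it has 3. That is why you get the error above: where the list has 3 levels, re.sub receives a list as its third argument instead of a string.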
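To make the cause of the TypeError concrete, here is a minimal sketch. The helper name sub_or_error and the tiny inputs are made up for illustration; only re.sub itself comes from the question.

```python
import re

def sub_or_error(item):
    """Return the substituted string, or the TypeError that re.sub raises."""
    try:
        return re.sub(r"\s\s+", ", ", item)
    except TypeError as exc:
        return exc

# A string is accepted: the run of three spaces is replaced.
print(sub_or_error("a   b"))

# A list (what re.sub sees when the data is one level deeper) is rejected.
print(repr(sub_or_error(["a   b"])))
```

The second call is the situation the question runs into for the element that is nested three levels deep.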

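The second problem is that the regex you are using is not the right one. The simplest regex we can use here should (at least) be able to find a non-whitespace character followed by a capital letter.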
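A short sketch of the difference, under the assumption that the question's pattern is taken literally: in 'r\s\s+' the r sits inside the quotes, so it matches a literal "r" followed by two or more whitespace characters. The sample string is a shortened piece of the question's data.

```python
import re

sample = "Company OverviewCompany: HowSector: Software"

# Same characters as the question's pattern, written as a raw string.
asker_pattern = r"r\s\s+"
print(re.findall(asker_pattern, sample))  # no "r" is followed by 2+ whitespace

# A non-whitespace character followed by a capital letter.
simple_pattern = r"(\S)([A-Z])"
print(re.sub(simple_pattern, r"\1, \2", sample))
```

The first pattern finds nothing in this sample, while the second inserts ", " at every place where a capital letter directly follows a non-whitespace character.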

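In the example code below I use re.compile (since the same regex is used over and over, we might as well precompile it and gain a little performance), and I simply print the output. You will need to work out how to get the output into the format you want.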

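import re

# Precompile: a non-whitespace character followed by a capital letter.
regex = re.compile(r'(\S)([A-Z])')
replacement = r'\1, \2'
# nested_list is the list from the question (list_3).
for inner_list in nested_list:
    for string_or_list in inner_list:
        if isinstance(string_or_list, str):
            # Two levels deep: the element is already a string.
            print(regex.sub(replacement, string_or_list))
        else:
            # Three levels deep: the element is a list of strings.
            for string in string_or_list:
                print(regex.sub(replacement, string))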

Output

Company Overview, Company: How, Sector: Software, Year Founded: 2010, One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together, University Affiliation(s): Duke$ Raised: $240,000, Investors: Friends & family, Traction to Date: 10% of monthly active users (, MA, U) are also active weekly
Company Overview, Company: Grub, Sector: Software, Year Founded: 2018, One Sentence Pitch: Find food you like, University Affiliation(s): Stanford$ Raised: $340,000, Investors: Friends & family, Traction to Date: 40% of monthly active users (, MA, U) are also active weekly
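If you would rather collect the results than print them, one possible sketch is below. It keeps the same regex and the same 2-or-3-level assumption; the function name split_joined_words and the short sample list are hypothetical, and it returns a flat list rather than the nested shape from the question.

```python
import re

SPLIT_RE = re.compile(r"(\S)([A-Z])")

def split_joined_words(nested_list):
    """Return a flat list of processed strings from a list nested 2 or 3 levels deep."""
    results = []
    for inner_list in nested_list:
        for string_or_list in inner_list:
            # Normalise both cases to a list of strings.
            if isinstance(string_or_list, str):
                strings = [string_or_list]
            else:
                strings = string_or_list
            results.extend(SPLIT_RE.sub(r"\1, \2", s) for s in strings)
    return results

# Shortened stand-in for the question's data: one 2-level and one 3-level entry.
sample = [["HowSector: Software"], [["GrubSector: Software"]]]
print(split_joined_words(sample))
```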

【Comments】:

    【Solution 2】:

    I believe you can use the following Python code.

    import re

    rgx = r'(?<=[a-z\d])([A-Z$][A-Za-z]*(?: +\S+?)*):'
    rep = r', \1:'
    re.sub(rgx, rep, s)
    

    where s is the string.

    Python's regex engine performs the following operations when matching.

    (?<=          : begin positive lookbehind
      [a-z\d]     : match a lowercase letter or digit
    )             : end positive lookbehind
    (             : begin capture group 1
      [A-Z$]      : match a capital letter or '$'
      [A-Za-z]*   : match 0+ letters
      (?: +\S+?)  : match 1+ spaces greedily, 1+ non-spaces
                    non-greedily in a non-capture group
      *           : execute non-capture group 0+ times
    )             : end capture group
    :             : match ':'
    

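    Note that the positive lookbehind and the characters permitted for each token in the capture group may need to be adjusted to suit your requirements.

    The replacement string (, \1:) produces the string ', ', followed by the contents of capture group 1, followed by a colon.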
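As a runnable sketch of the breakdown above, the same pattern can be written with re.VERBOSE so that the comments live next to each token. The name LABEL_RE is made up for this example, the space is written as [ ] because verbose mode ignores unescaped whitespace, and the sample string is a shortened piece of the question's data.

```python
import re

LABEL_RE = re.compile(
    r"""
    (?<=[a-z\d])       # preceded by a lowercase letter or a digit
    (                  # capture group 1: the label
      [A-Z$]           #   starts with a capital letter or a dollar sign
      [A-Za-z]*        #   then zero or more letters
      (?:[ ]+\S+?)*    #   then zero or more further space-separated words
    )
    :                  # the colon that ends the label
    """,
    re.VERBOSE,
)

s = (
    "Company OverviewCompany: GrubSector: SoftwareYear Founded: 2018"
    "One Sentence Pitch: Find food you likeUniversity Affiliation(s): "
    "Stanford$ Raised: $340,000Investors: Friends & family"
)

result = LABEL_RE.sub(r", \1:", s)
print(result)
```

Because the pattern anchors on the colon that ends each label, it only inserts a separator in front of labels and leaves text such as (MAU) alone.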

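    【Comments】:

      【Solution 3】:

      If your list is arbitrarily deep, you can traverse it recursively, process each string (using the regex shown in the code), and yield the same structure: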

      import re   
      from collections.abc import Iterable 
      
      def process(l):
          for el in l:
              if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
                  yield type(el)(process(el))
              else:
                  yield ', '.join(re.split(r'(?<=[a-z])(?=[A-Z])', el))   
      

      Taking LoL (the nested list from the question) as an example, the result is:

      >>> list(process(LoL))
      [['Company Overview, Company: How, Sector: Software, Year Founded: 2010One Sentence Pitch: Easily give and request low-quality feedback with your team to achieve more together, University Affiliation(s): Duke$ Raised: $240,000Investors: Friends & family, Traction to Date: 10% of monthly active users (MAU) are also active weekly'], [['Company Overview, Company: Grub, Sector: Software, Year Founded: 2018One Sentence Pitch: Find food you like, University Affiliation(s): Stanford$ Raised: $340,000Investors: Friends & family, Traction to Date: 40% of monthly active users (MAU) are also active weekly']]]
      

      【Comments】:

      • 2010One should become 2010, One
      • Easy fix: re.split(r'(?<=\S)(?=[A-Z])'
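To see what the lookbehind change does, here is a small sketch that passes the pattern in as a parameter. The extra pattern argument and the short sample list are additions for illustration, and the third pattern ([a-z\d] in the lookbehind) is only a suggested variant that does not appear in the thread. Splitting on zero-width matches with re.split needs Python 3.7 or later.

```python
import re
from collections.abc import Iterable

def process(items, pattern):
    """Recursively rebuild the nested structure, joining split pieces with ', '."""
    for el in items:
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield type(el)(process(el, pattern))
        else:
            yield ", ".join(re.split(pattern, el))

data = [["Founded: 2010One (MAU)"], [["GrubSector: Software"]]]

lower_only = list(process(data, r"(?<=[a-z])(?=[A-Z])"))       # original answer
any_nonspace = list(process(data, r"(?<=\S)(?=[A-Z])"))         # fix from the comment
lower_or_digit = list(process(data, r"(?<=[a-z\d])(?=[A-Z])"))  # hypothetical variant

print(lower_only)
print(any_nonspace)
print(lower_or_digit)
```

With \S in the lookbehind, 2010One is split but so is (MAU); restricting the lookbehind to lowercase letters and digits splits 2010One and leaves (MAU) intact, at least on this sample.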