【Question title】: Extract substring from large string
【Posted】: 2021-12-23 16:43:33
【Question description】:

I have a string:

string="(2021-07-02 01:00:00 AM BST)  
---  
syl.hs has joined the conversation  
  
  

(2021-07-02 01:00:23 AM BST)  
---  
e.wang  
Good Morning
How're you?
  
  
  

(2021-07-02 01:05:11 AM BST)  
---  
wk.wang  
Hi, I'm Good.  
  
  

(2021-07-02 01:08:01 AM BST)  
---  
perter.derrek   
we got the update on work. 
It will get complete by next week.

(2021-07-15 08:59:41 PM BST)  
---  
ad.ft has left the conversation  
  
  
  
  
---  
  
* * *"

I only want to extract the conversation text (the text between a name and the next timestamp). The expected output is:

cmets=["Good Morning How're you?", "Hi, I'm Good.", "we got the update on work. It will get complete by next week."]

What I tried:

cmets=re.findall(r'---\s*\n(.(?:\n(?!(?:(\s\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\s*[AP]M\s+GMT\s*)\w+\s*\n)?---).))',string)

【Question discussion】:

    Tags: python-3.x regex string


    【Solution 1】:

    You can use a single capture group:

    ^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)
    

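    The pattern matches:

    • ^ Start of the string (with re.M, the start of each line)
    • ---\s*\n Match ---, optional whitespace characters and a newline
    • (?!.* has (?:joined|left) the conversation|\* \* \*) Negative lookahead: assert that the line does not contain " has joined the conversation" or " has left the conversation", and does not start with * * *
    • \S.* Match a non-whitespace character at the start of the line, then the rest of the line
    • ( Capture group 1 (this is what re.findall returns)
      • (?:\n(?!\(\d|---).*)* Match all following lines that do not start with ( and a digit, or with ---
    • ) Close group 1

    See the regex demo and the Python demo.

    Example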

    import re

    # s holds the transcript string from the question
    pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"
    result = [m.strip() for m in re.findall(pattern, s, re.M) if m]
    print(result)
    

    Output

    ["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work. \nIt will get complete by next week.']
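The question asks for each message as a single line, while the pattern above returns the line breaks as they appear in the text. Here is a minimal sketch that applies the same pattern and then collapses runs of whitespace. The shortened `sample` and the `extract_comments` helper are assumptions made for illustration and are not part of the original answer.

```python
import re

# Hypothetical, shortened sample in the same layout as the question's transcript.
sample = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation


(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?


(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.

(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation


---

* * *"""

# Pattern copied from the answer above; group 1 holds the lines after the name.
pattern = r"^---\s*\n(?!.* has (?:joined|left) the conversation|\* \* \*)\S.*((?:\n(?!\(\d|---).*)*)"


def extract_comments(text):
    """Return each message body as one line with whitespace runs collapsed."""
    bodies = re.findall(pattern, text, re.M)
    # str.split() with no argument splits on any whitespace, including newlines.
    return [" ".join(body.split()) for body in bodies if body.strip()]


comments = extract_comments(sample)
print(comments)
```

On this sample the join and leave notices are skipped and the two messages come back as `"Good Morning How're you?"` and `"Hi, I'm Good."`.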
    

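    【Discussion】:

      【Solution 2】:

      I assume that:

      • The text of interest begins after a block of three lines: a line containing a timestamp, then a "---" line that may be padded on the right with spaces, then a line consisting of a string of letters containing one period that is neither at the beginning nor at the end of that string, which may also be padded on the right with spaces.
      • The block of text of interest may contain blank lines, a blank line being a string that contains only spaces and a line terminator.
      • The last line of the block of text of interest cannot be a blank line.

      I believe the following regular expression, with the multiline (m) and case-insensitive (i) flags set, meets these requirements.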

      ^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n-{3} *\r?\n[a-z]+\.[a-z]+ *\r?\n((?:.*[^ (\n].*\r?\n| *\r?\n(?=(?: *\r?\n)*(?!\(\d{4}\-\d{2}\-\d{2} .*\)).*[^ (\n]))*)
      

      The block of lines of interest is contained in capture group 1.

      Start your engine!

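      The elements of the expression are as follows (written in free-spacing mode, so literal spaces outside character classes are escaped).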

      ^\(\d{4}\-\d{2}\-\d{2}\ .*\)\ *\r?\n  # match timestamp line
      -{3}\ *\r?\n                          # match 3-hyphen line
      [a-z]+\.[a-z]+\ *\r?\n                # match name
      (                                     # begin capture group 1
        (?:                                 # begin non-capture group (a)
          .*[^ (\n].*\r?\n                  # match a non-blank line
          |                                 # or
          \ *\r?\n                          # match a blank line
          (?=                               # begin a positive lookahead
            (?:                             # begin non-capture group (b)
              \ *\r?\n                      # match a blank line
            )*                              # end non-capture group b and execute 0+ times
            (?!                             # begin a negative lookahead
              \(\d{4}\-\d{2}\-\d{2}\ .*\)   # match timestamp line
            )                               # end negative lookahead
            .*[^ (\n]                       # match a non-blank line
          )                                 # end positive lookahead
        )*                                  # end non-capture group a and execute 0+ times
      )                                     # end capture group 1
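A minimal sketch of trying the single-line expression in Python, with the m and i flags passed as `re.M | re.I`. The shortened `sample` is hypothetical; it keeps the trailing spaces and the blank line before each timestamp that the stated assumptions rely on.

```python
import re

# Hypothetical, shortened transcript. Trailing spaces are written out explicitly,
# and a blank line precedes every timestamp line, as in the question's data.
sample = "\n".join([
    "(2021-07-02 01:00:00 AM BST)  ",
    "---  ",
    "syl.hs has joined the conversation  ",
    "  ",
    "",
    "(2021-07-02 01:00:23 AM BST)  ",
    "---  ",
    "e.wang  ",
    "Good Morning",
    "How're you?",
    "  ",
    "",
    "(2021-07-02 01:08:01 AM BST)  ",
    "---  ",
    "perter.derrek   ",
    "we got the update on work. ",
    "It will get complete by next week.",
    "",
    "(2021-07-15 08:59:41 PM BST)  ",
    "---  ",
    "ad.ft has left the conversation  ",
    "  ",
    "---  ",
    "  ",
    "* * *",
])

# The single-line expression from this answer, split over two raw strings.
pattern = re.compile(
    r"^\(\d{4}\-\d{2}\-\d{2} .*\) *\r?\n-{3} *\r?\n[a-z]+\.[a-z]+ *\r?\n"
    r"((?:.*[^ (\n].*\r?\n| *\r?\n(?=(?: *\r?\n)*(?!\(\d{4}\-\d{2}\-\d{2} .*\)).*[^ (\n]))*)",
    re.M | re.I,
)

# Group 1 keeps the final line terminator, so strip it for display.
blocks = [m.group(1).strip() for m in pattern.finditer(sample)]
print(blocks)
```

The join and leave notices are not matched because their third line is not just a `letters.letters` name followed by optional spaces.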
      

      【Discussion】:

        【Solution 3】:

        Here is a self-documenting regex that also strips leading and trailing whitespace:

        (?x)(?m)(?s)                                                    # re.X, re.M, re.S (DOTALL)
        (?:                                                             # start of non capturing group
         ^\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)\s*\r?\n  # date and time
         (?!---\s*\r?\nad\.ft\ has)                                     # next lines are not the ---\nad.ft etc.
         ---\s*\r?\n                                                    # --- line
         [\w.]+\s*\r?\n                                                 # name line
         \s*                                                            # skip leading whitespace
        )                                                               # end of non-capture group
        # The following is capture group 1. Match characters until you get to the next date-time:
        ((?:(?!\s*\r?\n\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)).)*)# skip trailing whitespace
        

        See Regex Demo

        See Python Demo

        import re
        
        string = """(2021-07-02 01:00:00 AM BST)
        ---
        syl.hs has joined the conversation
        
        
        
        (2021-07-02 01:00:23 AM BST)
        ---
        e.wang
        Good Morning
        How're you?
        
        
        
        
        (2021-07-02 01:05:11 AM BST)
        ---
        wk.wang
        Hi, I'm Good.
        
        
        
        (2021-07-02 01:08:01 AM BST)
        ---
        perter.derrek
        we got the update on work.
        It will get complete by next week.
        
        (2021-07-15 08:59:41 PM BST)
        ---
        ad.ft has left the conversation
        
        
        
        
        ---
        
        * * *"""
        
        regex = r'''(?x)(?m)(?s)                                        # re.X, re.M, re.S (DOTALL)
        (?:                                                             # start of non capturing group
         ^\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)\s*\r?\n  # date and time
         (?!---\s*\r?\nad\.ft\ has)                                     # next lines are not the ---\nad.ft etc.
         ---\s*\r?\n                                                    # --- line
         [\w.]+\s*\r?\n                                                 # name line
         \s*                                                            # skip leading whitespace
        )                                                               # end of non-capture group
        # The following is capture group 1. Match characters until you get to the next date-time:
        ((?:(?!\s*\r?\n\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)).)*)# skip trailing whitespace
        '''
        
        matches = re.findall(regex, string)
        print(matches)
        

        Prints:

        ["Good Morning\nHow're you?", "Hi, I'm Good.", 'we got the update on work.\nIt will get complete by next week.']
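As a variation, the same expression can be compiled with the flags passed as arguments and the timestamp sub-pattern defined once. Passing flags this way also sidesteps the rule in Python 3.11 and later that inline global flags such as `(?x)` must appear at the very start of the pattern. This is only a sketch: the names `TIMESTAMP` and `MESSAGE` and the shortened `sample` are assumptions for illustration.

```python
import re

# Timestamp sub-pattern, defined once. Spaces are escaped because re.X
# ignores bare whitespace outside character classes.
TIMESTAMP = r"\(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2}\ [AP]M\ BST\)"

MESSAGE = re.compile(
    "^" + TIMESTAMP + r"""\s*\r?\n      # date and time line
    (?!---\s*\r?\nad\.ft\ has)          # not the ad.ft notice (space escaped for re.X)
    ---\s*\r?\n                         # --- line
    [\w.]+\s*\r?\n                      # name line
    \s*                                 # skip leading whitespace
    ((?:(?!\s*\r?\n""" + TIMESTAMP + r""").)*)  # group 1: text before the next timestamp
    """,
    re.X | re.M | re.S,
)

# Hypothetical, shortened sample in the same layout as the question's transcript.
sample = """(2021-07-02 01:00:00 AM BST)
---
syl.hs has joined the conversation

(2021-07-02 01:00:23 AM BST)
---
e.wang
Good Morning
How're you?


(2021-07-02 01:05:11 AM BST)
---
wk.wang
Hi, I'm Good.

(2021-07-15 08:59:41 PM BST)
---
ad.ft has left the conversation

---

* * *"""

matches = MESSAGE.findall(sample)
print(matches)
```

Like the original, this relies on every message being followed by another timestamp line; a final message with no timestamp after it would run to the end of the text.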
        

        【Discussion】:
