【问题标题】:Add symbol to UTF8 BOM string using re.sub [duplicate]使用 re.sub [重复] 将符号添加到 UTF8 BOM 字符串
【发布时间】:2023-01-31 23:25:53
【问题描述】:

我有 UTF8 BOM 文本文件的文件:

1
00:00:05,850 --> 00:00:07,713
Welcome to the Course.

2
00:00:08,550 --> 00:00:10,320
This course has been designed to teach you

3
00:00:10,320 --> 00:00:12,750
all about the,
...

我需要添加一个“;”在每组数字的末尾。我用代码来做到这一点:

import re

with open("/content/file.srt", "r", encoding='utf-8-sig') as strfile:
    str_file_content = strfile.read()
    print(str_file_content)

test = re.sub(r'^(\d{1,3})$', r'\1;', str_file_content)
test

结果:

1\n00:00:05,850 --> 00:00:07,713\nWelcome to the Course.\n\n2\n00:00:08,550 --> 00:00:10,320\nThis course has been designed to teach you\n\n

即符号“;”没有添加! 我期望的结果:

1;
00:00:05,850 --> 00:00:07,713
Welcome to the Course.

2;
00:00:08,550 --> 00:00:10,320
This course has been designed to teach you

3;
00:00:10,320 --> 00:00:12,750
all about the,
...

我做错了什么?

【问题讨论】:

  • 您要查找的行中是否有空格?
  • 正则表达式中的 ^$ 将其锚定到字符串的开头和结尾。也就是说,它仅在您的整个字符串仅包含 1-3 个数字字符。我认为 test = re.sub(r'(\d{1,3})', r'\1;', str_file_content) 会做正确的事。
  • 使用多行修饰符作为参数:re.sub(r'^(\d{1,3})$', r'\1;', str_file_content, flags=re.M)

标签: python regex


【解决方案1】:

您可以使用 MULTILINE 标志使 ^$ 的行为符合您的预期:

test = re.sub(r'^(d{1,3})$', r';', strtest, flags=re.MULTILINE)

【讨论】:

    【解决方案2】:

    这是一个可能的正则表达式:

    test = re.sub(r"(d+)
    ", r";
    ", str_file_content)
    

    【讨论】:

      猜你喜欢
      • 2015-12-10
      • 2021-06-29
      • 1970-01-01
      • 2013-07-26
      • 1970-01-01
      • 2021-10-18
      • 2012-12-15
      • 2012-10-06
      相关资源
      最近更新 更多