Python 3.6.1 | Regex search on files with special characters

【Question title】: Python 3.6.1 | Regex Search on files with special characters
【Posted】: 2017-09-13 05:37:43
【Question】:

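What am I trying to do?

Search for a list of alphabetic strings across a set of files on a Windows file system (about 25K files of varying sizes and extensions, mostly plain text, the largest no more than a few MB).

What have I done to achieve this?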

import mmap
import re

# files (paths to scan) and file_write_handle (open output file) are defined elsewhere
for each_file in files:
    with open(each_file, "rb") as file_read_handle:
        first_char = file_read_handle.read(1)  # empty files cannot be memory-mapped
        if first_char:
            file_read_content_mappd = mmap.mmap(file_read_handle.fileno(), 0, access=mmap.ACCESS_READ)
            if re.search(br'(?i)T_0008X_WEB', file_read_content_mappd):
                file_write_content = 'Text T_0008X_WEB found in {}'.format(each_file)
                file_write_handle.write(file_write_content)
                file_write_handle.write("\n")
            file_read_content_mappd.close()  # release the mapping before moving on
file_write_handle.close()

This code works well for a hard-coded search text (see the T_0008X_WEB line). The files are opened in binary mode ("rb") to avoid the error UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 776: character maps to <undefined>.

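However, when I tried to search for a list of values by replacing the hard-coded value with a variable, like this: if re.search('br\'(?i)' + regex_search_str_byte + '\'', file_read_content_mappd): I ran into the following problems: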

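Using re.search('br\'(?i)' + regex_search_str + '\'', file_read_content_mappd): fails with an error, because the file content is bytes and the search text is a str.
Using re.search(regex_search_str_byte, file_read_content_mappd): finds no match, because even the regex characters br'(?i) are treated as part of the byte-converted search text.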
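Both cases can be reproduced without touching any files. The following is a minimal sketch, assuming Python 3; the sample bytes in data and the term in each_string are made up for illustration, and case 1 is simplified to a plain str pattern so that it shows only the str-versus-bytes mismatch:

```python
import re

data = b"header t_0008x_web footer"   # stand-in for the mmap'd file content
each_string = "T_0008X_WEB"            # hypothetical search term

# Case 1: any str pattern searched against bytes data raises TypeError.
try:
    re.search("(?i)" + each_string, data)
    case_1 = "no error"
except TypeError:
    case_1 = "TypeError"

# Case 2: building the literal syntax as text and converting it to bytes
# puts the b/r prefix and the quotes inside the pattern itself.
wrong_pattern = bytes("br'(?i)" + each_string + "'", "utf-8")

print(case_1)          # TypeError
print(wrong_pattern)   # b"br'(?i)T_0008X_WEB'"
```

The second pattern asks the regex engine to find the literal characters br' in the file, which is why nothing matches.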

How can I run a regex search for a list of byte-converted text values against files that are opened for reading in binary mode?

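【Comments】:

It looks like you need if re.search(str.encode(regex_search_str), file_read_content_mappd)
@WiktorStribiżew: In that case, how should we include the regex flag br'(?i)? I already tried this in case 2 above: I stored the whole value, including the regex flag, in the variable regex_search_str, converted that string to bytes, and saved it in regex_search_str_byte. I think your suggestion of encoding the string to UTF-8 amounts to the same thing. However, it returns no match in this case, and I suspect the byte-converted search text treats the regex flag as part of the search text too. A more specific suggestion would be helpful.
if re.search(str.encode(regex_search_str), file_read_content_mappd, flags=re.I). The flag can be passed as an argument to the re.search method. The br prefix is not needed, because it modifies string literals and you are using a variable. I assume regex_search_str is a UTF-8 string. See this question.
How did you create/assign regex_search_str_byte?
regex_search_str_byte = bytes(each_string, 'utf-8'). each_string is an element of another Python list of alphanumeric strings.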

Tags: python regex search mmap

【Solution 1】:

Use

re.search(regex_search_str_byte, file_read_content_mappd, flags=re.I)

The re.I flag can be passed as an argument to the re.search method. The br prefix is not needed, because it modifies string literals and you are using a variable.
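Applied to a list of search terms, the advice can be sketched end to end as follows. This is an illustration under stated assumptions, not the asker's actual script: the function name find_terms, the sample terms, and the temporary file are invented for the example, and re.escape is added on the assumption that the terms should match literally rather than as regex syntax.

```python
import mmap
import os
import re
import tempfile

def find_terms(path, terms):
    """Return the terms found (case-insensitively) in the file at path."""
    found = []
    with open(path, "rb") as handle:
        if not handle.read(1):          # mmap cannot map an empty file
            return found
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            for term in terms:
                # Encode the str to bytes; the flag is an argument, not pattern text.
                pattern = re.escape(term).encode("utf-8")
                if re.search(pattern, mapped, flags=re.I):
                    found.append(term)
    return found

# Usage with a throwaway file that contains the byte 0x9d from the error message.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc \x9d t_0008x_web xyz")
print(find_terms(tmp.name, ["T_0008X_WEB", "NOT_THERE"]))   # ['T_0008X_WEB']
os.remove(tmp.name)
```

Because both the pattern and the mapped file are bytes, no decoding takes place, so undecodable bytes such as 0x9d do not get in the way of the search.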

【Discussion】: