Python regex groupdict returns single characters instead of strings for groups

【Question title】: Python regex groupdict returns single characters instead of strings for groups
【Posted】: 2016-04-05 09:11:27
【Question】:

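I've run into a very confusing problem with regex matching in Python. I have a set of regex patterns that work fine in debugging tools such as regex101:

[Hex&Oct matching Pattern] (the code in the test window is the same as the file contents used in the console test)
[Base64 matching Pattern] (far from ideal, but the 15-character minimum helps avoid false positives)
[Hex|Oct splitting Pattern] (a variation of Hex&Oct with differently named groups)

However, once implemented in the script, the patterns fail to match anything unless they are compiled and an r is added before the opening quote.

Even then, the matches come back from the group dictionary as single characters.

Can anyone offer any pointers as to what I'm doing wrong here?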

deobf.py:

#!/bin/python
import sys
import getopt
import re
import base64

####################################################################################
#
# Setting up global vars and functions
#
####################################################################################

# Assemble Pattern Dictionary
pattern={}
pattern["HexOct"]=re.compile(r'([\"\'])(?P<obf_code>(\\[xX012]?[\dA-Fa-f]{2})*)\1')
pattern["Base64"]=re.compile(r'([\"\'])(?P<obf_code>[\dA-Za-z\/\+]{15,}={0,2})\1')

# Assemble more precise Pattern handling:
sub_pattern={}
sub_pattern["HexOct"]=re.compile(r'((?P<Hex>\\[xX][\dA-Fa-f]{2})|(?P<Oct>\\[012]?[\d]{2}))')

#print pattern # trying to Debug Pattern Dicts
#print sub_pattern # trying to Debug Pattern Dicts

# Global Var init
file_in=""
file_out=""
code_string=""
format_code = False

# Prints the Help screen
def usage():
    print "How to use deobf.py:"
    print "-----------------------------------------------------------\n"
    print "$ python deobf.py -i {inputfile.php} [-o {outputfile.txt}]\n"
    print "Other options include:"
    print "-----------------------------------------------------------"
    print "-f : Format - Format the output code with indentations"
    print "-h : Help - Prints this info\n"
    print "-----------------------------------------------------------"
    print "You can also use the long forms:"
    print "-i : --in"
    print "-o : --out"
    print "-f : --format"
    print "-h : --Help"

# Combination wrapper for the above two functions
def deHexOct(obf_code):
    match = re.search(sub_pattern["HexOct"],obf_code)
    if match:

        # Find and process Hex obfuscated elements
        for HexObj in match.groupdict()["Hex"]:
            print match.groupdict()["Hex"]
            print "Processing:"
            print HexObj.pattern
            obf_code.replace(HexObj,chr(int(HexObj),16))

        # Find and process Oct obfuscated elements
        for OctObj in set(match.groupdict()["Oct"]):
            print "Processing:"
            print OctObj
            obf_code.replace(OctObj,chr(int(OctObj),8))
    return obf_code

# Crunch the Data
def deObfuscate(file_string):
    # Identify HexOct sections and process
    match = re.search(pattern["HexOct"],file_string)
    if match:
        print "HexOct Obfuscation found."
        for HexOctObj in match.groupdict()["obf_code"]:
            print "Processing:"
            print HexOctObj
            file_string.replace(HexOctObj,deHexOct(HexOctObj))

    # Identify B64 sections and process
    match = re.search(pattern["Base64"],file_string)
    if match:
        print "Base64 Obfuscation found."
        for B64Obj in match.groupdict()["obf_code"]:
            print "Processing:"
            print B64Obj
            file_string.replace(B64Obj,base64.b64decode(B64Obj))

    # Return the (hopefully) deobfuscated string
    return file_string

# File to String
def loadFile(file_path):
    try:
        file_data = open(file_path)
        file_string = file_data.read()
        file_data.close()
        return file_string
    except ValueError,TypeError:
        print "[ERROR] Problem loading the File: " + file_path

# String to File
def saveFile(file_path,file_string):
    try:
        file_data = open(file_path,'w')
        file_data.write(file_string)
        file_data.close()
    except ValueError,TypeError:
        print "[ERROR] Problem saving the File: " + file_path

####################################################################################
#
# Main body of Script
#
####################################################################################
# Getting the args
try:
    opts, args = getopt.getopt(sys.argv[1:], "hi:o:f", ["help","in","out","format"])
except getopt.GetoptError:
    usage()
    sys.exit(2)

# Handling the args
for opt, arg in opts:
    if opt in ("-h", "--help"):
        usage()
        sys.exit()
    elif opt in ("-i", "--in"):
        file_in = arg
        print "Designated input file: "+file_in
    elif opt in ("-o", "--out"):
        file_out = arg
        print "Designated output file: "+file_out
    elif opt in ("-f", "--format"):
        format_code = True
        print "Code Formatting mode enabled"

# Checking the input   
if file_in =="":
    print "[ERROR] - No Input File Specified"
    usage()
    sys.exit(2)

# Checking or assigning the output
if file_out == "":
    file_out = file_in+"-deObfuscated.txt"
    print "[INFO] - No Output File Specified - Automatically assigned: "+file_out

# Zhu Li, Do the Thing!
code_string=loadFile(file_in)
deObf_String=deObfuscate(str(code_string))
saveFile(file_out,deObf_String)

The console output from my debug prints is as follows:

C:\Users\NJB\workspace\python\deObf>deobf.py -i "Form 5138.php"
Designated input file: Form 5138.php
[INFO] - No Output File Specified - Automatically assigned: Form 5138.php-deObfuscated.txt
HexOct Obfuscation found.
Processing:
\
Processing:
x
Processing:
6
Processing:
1
Processing:
\
Processing:
1
Processing:
5
Processing:
6
Processing:
\
Processing:
x
Processing:
7
Processing:
5
Processing:
\
Processing:
1
Processing:
5
Processing:
6
Processing:
\
Processing:
x
Processing:
6
Processing:
1

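【Question comments】:

Tags: python regex string character-encoding deobfuscation

【Solution 1】:

Your regex matches the groups just fine, but you are iterating over the characters of the matched group.

This gives you the string you just matched: match.groupdict()["Hex"]

This iterates over the characters in that string: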

for HexObj in match.groupdict()["Hex"]:
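
To make the behaviour concrete, here is a minimal sketch written for Python 3 (the question's script is Python 2). It reuses the sub-pattern from the question; the sample input and the variable names are assumptions chosen for illustration.

```python
import re

# Same sub-pattern as sub_pattern["HexOct"] in the question.
hex_oct = re.compile(r'((?P<Hex>\\[xX][\dA-Fa-f]{2})|(?P<Oct>\\[012]?[\d]{2}))')

# Hypothetical obfuscated input (a raw string, so the backslashes are literal).
sample = r'\x61\156\x75'

match = hex_oct.search(sample)          # search() returns only the FIRST match
hex_text = match.groupdict()["Hex"]     # the whole group, as one string
chars = [c for c in hex_text]           # iterating a string yields its characters

print(hex_text)   # prints: \x61
print(chars)      # prints: ['\\', 'x', '6', '1']
```

The loop in the question therefore sees one character per iteration, which matches the one-character-per-line console output.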

What you want is to iterate over the search results, so use re.finditer() instead of re.search(). Something like:

def deHexOct(obf_code):
    for match in re.finditer(sub_pattern["HexOct"],obf_code):
        # Find and process Hex obfuscated elements
        groups = match.groupdict()
        hex = groups["Hex"]
        if hex:
            print "hex:", hex
            # do processing here
        oct = groups["Oct"]
        if oct:
            print "oct:", oct 
            # do processing here
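
One way the "do processing here" placeholders might be filled in is sketched below, written for Python 3. It is not the answerer's code: it swaps the explicit finditer() loop for re.sub() with a callback, because str.replace() returns a new string rather than changing obf_code in place, and it calls int(text, base) before chr(). The names decode_escape and de_hex_oct are hypothetical, and narrowing the octal branch from \d to [0-7] is an assumption made so that int(..., 8) cannot fail on the digits 8 and 9.

```python
import re

# Hex and Oct alternatives, adapted from the question's sub-pattern.
# Assumption: octal digits are restricted to [0-7] instead of \d.
SUB_PATTERN = re.compile(r'(?P<Hex>\\[xX][0-9A-Fa-f]{2})|(?P<Oct>\\[012]?[0-7]{2})')

def decode_escape(match):
    """Turn one matched escape sequence into the character it encodes."""
    groups = match.groupdict()
    if groups["Hex"]:
        # Drop the leading backslash and 'x', then parse the rest as base 16.
        return chr(int(groups["Hex"][2:], 16))
    # Drop the leading backslash, then parse the rest as base 8.
    return chr(int(groups["Oct"][1:], 8))

def de_hex_oct(obf_code):
    """Return a copy of obf_code with every hex/octal escape decoded."""
    return SUB_PATTERN.sub(decode_escape, obf_code)

print(de_hex_oct(r'\x61\156\x75\156\x61'))   # prints: anuna
```

Because strings are immutable, the decoded text has to be taken from the return value; the question's calls to obf_code.replace(...) discard it.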

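Also, the r in front of the string simply stops Python from interpreting backslashes as escapes. Regexes need it because they use backslashes for escaping as well. The alternative is to double every backslash in the regex; then you don't need the r prefix, but the regex can become even less readable.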
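
The effect of the r prefix can be checked directly. The sketch below (Python 3) uses a shortened version of the hex pattern; the variable names and the sample text are assumptions made for illustration.

```python
import re

text = r'\x61'   # four literal characters: backslash, x, 6, 1

# Raw string: the regex engine receives \\ and matches one literal backslash.
raw_pat = r'\\[xX][0-9A-Fa-f]{2}'

# Same literal without r: Python collapses \\ to \, so the regex engine
# receives \[xX]... and looks for a literal "[xX]" instead of a backslash.
plain_pat = '\\[xX][0-9A-Fa-f]{2}'

# Doubling every backslash gives the same pattern as the raw string.
doubled_pat = '\\\\[xX][0-9A-Fa-f]{2}'

print(raw_pat == doubled_pat)                    # prints: True
print(re.search(raw_pat, text) is not None)      # prints: True
print(re.search(plain_pat, text) is not None)    # prints: False
```

This is consistent with the question's observation that the patterns matched nothing until the r prefix was added.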

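【Comments】:

Thanks for pointing out the character issue, but unfortunately I get an error with findall saying that the returned match object has no groupdict() method. Interestingly, printing it shows that it is a dictionary with no keys. I had a little more luck with finditer(), but I'm still scratching my head.
Sorry, I should have said "finditer". finditer returns match objects; findall only returns strings.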
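
The difference between the two calls mentioned in the comment above can be seen in a short sketch (Python 3). The pattern is a trimmed version of the question's sub-pattern with the octal digits narrowed to [0-7], and the sample text is an assumption.

```python
import re

pat = re.compile(r'(?P<Hex>\\[xX][0-9A-Fa-f]{2})|(?P<Oct>\\[012]?[0-7]{2})')
text = r'\x61\156'

# findall() returns plain strings; with several groups it returns one tuple
# of strings per match, and a group that did not take part is ''.
found = pat.findall(text)

# finditer() yields match objects, which do have groupdict().
dicts = [m.groupdict() for m in pat.finditer(text)]

print(found)   # prints: [('\\x61', ''), ('', '\\156')]
print(dicts)   # prints: [{'Hex': '\\x61', 'Oct': None}, {'Hex': None, 'Oct': '\\156'}]
```

Calling groupdict() on an element of the findall() result fails because those elements are tuples, not match objects.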
With that change, it works! Now I just need to boil my code down to the DRYest solution.