【Title】: Python regex groupdict returns single characters instead of strings for groups
【Posted】: 2016-04-05 09:11:27
【Question】:

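I've run into a very confusing problem with regex matching in Python. I have a pair of regex patterns that work fine in debugging tools such as regex101 (both appear in the script below).

However, once they are used in the script, the patterns fail to match anything unless they are compiled and written with an r prefix before the opening quote.

Even then, the matches come back from the group dictionary as single characters.

Can anyone offer any pointers as to what I'm doing wrong here?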

deobf.py:

#!/bin/python
import sys
import getopt
import re
import base64

####################################################################################
#
# Setting up global vars and functions
#
####################################################################################

# Assemble Pattern Dictionary
pattern={}
pattern["HexOct"]=re.compile(r'([\"\'])(?P<obf_code>(\\[xX012]?[\dA-Fa-f]{2})*)\1')
pattern["Base64"]=re.compile(r'([\"\'])(?P<obf_code>[\dA-Za-z\/\+]{15,}={0,2})\1')

# Assemble more precise Pattern handling:
sub_pattern={}
sub_pattern["HexOct"]=re.compile(r'((?P<Hex>\\[xX][\dA-Fa-f]{2})|(?P<Oct>\\[012]?[\d]{2}))')

#print pattern # trying to Debug Pattern Dicts
#print sub_pattern # trying to Debug Pattern Dicts

# Global Var init
file_in=""
file_out=""
code_string=""
format_code = False

# Prints the Help screen
def usage():
    print "How to use deobf.py:"
    print "-----------------------------------------------------------\n"
    print "$ python deobf.py -i {inputfile.php} [-o {outputfile.txt}]\n"
    print "Other options include:"
    print "-----------------------------------------------------------"
    print "-f : Format - Format the output code with indentations"
    print "-h : Help - Prints this info\n"
    print "-----------------------------------------------------------"
    print "You can also use the long forms:"
    print "-i : --in"
    print "-o : --out"
    print "-f : --format"
    print "-h : --Help"

# Combination wrapper for the above two functions
def deHexOct(obf_code):
    match = re.search(sub_pattern["HexOct"],obf_code)
    if match:

        # Find and process Hex obfuscated elements
        for HexObj in match.groupdict()["Hex"]:
            print match.groupdict()["Hex"]
            print "Processing:"
            print HexObj.pattern
            obf_code.replace(HexObj,chr(int(HexObj),16))

        # Find and process Oct obfuscated elements
        for OctObj in set(match.groupdict()["Oct"]):
            print "Processing:"
            print OctObj
            obf_code.replace(OctObj,chr(int(OctObj),8))
    return obf_code

# Crunch the Data
def deObfuscate(file_string):
    # Identify HexOct sections and process
    match = re.search(pattern["HexOct"],file_string)
    if match:
        print "HexOct Obfuscation found."
        for HexOctObj in match.groupdict()["obf_code"]:
            print "Processing:"
            print HexOctObj
            file_string.replace(HexOctObj,deHexOct(HexOctObj))

    # Identify B64 sections and process
    match = re.search(pattern["Base64"],file_string)
    if match:
        print "Base64 Obfuscation found."
        for B64Obj in match.groupdict()["obf_code"]:
            print "Processing:"
            print B64Obj
            file_string.replace(B64Obj,base64.b64decode(B64Obj))

    # Return the (hopefully) deobfuscated string
    return file_string

# File to String
def loadFile(file_path):
    try:
        file_data = open(file_path)
        file_string = file_data.read()
        file_data.close()
        return file_string
    except ValueError,TypeError:
        print "[ERROR] Problem loading the File: " + file_path

# String to File
def saveFile(file_path,file_string):
    try:
        file_data = open(file_path,'w')
        file_data.write(file_string)
        file_data.close()
    except ValueError,TypeError:
        print "[ERROR] Problem saving the File: " + file_path

####################################################################################
#
# Main body of Script
#
####################################################################################
# Getting the args
try:
    opts, args = getopt.getopt(sys.argv[1:], "hi:o:f", ["help","in","out","format"])
except getopt.GetoptError:
    usage()
    sys.exit(2)

# Handling the args
for opt, arg in opts:
    if opt in ("-h", "--help"):
        usage()
        sys.exit()
    elif opt in ("-i", "--in"):
        file_in = arg
        print "Designated input file: "+file_in
    elif opt in ("-o", "--out"):
        file_out = arg
        print "Designated output file: "+file_out
    elif opt in ("-f", "--format"):
        format_code = True
        print "Code Formatting mode enabled"

# Checking the input   
if file_in =="":
    print "[ERROR] - No Input File Specified"
    usage()
    sys.exit(2)

# Checking or assigning the output
if file_out == "":
    file_out = file_in+"-deObfuscated.txt"
    print "[INFO] - No Output File Specified - Automatically assigned: "+file_out

# Zhu Li, Do the Thing!
code_string=loadFile(file_in)
deObf_String=deObfuscate(str(code_string))
saveFile(file_out,deObf_String)

The console output from my debug prints is as follows:

C:\Users\NJB\workspace\python\deObf>deobf.py -i "Form 5138.php"
Designated input file: Form 5138.php
[INFO] - No Output File Specified - Automatically assigned: Form 5138.php-deObfuscated.txt
HexOct Obfuscation found.
Processing:
\
Processing:
x
Processing:
6
Processing:
1
Processing:
\
Processing:
1
Processing:
5
Processing:
6
Processing:
\
Processing:
x
Processing:
7
Processing:
5
Processing:
\
Processing:
1
Processing:
5
Processing:
6
Processing:
\
Processing:
x
Processing:
6
Processing:
1

【Question comments】:

    Tags: python regex string character-encoding deobfuscation


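    【Solution 1】:

    Your regex matches the groups just fine, but you are iterating over the characters of the matched group.

    This gives you the string you just matched: match.groupdict()["Hex"]

    And this iterates over the characters in that string: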

    for HexObj in match.groupdict()["Hex"]:
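
To make the difference concrete, here is a minimal Python 3 sketch. The hex-only pattern and the `sample` input are simplified stand-ins chosen for illustration, not the asker's exact data.

```python
import re

# Simplified stand-in for the question's sub-pattern (hex escapes only).
hex_pattern = re.compile(r'(?P<Hex>\\[xX][\dA-Fa-f]{2})')

# Hypothetical obfuscated input; the raw literal keeps the backslashes.
sample = r'\x61\x75'

# search() stops at the first match, so this is one string, not a list.
match = hex_pattern.search(sample)
value = match.groupdict()["Hex"]

# Iterating over a str yields its characters one at a time.
chars = [c for c in value]

print(repr(value))
print(chars)
```

The loop in the question does the same thing as the list comprehension here, which is why each "Processing:" line shows a single character.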
    

    What you want is to iterate over successive matches, so use re.finditer() instead of re.search(). Something like:

    def deHexOct(obf_code):
        # finditer yields one match object per escape found in obf_code
        for match in re.finditer(sub_pattern["HexOct"],obf_code):
            groups = match.groupdict()
            # Only one of the two named groups is set for a given match
            hex_str = groups["Hex"]
            if hex_str:
                print "hex:", hex_str
                # do processing here
            oct_str = groups["Oct"]
            if oct_str:
                print "oct:", oct_str
                # do processing here
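
One possible way to fill in the "do processing here" steps is sketched below in Python 3. It is an illustration under stated assumptions rather than the asker's final code: the name `de_hex_oct` is hypothetical, the octal branch is widened to one to three digits in `[0-7]`, and the sample string is reconstructed from the escapes visible in the question's console output.

```python
import re

# Same named groups as the question's sub_pattern["HexOct"]. The octal branch
# accepting one to three digits in [0-7] is an assumption of this sketch.
HEX_OCT = re.compile(r'(?P<Hex>\\[xX][\dA-Fa-f]{2})|(?P<Oct>\\[0-7]{1,3})')

def de_hex_oct(obf_code):
    """Return obf_code with its hex and octal escapes decoded."""
    pieces = []
    last_end = 0
    for match in HEX_OCT.finditer(obf_code):
        # Keep the untouched text between the previous escape and this one.
        pieces.append(obf_code[last_end:match.start()])
        groups = match.groupdict()
        if groups["Hex"]:
            # Drop the leading backslash and "x", then parse as base 16.
            pieces.append(chr(int(groups["Hex"][2:], 16)))
        else:
            # Drop the leading backslash, then parse as base 8.
            pieces.append(chr(int(groups["Oct"][1:], 8)))
        last_end = match.end()
    # Keep whatever follows the final escape.
    pieces.append(obf_code[last_end:])
    return "".join(pieces)

decoded = de_hex_oct(r'\x61\156\x75\156\x61')
print(decoded)
```

Building a new string avoids a second problem in the question's code: `str.replace` returns a new string instead of modifying the original, so calling it without using the result has no effect.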
    

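    Also, the r prefix in front of a string just stops Python from interpreting backslashes as escape sequences. Regexes need it because they use backslashes for their own escapes too. The alternative is to double every backslash in the regex; then you don't need the r prefix, but the regex becomes much harder to read.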
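
The effect of the r prefix can be checked directly. The following Python 3 sketch uses a fragment of the hex pattern as an example; the variable names are illustrative only.

```python
import re

# Raw literal: backslashes reach the regex engine exactly as typed.
raw = r'\\[xX][\dA-Fa-f]{2}'
# Non-raw literal: the same pattern needs every backslash doubled.
doubled = '\\\\[xX][\\dA-Fa-f]{2}'
same_pattern = (raw == doubled)

# Without the prefix, Python decodes these escapes before re ever sees them.
plain_hex = '\x61'    # the single character "a"
raw_hex = r'\x61'     # four characters: backslash, x, 6, 1
plain_backref = '\1'  # the control character chr(1), not a backreference
raw_backref = r'\1'   # two characters: backslash, 1

# The raw pattern matches the literal four-character escape.
found = re.search(raw, raw_hex) is not None

print(same_pattern, len(plain_hex), len(raw_hex), found)
```

This is probably why the un-prefixed patterns in the question matched nothing: in a normal string literal, `\1` becomes a control character and `\\` collapses to a single backslash, so the regex engine receives a different pattern from the one tested in regex101.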

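    【Comments】:

    • Thanks for pointing out the character issue, but unfortunately I get an error with findall saying the returned match object has no groupdict() method. Funnily enough, printing it shows it's a dictionary with no keys. I had a bit more luck with finditer(), but I'm still scratching my head.
    • Sorry, that should have said "finditer". finditer returns match objects; findall only returns strings.
    • With that change I have it working! Now I just need to boil my work down to the DRYest solution.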
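
The findall/finditer distinction from the comments can be seen in a short Python 3 sketch. The pattern reuses the question's group names, with the octal branch simplified as an assumption, and `sample` is a made-up input.

```python
import re

pattern = re.compile(r'(?P<Hex>\\[xX][\dA-Fa-f]{2})|(?P<Oct>\\[0-7]{1,3})')
sample = r'\x61\156'

# findall returns plain data. With two groups in the pattern, each match
# becomes a tuple of strings, with '' for the group that did not take part.
found = pattern.findall(sample)

# finditer yields match objects, so groupdict() is available on each one.
# A group that did not take part shows up as None.
dicts = [m.groupdict() for m in pattern.finditer(sample)]

print(found)
print(dicts)
```

Because findall hands back strings or tuples, calling `groupdict()` on its results raises an AttributeError, which matches the error described in the first comment.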