【Question title】: Parsing a string with nested quotes
【Posted】: 2019-03-14 17:04:50
【Question】:

I need to parse a string that looks like this:

"prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"

I want to get a list like this:

['field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10']

I tried regular expressions, but they fail on the pseudo-SQL substring.

How can I get this list?

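【Comments on the question】:

  • When I put the bottom example into a script, it is not terminated correctly, so I can't tell how you want it split up. There is also a rogue " after 'column3 And', which makes three double quotes in total. How are these supposed to pair up?
  • I fixed the string. I want a list in which each field of the string is one item, including the substring that starts with select and ends with LIMIT 0.
  • Is the SQL part correct? It contains a rogue quote in OVERLAPS column3 And". Also, is the number of fields constant?
  • If the input string can contain arbitrary SQL statements, I don't think this is possible, because those statements may contain any number of embedded quotes and commas.
  • If the number of fields is constant, you can extract the fields to the left and to the right of the query; whatever remains is the SQL query.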

Tags: python quotes string-parsing


【Solution 1】:

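If you know what the SQL string should look like, here is a simple approach.

We match the SQL string and split the rest into a start string and an end string.

We then match the simpler field pattern: build a list from the matches in the start string, append the SQL match, and then append the fields found in the end string.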

import re

# mystring is assumed to hold the input string from the question
sqlmatch = 'select .* LIMIT 0'   # greedy: from the first 'select' to the last 'LIMIT 0'
fieldmatch = r"'(|\w+)'"         # a quoted field: empty, or word characters only
match = re.search(sqlmatch, mystring)
startstring = mystring[:match.start()]
sql = match.group()
endstring = mystring[match.end():]

# fields before the SQL, then the SQL itself, then the fields after it
result = re.findall(fieldmatch, startstring)
result.append(sql)
result.extend(re.findall(fieldmatch, endstring))

The resulting list then looks like this:

['field1',
 '',
 'field2',
 'field3',
 'select ... where (column1 = \'2017\') and (((\'literal1\', \'literal2\', \'literal3\', \'literal4\', \'literal5\', \'literal6\', \'literal7\') OVERLAPS column2 Or (\'literal8\') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = \'literal9\')  LIMIT 0',
 'field5',
 'field6',
 'field7',
 'field8',
 'field9',
 '',
 'field10']

【Comments】:

  • This doesn't preserve the empty strings as in the example (though that doesn't necessarily mean the OP needs them)
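A quick check of the field pattern on its own may help settle the point about empty strings. This is only a sketch: `sample` is a made-up, shortened input, and the second call illustrates a separate limitation, namely that `\w+` does not match fields containing spaces or punctuation.

```python
import re

# Pattern from the answer: between the quotes, either nothing or word characters
fieldmatch = r"'(|\w+)'"

sample = "prefix 'field1', '', 'field2', "   # hypothetical shortened input
found = re.findall(fieldmatch, sample)
print(found)

# A quoted field containing a space is not matched at all
not_found = re.findall(fieldmatch, "'field 1'")
print(not_found)
```

With the pattern as written, the empty alternative appears to keep `''` fields, while fields with non-word characters are silently skipped.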
【Solution 2】:

Since the number of fields is fixed and the non-SQL fields contain no embedded quotes, there is a simple three-line solution:

# string is assumed to hold the input string from the question
prefix, other = string.partition(' ')[::2]
fields = other.strip("'").split("', '")
# rejoin the SQL pieces with the same separator that the split removed
fields[4:-7] = ["', '".join(fields[4:-7])]

print(fields)

Output:

['field1', '', 'field2', 'field3', "select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ", 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10']
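The split-and-rejoin idea can be wrapped in a small helper whose field counts are parameters. This is a sketch under assumptions: `split_fixed`, `n_left`, `n_right` and the shortened `sample` are hypothetical names, and the SQL pieces are rejoined with the same `', '` separator used for splitting so that the commas between SQL literals survive.

```python
def split_fixed(text, n_left=4, n_right=7):
    """Split "prefix 'a', 'b', ..." when the number of plain fields on each
    side of the SQL field is known in advance (hypothetical helper)."""
    sep = "', '"
    _prefix, _, body = text.partition(" ")      # drop the leading prefix word
    parts = body.strip("'").split(sep)          # also splits inside the SQL
    end = len(parts) - n_right                  # index where the right-hand fields start
    parts[n_left:end] = [sep.join(parts[n_left:end])]  # glue the SQL back together
    return parts

sample = "prefix 'a', '', 'select x in ('p', 'q') LIMIT 0 ', 'b', 'c'"
result = split_fixed(sample, n_left=2, n_right=2)
print(result)
```

Note that `strip("'")` removes every leading and trailing quote, so this sketch assumes the first and last fields are non-empty.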

【Comments】:

【Solution 3】:

As others have pointed out, your string is malformed, so I used this one:

    mystr = """prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"""
    
    found = [a.replace("'", '').replace(',', '') for a in mystr.split(' ') if "'" in a]
    

This returns:

    ['field1',
     '',
     'field2',
     'field3',
     'select',
     '2017)',
     '(((literal1',
     'literal2',
     'literal3',
     'literal4',
     'literal5',
     'literal6',
     'literal7)',
     '(literal8)',
     'literal9)',
     '',
     'field5',
     'field6',
     'field7',
     'field8',
     'field9',
     '',
     'field10']
    

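【Comments】:

  • Your output doesn't look like what the OP asked for. The select statement should not be broken up at all.
【Solution 4】:

If the number of fields is constant, you can do this: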

    def splitter(string):
        strip_chars = "\"' "
        string = string[len('prefix '):] # remove the prefix
        left_parts = string.split(',', 4) # only split up to 4 times
        for i in left_parts[:-1]:
            yield i.strip(strip_chars) # return what we've found so far
        right_parts = left_parts[-1].rsplit(',', 7) # only split up to 7 times starting from the right
        for i in right_parts:
            yield i.strip(strip_chars) # return the rest
    
    mystr = """prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"""
    result = list(splitter(mystr))
    print(repr(result))
    
    
    # result:
    [
        'field1',
        '',
        'field2',
        'field3',
        'select ... where (column1 = \'2017\') and (((\'literal1\', \'literal2\', \'literal3\', \'literal4\', \'literal5\', \'literal6\', \'literal7\') OVERLAPS column2 Or (\'literal8\') OVERLAPS column3 And" (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = \'literal9\')  LIMIT 0',
        'field5',
        'field6',
        'field7',
        'field8',
        'field9',
        '',
        'field10'
    ]
    

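【Comments】:

  • If the remaining fields can contain commas, this needs a parser rather than a simple split.
【Solution 5】:

The commas that actually separate fields sit at an even quote level. So, by changing those commas to \n characters, you can apply a simple .split("\n") to the string to get the field values. After that, you only need to clean up the field values by removing leading/trailing whitespace and quotes.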

    from itertools import accumulate
    
    string      = "prefix 'field1', '', 'field2', 'field3', 'select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 ', 'field5', 'field6', 'field7', 'field8', 'field9', '', 'field10'"
    prefix,data = string.split(" ",1)                   # remove prefix
    quoteLevels = accumulate( c == "'" for c in data )  # compute quote levels for each character
    fieldData   = "".join([ "\n" if c=="," and q%2 == 0 else c for c,q in zip(data,quoteLevels) ]) # comma to \n at even quote levels
    fields      = [ f.strip().strip("'") for f in fieldData.split("\n") ] # split and clean content
    
    for i,field in enumerate(fields): print(i,field)
    

This prints:

    0 field1
    1 
    2 field2
    3 field3
    4 select ... where (column1 = '2017') and ((('literal1', 'literal2', 'literal3', 'literal4', 'literal5', 'literal6', 'literal7') OVERLAPS column2 Or ('literal8') OVERLAPS column3 And (column4 > 0.0 Or column6 > 0.0)) And column7 IN_COMMUNITY [int1] And column5 = 'literal9')  LIMIT 0 
    5 field5
    6 field6
    7 field7
    8 field8
    9 field9
    10 
    11 field10
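The same even-quote-count idea can also be sketched as a single character scan, which avoids relying on "\n" never appearing in the data. The helper name `split_top_level` and the sample input are hypothetical. The cleanup here removes exactly one enclosing quote pair instead of calling `strip("'")`, on the assumption that a field may itself end in a quoted literal.

```python
def split_top_level(data):
    """Split on commas seen after an even number of quotes (hypothetical helper)."""
    fields, current, quotes = [], [], 0
    for ch in data:
        if ch == "'":
            quotes += 1
        if ch == "," and quotes % 2 == 0:
            # an even count means we are outside every quoted field
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))

    cleaned = []
    for f in fields:
        f = f.strip()
        # remove only the outermost quote pair, keeping inner quotes intact
        if len(f) >= 2 and f[0] == "'" and f[-1] == "'":
            f = f[1:-1]
        cleaned.append(f)
    return cleaned

sample = "'a', '', 'select x where y = '1' and z in ('p', 'q') ', 'b'"
result = split_top_level(sample)
print(result)
```

Like the answer's version, this depends on quotes inside the SQL being balanced and on quoted literals containing no commas.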
    

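【Comments】:

  • This doesn't work when a field contains quotes or commas, e.g. if 'literal1' were 'bob,dole''bob\'s' instead.
  • A problem only arises when the embedded quoted text contains a comma or when a field contains unbalanced quotes. Fields that merely contain commas are not a problem. In any case, there is no point in speculating beyond the example provided, because the data format is full of potential inconsistencies and cannot support arbitrary text content.
  • I don't disagree; I just think it's worth noting the solution's limitations: if someone else who needs those features comes along, it will help them avoid going down the wrong path. And it will fail if any of the literal fields contains a comma. It relies on those literal strings not containing commas, because the parse is off at that depth.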
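The limitation discussed in these comments can be reproduced with a small check. The helper name and both sample strings are made up for illustration; the quote counting follows the `accumulate` approach from the answer.

```python
from itertools import accumulate

def top_level_comma_positions(data):
    """Indices of commas that the even-quote-count rule treats as separators."""
    levels = accumulate(c == "'" for c in data)   # running count of quotes seen
    return [i for i, (c, q) in enumerate(zip(data, levels))
            if c == "," and q % 2 == 0]

ok = "'a', 'x in ('p', 'q')', 'b'"     # comma between two quoted literals
bad = "'a', 'x in ('p,q')', 'b'"       # comma inside one quoted literal

ok_commas = top_level_comma_positions(ok)
bad_commas = top_level_comma_positions(bad)
print(len(ok_commas) + 1, "fields for ok")
print(len(bad_commas) + 1, "fields for bad")
```

For `ok` the rule finds the two real separators, while for `bad` the comma inside 'p,q' is also counted, so the string would be split into four fields instead of three.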