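Convert substrings to dict: answers

【Question title】: Convert substrings to dict
【Posted】: 2018-07-25 21:43:10
【Question description】:

I'm looking for an elegant way to convert a list of substrings, and the text between them, into key/value pairs in a dict. Example: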

s = 'k1:some text k2:more text k3:and still more'
key_list = ['k1','k2','k3']
# ... (missing code) ...
# s_dict = {'k1':'some text', 'k2':'more text', 'k3':'and still more'}

This could be solved with str.find() and the like, but I'm sure there is a better solution than what I've hacked together.
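For reference, a str.find()-based baseline might look roughly like the sketch below. This is only a guess at the kind of approach the question alludes to, not the asker's actual code, and the name find_based_parse is made up for illustration. It assumes each key occurs at most once as "key:" and that whitespace around values should be stripped.

```python
def find_based_parse(s, key_list):
    # Locate each "key:" marker; keys that do not occur are skipped.
    positions = []
    for key in key_list:
        idx = s.find(key + ':')
        if idx != -1:
            positions.append((idx, key))
    positions.sort()

    result = {}
    for n, (idx, key) in enumerate(positions):
        start = idx + len(key) + 1  # skip past "key:"
        # The value runs up to the next marker, or to the end of the string.
        end = positions[n + 1][0] if n + 1 < len(positions) else len(s)
        result[key] = s[start:end].strip()
    return result


s = 'k1:some text k2:more text k3:and still more'
key_list = ['k1', 'k2', 'k3']
s_dict = find_based_parse(s, key_list)
print(s_dict)
```

Because the search looks for the key together with its colon, a bare k1 inside a value is left alone.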

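【Question discussion】:

So, are the keys words without spaces?
@cᴏʟᴅsᴘᴇᴇᴅ Assume they are known from the list (without the :). I've edited the code.
Well, it isn't easy to work out in code where a value ends and the next key begins.
s = 'k1:some text k2:I was on the k1 once k3:and still more' ?
@PatrickArtner I'd want that parsed as {... k2:'I was on the k1 once' ...}. If : is reserved syntax, this is still well defined.

Tags: python string dictionary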

【Solution 1】:

Option 1
If the keys contain no spaces or colons, you can simplify your solution with dict + re.findall (import re first):

>>> dict(re.findall(r'(\S+):(.*?)(?=\s\S+:|$)', s))
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}

Only the position of the colons (:) determines how keys and values are matched.

Details

(\S+)   # capture the key (one or more non-whitespace characters)
:       # literal colon (matched, but not captured)
(.*?)   # non-greedy match of zero or more characters - this captures the value
(?=     # lookahead that decides where the value stops
\s      # a whitespace character
\S+:    # one or more non-whitespace characters followed by a colon
|       # regex OR
$)      # end of the string

Note that this code assumes the structure shown in the question. It will fail on strings whose structure is invalid.
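To make the pattern easier to read, here is a minimal sketch of the same regex written with re.VERBOSE and wrapped in a helper. The names PAIR_RE and parse_pairs are made up for illustration and are not part of the original answer. The second call shows what the structural assumption means in practice: a value that contains something like word: is split into an extra key rather than raising an error.

```python
import re

# Same pattern as in Option 1, spelled out with re.VERBOSE.
PAIR_RE = re.compile(r"""
    (\S+)             # key: one or more non-whitespace characters
    :                 # literal colon separating key and value
    (.*?)             # value: as few characters as possible
    (?= \s\S+: | $ )  # stop before " nextkey:" or at the end of the string
""", re.VERBOSE)


def parse_pairs(s):
    # findall returns a list of (key, value) tuples, which dict() accepts.
    return dict(PAIR_RE.findall(s))


print(parse_pairs('k1:some text k2:more text k3:and still more'))

# A colon inside a value violates the assumed structure:
# "http" is treated as a key of its own.
print(parse_pairs('k1:see http://x k2:b'))
```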

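Option 2
Look ma, no regex...
This makes the same assumptions as above.

Split on the colon (:)
Every element except the first and the last needs to be split again on whitespace (to separate keys from values)
Zip adjacent elements and convert to a dict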

v = s.split(':')
v[1:-1] = [j for i in v[1:-1] for j in i.rsplit(None, 1)]

dict(zip(v[::2], v[1::2]))
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}

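【Discussion】:

Impressive. Do regexes still make your eyes bleed, or does it get better over time? :)
@ConfusinglyCuriousTheThird Regex has its moments. To play with fire, you either have to be a pyromaniac or be prepared to get burned. ;)
@ConfusinglyCuriousTheThird Added option 2 for you.
This is nice! I wish I could split on key: instead, so that a : inside a value wouldn't mess things up...
Haha :) I guess regex goes at the end of a long list of things to do!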

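【Solution 2】:

If there are no spaces or colons in the keys, you can:

split on word characters followed by a colon to get the tokens
zip half-shifted slices in a dict comprehension to rebuild the dict

Like this: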

import re
import itertools

s = 'k1:some text k2:more text k3:and still more'
toks = [x for x in re.split(r"(\w+):", s) if x]  # we need to filter off empty tokens
# toks => ['k1', 'some text ', 'k2', 'more text ', 'k3', 'and still more']
d = {k: v for k, v in zip(itertools.islice(toks, None, None, 2), itertools.islice(toks, 1, None, 2))}
print(d)

Result:

{'k2': 'more text ', 'k1': 'some text ', 'k3': 'and still more'}

Using itertools.islice avoids creating sub-lists the way toks[::2] would.
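Another way to pair up the tokens without building sub-lists is to zip a single iterator with itself. This is a different technique from the islice call in the answer, shown here only as a sketch. It also strips the trailing whitespace that the split leaves on the values, on the assumption that the padding is not wanted.

```python
import re

s = 'k1:some text k2:more text k3:and still more'
toks = [x for x in re.split(r"(\w+):", s) if x]  # drop the empty leading token

# zip(it, it) pulls two consecutive items per step: (key, value), (key, value), ...
it = iter(toks)
d = {k: v.strip() for k, v in zip(it, it)}
print(d)
```

Like the original, this relies on every key having a non-empty value; an empty value would be filtered out and shift the pairing.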

【Discussion】:

【Solution 3】:

Another bit of regex magic, splitting the input string into key/value pairs:

import re

s = 'k1:some text k2:more text k3:and still more'
pat = re.compile(r'\s+(?=\w+:)')
result = dict(i.split(':', 1) for i in pat.split(s))  # maxsplit=1 keeps any ':' inside a value

print(result)

Output:

{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}

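When an expression will be used several times in a single program, it is more efficient to use re.compile() and save the resulting regular expression object for reuse.
\s+(?=\w+:) - the key pattern: it splits the input string on whitespace \s+ only when that whitespace is followed by a "key" (a word \w+ and a colon :). (?=...) is a positive lookahead assertion.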
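To see what the lookahead split does on its own, the sketch below prints the intermediate chunks before they are turned into pairs, and reuses the compiled pattern for a second string. The names KEY_BOUNDARY and to_dict are made up for illustration, and the second input string is an invented example.

```python
import re

# Whitespace that is immediately followed by "word:" marks a boundary between pairs.
KEY_BOUNDARY = re.compile(r'\s+(?=\w+:)')


def to_dict(s):
    # Split into "key:value" chunks, then split each chunk on its first colon only.
    return dict(chunk.split(':', 1) for chunk in KEY_BOUNDARY.split(s))


s = 'k1:some text k2:more text k3:and still more'
chunks = KEY_BOUNDARY.split(s)
print(chunks)       # the lookahead consumes nothing, so each key stays attached to its chunk
print(to_dict(s))

# The compiled pattern is reused here; a colon inside a value survives
# because only the first colon of each chunk is used as the separator.
print(to_dict('k1:a:b k2:c'))
```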

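【Discussion】:

This is a really nice solution. Thanks for the regex explanation.

【Solution 4】:

If you have a list of known keys (and perhaps values as well, but I don't address that in this answer), you can do it with a regex. There may be a shortcut if, for example, you can simply assert that the last whitespace before a colon definitely marks the start of a key, but this should work too: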

import re

s = 'k1:some text k2:more text k3:and still more'
key_list = ['k1', 'k2', 'k3']
dict_splitter = re.compile(r'(?P<key>({keys})):(?P<val>.*?)(?=({keys})|$)'.format(keys=')|('.join(key_list)))
result = {match.group('key'): match.group('val') for match in dict_splitter.finditer(s)}
print(result)
# {'k1': 'some text ', 'k2': 'more text ', 'k3': 'and still more'}

Explanation:

(?P<key>({keys}))  # match all the defined keys, call that group 'key'
:                  # match a colon
(?P<val>.*?)       # match anything that follows and call it 'val', but
                   # only as much as necessary..
(?=({keys})|$)     # .. as long as whatever follows is either a new key or 
                   # the end of the string
.format(keys=')|('.join(key_list))
                   # build a string out of the keys where all the keys are
                   # 'or-chained' after one another, format it into the
                   # regex wherever {keys} appears.

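Caveat 1: if your keys can contain one another, order matters, and you may want to sort from long keys to short ones so that the longest match is forced first: key_list.sort(key=len, reverse=True)

Caveat 2: if your key list contains regex metacharacters, they will break the expression, so the keys may need to be escaped first: key_list = [re.escape(key) for key in key_list]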
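Both caveats can be folded into a small helper, sketched below. The function names build_splitter and parse_known_keys, and the sample keys, are made up for illustration. Two things differ from the pattern in the answer and are assumptions of this sketch: the lookahead also requires the colon after the next key, and values are stripped of surrounding whitespace.

```python
import re


def build_splitter(key_list):
    # Caveat 1: try longer keys first. Caveat 2: escape regex metacharacters.
    ordered = sorted(key_list, key=len, reverse=True)
    keys = '|'.join(re.escape(k) for k in ordered)
    # Assumption: the next key must be followed by a colon to end the value.
    return re.compile(
        r'(?P<key>{keys}):(?P<val>.*?)(?=(?:{keys}):|$)'.format(keys=keys)
    )


def parse_known_keys(s, key_list):
    splitter = build_splitter(key_list)
    return {m.group('key'): m.group('val').strip() for m in splitter.finditer(s)}


# 'k1' is a prefix of 'k10', and 'c++' contains regex metacharacters.
result = parse_known_keys('k10:ten k1:one c++:lang', ['k1', 'k10', 'c++'])
print(result)
```

Without re.escape, the key c++ would make re.compile raise an error, since ++ is not a valid repetition.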

【Discussion】:

【Solution 5】:

This version is a bit verbose but straightforward; it needs no libraries and takes key_list into account:

def substring_to_dict(text, keys, key_separator=':', block_separator=' '):
    s_dict = {}
    current_key = None

    for block in text.split(block_separator):
        if key_separator in block:
            key, word = block.split(key_separator, 1)
            if key in keys:
                current_key = key
                block = word
        if current_key:
            s_dict.setdefault(current_key, []).append(block)

    return {key:block_separator.join(s_dict[key]) for key in s_dict}

Here are some examples:

>>> keys = {'k1','k2','k3'}
>>> substring_to_dict('k1:some text k2:more text k3:and still more', keys)
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}
>>> substring_to_dict('k1:some text k2:more text k3:and still more k4:not a key', keys)
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more k4:not a key'}
>>> substring_to_dict('', keys)
{}
>>> substring_to_dict('not_a_key:test', keys)
{}
>>> substring_to_dict('k1:k2:k3 k2:k3:k1', keys)
{'k1': 'k2:k3', 'k2': 'k3:k1'}
>>> substring_to_dict('k1>some;text;k2>more;text', keys, '>', ';')
{'k1': 'some;text', 'k2': 'more;text'}

【Discussion】:

【Solution 6】:

It's not a good idea, but for completeness, ast.literal_eval is also an option in this case:

from ast import literal_eval
s = 'k1:some text k2:more text k3:and still more'
key_list = ['k1','k2','k3']
s_ = s
for k in key_list:
    s_ = s_.replace('{}:'.format(k), '","{}": "'.format(k))

s_dict = literal_eval('{{{}"}}'.format(s_[2:]))

print(s_dict)

Output:

{'k1': 'some text ', 'k2': 'more text ', 'k3': 'and still more'}
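One likely reason this is not a good idea: the approach builds Python source text out of the input, so any value that contains a double quote produces an invalid literal. The sketch below wraps the same steps in a function (the name parse_with_literal_eval and the second input string are made up for illustration) and shows the failure being caught.

```python
from ast import literal_eval


def parse_with_literal_eval(s, key_list):
    # Same steps as above: turn 'k:' markers into '","k": "' and wrap in braces.
    s_ = s
    for k in key_list:
        s_ = s_.replace('{}:'.format(k), '","{}": "'.format(k))
    return literal_eval('{{{}"}}'.format(s_[2:]))


ok = parse_with_literal_eval('k1:some text k2:more text', ['k1', 'k2'])
print(ok)

# A double quote inside a value breaks the generated dict literal.
try:
    parse_with_literal_eval('k1:say "hi" k2:more text', ['k1', 'k2'])
    failed = False
except (SyntaxError, ValueError) as exc:
    failed = True
    print('could not parse:', type(exc).__name__)
```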

【Discussion】: