How to transform a list of dictionaries into another format more elegantly? Answers

【Question title】: How to transform a list of dictionaries into another format more elegantly?
【Posted】: 2020-05-21 11:56:53
【Question description】:

I have a JSON file that contains some information about words. The structure is a list of dicts, like this:

file = [{"index": "1", "text": "uhm", "eos": false}, {"index": "2", "text": "moeten", "eos": false}, {"index": "3", "text": "langs", "eos": false}, {"index": "4", "text": "uhm", "eos": true}, {"index": "1", "text": "uh", "eos": false}, {"index": "2", "text": "om", "eos": false}, {"index": "3", "text": "die", "eos": false}, {"index": "4", "text": "afsluiters", "eos": true}]

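I need to preprocess the data for further analysis, so I wrote the following function. It works fine, but it doesn't look very elegant. How can I improve it to make it more readable, less redundant, and nicer to look at? =)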

def prepare(file):

    # set up variables
    text = []
    sent_dict = {}
    sentence = ""
    chunks = []
    ngram = ""
    maxn = 5

    for word in file:

        if word["eos"] == False:
            # concatenate words
            sentence += word["text"] + " "

            # get last five elements of sentence excluding last space and make chunk
            chunk = " ".join(sentence.split(" ")[:-1][-maxn:])
            index = word["index"]
            chunks.append({index: {"ngram" : chunk}})

        else:
           # concatenate words without last space
           sentence += word["text"]

           # get last five elements of sentence and make chunk
           chunk = " ".join(sentence.split(" ")[-maxn:])
           index = word["index"]
           chunks.append({index: {"ngram" : chunk}})

           # make dict with sentence and list of chunks
           sent_dict["sentence"] = sentence
           sent_dict["chunks"] = chunks
           text.append(sent_dict)

           # set variables back to default
           sent_dict = {}
           sentence = ""
           chunks = []

    return(text)

If you run prepare(file), it returns a list like the following:

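[{'sentence': 'uhm moeten langs uhm', 'chunks': [{'1': {'ngram': 'uhm'}}, {'2': {'ngram': 'uhm moeten'}}, {'3': {'ngram': 'uhm moeten langs'}}, {'4': {'ngram': 'uhm moeten langs uhm'}}]}, {'sentence': 'uh om die afsluiters', 'chunks': [{'1': {'ngram': 'uh'}}, {'2': {'ngram': 'uh om'}}, {'3': {'ngram': 'uh om die'}}, {'4': {'ngram': 'uh om die afsluiters'}}]}]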

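【Comments on the question】:

Please show us sample output from the function you wrote. Also show us an example of the output you want.
If you plug the list at the top of the question into the function, it returns exactly the output I want. It is a working example.
Yes, but please post an example anyway. Many people can come up with a solution without reading the code.

Tags: python list dictionary for-loop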

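【Solution 1】:

I assumed every sentence has 4 chunks. If that is not the case, I am sure you can easily adapt my code, but for now it is hard-coded to 4 items; that can certainly be changed. I decided to output the information as a list. For me, lists are easier to work with and play around with than dictionaries, so I built it as a list containing items such as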

sentence,uhm moeten langs uhm : sentence is made up of the following chunks : 1,uhm : 2,uhm moeten : 3,uhm moeten langs : 4,uhm moeten langs uhm

The next item in the list would be

sentence,uh om die afsluiters : sentence is made up of the following chunks : 1,uh : 2,uh om : 3,uh om die : 4,uh om die afsluiters

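I did it this way because the string is easy to split, so you can easily get each item you need. For example, you can split on

" : "

and then loop over the parts and split each one on

","

to get the individual items.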
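As a minimal sketch of the two-step split described above, using the first sample string from this answer: the variable names `item`, `parts`, and `pairs` are made up for illustration, and the fixed descriptive element at position 1 is dropped before the second split because it contains no comma. Passing `1` as the second argument to `split` is an extra precaution, not something the answer requires, so that only the first comma separates key and value.

```python
item = ("sentence,uhm moeten langs uhm : sentence is made up of the following chunks"
        " : 1,uhm : 2,uhm moeten : 3,uhm moeten langs : 4,uhm moeten langs uhm")

# First split on " : " to get the top-level parts.
parts = item.split(" : ")

# The element at position 1 is always the fixed descriptive text, so drop it.
parts.pop(1)

# Then split each remaining part on the first "," into a (key, value) pair.
pairs = [tuple(part.split(",", 1)) for part in parts]

for key, value in pairs:
    print(key, "->", value)
```

Each pair then holds either the label `sentence` with the full sentence, or a chunk index with its n-gram.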

In the end, your code looks like this for me.

def prepare(file):

    # set up variables
    text = []
    sent_dict = {}
    sentence = ""
    chunks = []
    ngram = ""
    maxn = 5

    for word in file:

        if word["eos"] == False:
            # concatenate words
            sentence += word["text"] + " "


            chunk = " ".join(sentence.split(" ")[:-1][-maxn:])
            index = word["index"]
            chunks.append({index: {"ngram" : chunk}})

        else:

            sentence += word["text"]

            chunk = " ".join(sentence.split(" ")[-maxn:])
            index = word["index"]
            chunks.append({index: {"ngram" : chunk}})

            sent_dict["sentence"] = sentence
            sent_dict["chunks"] = chunks
            text.append(sent_dict)

            sent_dict = {}
            sentence = ""
            chunks = []

    return(text)



file = [{"index": "1", "text": "uhm", "eos": False}, {"index": "2", "text": "moeten", "eos": False}, {"index": "3", "text": "langs", "eos": False}, {"index": "4", "text": "uhm", "eos": True}, {"index": "1", "text": "uh", "eos": False}, {"index": "2", "text": "om", "eos": False}, {"index": "3", "text": "die", "eos": False}, {"index": "4", "text": "afsluiters", "eos": True}]



final_list = []
x = (prepare(file))
for i in x:
    new_string = "sentence,{} : sentence is made up of the following chunks : 1,{} : 2,{} : 3,{} : 4,{}".format(i["sentence"], i["chunks"][0]["1"]["ngram"], i["chunks"][1]["2"]["ngram"], i["chunks"][2]["3"]["ngram"], i["chunks"][3]["4"]["ngram"])
    final_list.append(new_string)

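Keep in mind that the list of items formatted my way is called final_list. If you loop over it and print each item, you will see what I showed you above. I hope this is easier to work with.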
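The format string above is hard-coded to four chunks. A possible way to lift that restriction, sketched here under the assumption that each entry has the shape returned by prepare() (a "sentence" string plus a "chunks" list of one-key dicts), is to build the chunk parts in a loop and join them. The helper name `format_sentence` and the hand-written `sample` list are hypothetical and only serve the illustration.

```python
def format_sentence(entry):
    """Build the flat string for one item in the shape returned by prepare()."""
    chunk_parts = []
    for chunk in entry["chunks"]:
        # Each chunk is a one-key dict such as {"1": {"ngram": "uhm"}}.
        for index, info in chunk.items():
            chunk_parts.append("{},{}".format(index, info["ngram"]))
    head = "sentence,{} : sentence is made up of the following chunks".format(
        entry["sentence"]
    )
    return " : ".join([head] + chunk_parts)


# Hand-written sample mirroring the first sentence from the question.
sample = [
    {
        "sentence": "uhm moeten langs uhm",
        "chunks": [
            {"1": {"ngram": "uhm"}},
            {"2": {"ngram": "uhm moeten"}},
            {"3": {"ngram": "uhm moeten langs"}},
            {"4": {"ngram": "uhm moeten langs uhm"}},
        ],
    }
]

final_list = [format_sentence(entry) for entry in sample]
print(final_list[0])
```

Because the chunk parts are joined rather than indexed one by one, the same helper should work for sentences with any number of chunks.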

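【Comments】:

Keep in mind that the "sentence is made up of the following chunks" part of the output can also be removed easily after splitting the original string: it always ends up at position 1 of the split list, so you can pop it whenever you need to. I hope the code helps.
Thanks for your answer. The idea is nice and seems useful for some tasks. However, I need the specified output and don't want to change the output format, only to improve the function that returns it. Also, a sentence can contain more than 4 chunks, and the number of chunks varies from sentence to sentence.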
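Following up on that last comment, here is a minimal sketch of one way to keep the output format from the question while removing the duplicated branches. It is not from the original thread. It assumes every "text" value is a single word without spaces, and it keeps the words of the current sentence in a list (named `tokens` here) instead of repeatedly splitting a string. Turning `maxn` into a parameter is also a choice made for this sketch.

```python
def prepare(words, maxn=5):
    """Group words into sentences and record the trailing n-gram at each word."""
    text = []
    tokens = []  # words of the sentence currently being built
    chunks = []  # one {index: {"ngram": ...}} entry per word

    for word in words:
        tokens.append(word["text"])
        # The n-gram is made of the last `maxn` words seen so far in this sentence.
        chunks.append({word["index"]: {"ngram": " ".join(tokens[-maxn:])}})

        if word["eos"]:
            # End of sentence: store the result and start a fresh sentence.
            text.append({"sentence": " ".join(tokens), "chunks": chunks})
            tokens, chunks = [], []

    return text


file = [
    {"index": "1", "text": "uhm", "eos": False},
    {"index": "2", "text": "moeten", "eos": False},
    {"index": "3", "text": "langs", "eos": False},
    {"index": "4", "text": "uhm", "eos": True},
    {"index": "1", "text": "uh", "eos": False},
    {"index": "2", "text": "om", "eos": False},
    {"index": "3", "text": "die", "eos": False},
    {"index": "4", "text": "afsluiters", "eos": True},
]

result = prepare(file)
print(result)
```

Since nothing depends on a fixed chunk count, sentences of any length are handled the same way. As in the original function, words that come after the last end-of-sentence marker are not returned.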