【Question title】: Trying to flatten a Reddit JSON into many "conversations"
【Posted】: 2020-01-04 15:17:21
【Question】:

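I am trying to use the comments of a Reddit thread as a training set for a machine learning program. An example input is https://old.reddit.com/r/bayarea/comments/cxxl9y/billionaires_yacht_docked_in_embarcadero.json.

I filter the data down to body, id, and parent_id, and want to turn the nested JSON into many conversations, one per root-to-leaf chain of replies.

For example, if the input is ["A", ["B",["C", "D"]]], the output should be ["A", "B", "C"], ["A","B","D"]

Here is my current code: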

import requests

json_url = "https://old.reddit.com/r/bayarea/comments/cxxl9y/billionaires_yacht_docked_in_embarcadero.json"
r = requests.get(json_url, headers={"user-agent": "PostmanRuntime/7.15.2"})

# fltr is a helper not shown in this post; presumably it keeps only the listed keys
comments_tree_raw = fltr(r.json(), ["ups", "body", "id", "parent_id"])[1]["data"]

# flatten is defined below
comments_tree_raw = flatten([], comments_tree_raw["children"])

def remove_all_after(node, index):
    target = node.index(index)
    return node[:target]




training_threads = []
# input the children list
def flatten(output, children):
    global training_threads


    for child in children:
        try:
            child_obj = child["data"] if "body" in child["data"] else child
            child_comment = {
                "body": child_obj["body"],
                "id": child_obj["id"],
                "parent": child_obj["parent_id"]
            }
            output.append(child_comment)
        except KeyError:
            continue

        if "replies" not in child["data"]:

            training_threads.append(output.copy())

            parent_id = child_comment["parent"].split("_")[1]
            for i in output:
                if i["id"] == parent_id:
                    output = remove_all_after(output, i)
                    break


            continue

        flatten(output, child["data"]["replies"]["data"]["children"])

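Here I tried to solve the problem recursively, but it does not produce the output I need. This is the output I get: https://pastebin.com/GkpwGUtK

Any help is much appreciated. Thanks!

【Comments】:

    Tags: python list recursion reddit flatten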


    【Solution 1】:

    You can use simple recursion with a generator:

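    data = ["A", ["B", ["C", "D"]]]

    def group(d, c=[]):
        # d is a [head, rest] pair; c is the path so far (never mutated, so the default is safe)
        a, b = d
        if all(not isinstance(i, list) for i in b):
            # rest holds only leaves: emit one conversation per leaf
            yield from [c + [a, i] for i in b]
        else:
            # rest is another [head, rest] pair: descend with the head appended
            yield from group(b, c + [a])

    print(list(group(data)))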
    

    Output:

    [['A', 'B', 'C'], ['A', 'B', 'D']]
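
The same generator idea can be sketched directly against Reddit-shaped comment dicts instead of nested lists. This is a minimal sketch, not the asker's or the answerer's code: the function name `conversations` and the hand-made `sample` tree are hypothetical, and it assumes each child looks like `{"data": {"body", "id", "parent_id", "replies"}}`, where `replies` is either a nested listing dict or an empty string when there are no replies.

```python
def conversations(children, path=()):
    """Yield every root-to-leaf chain of comments as a list of dicts."""
    for child in children:
        data = child.get("data", {})
        if "body" not in data:
            # Entries without a body (e.g. "more" placeholders) are skipped
            continue
        comment = {"body": data["body"], "id": data["id"], "parent": data["parent_id"]}
        new_path = path + (comment,)  # a new tuple, so sibling branches never share state
        replies = data.get("replies")
        # Assumption: a comment without replies has "" (or no key) here
        reply_children = replies["data"]["children"] if isinstance(replies, dict) else []
        sub = list(conversations(reply_children, new_path))
        if sub:
            yield from sub
        else:
            # No usable replies below this comment: the path ends here
            yield list(new_path)


# Hypothetical miniature tree mirroring ["A", ["B", ["C", "D"]]]
sample = [
    {"data": {"body": "A", "id": "a", "parent_id": "t3_x", "replies": {"data": {"children": [
        {"data": {"body": "B", "id": "b", "parent_id": "t1_a", "replies": {"data": {"children": [
            {"data": {"body": "C", "id": "c", "parent_id": "t1_b", "replies": ""}},
            {"data": {"body": "D", "id": "d", "parent_id": "t1_b", "replies": ""}},
        ]}}}},
    ]}}}},
]

result = [[c["body"] for c in conv] for conv in conversations(sample)]
print(result)
```

Because each recursive call receives its own immutable `path`, no trimming of a shared list is needed after a leaf is reached.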
    

    Edit: a more complete version using itertools.groupby:

    from itertools import groupby
    def group(d, c = []):
      _d = [list(b) for _, b in groupby(d, key=lambda x:isinstance(x, list))]
      if len(_d) == 1:
        for i in _d[0]:
          if not isinstance(i, list):
             yield c+[i]
          else:
             yield from group(i, c)
      else:
         for i in range(0, len(_d), 2):
           for k in _d[i]:
             yield from group(_d[i+1], c+[k])
    
    print(list(group([["C", ["D", "E"], ["C", ["D", "E"], ["C", ["D", "E"]]]]])))
    

    Output:

    [['C', 'D'], ['C', 'E'], ['C', 'C', 'D'], ['C', 'C', 'E'], ['C', 'C', 'C', 'D'], ['C', 'C', 'C', 'E']]
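
Since the question keeps `id` and `parent_id` for every comment, another option is to skip the nesting entirely and rebuild each conversation by walking parent links upward from every leaf. The following is only a sketch under assumptions: the name `chains_from_flat` and the `flat` sample are hypothetical, the input is a flat list of dicts with `id`, `parent`, and `body` keys (as in the question's `child_comment`), and parent ids carry a type prefix such as `t1_` or `t3_` that is stripped before lookup.

```python
def chains_from_flat(comments):
    """Build root-to-leaf chains of bodies from flat comment dicts."""
    by_id = {c["id"]: c for c in comments}
    # Strip the type prefix ("t1_", "t3_", ...) so parents can be looked up by bare id
    parent_of = {c["id"]: c["parent"].split("_", 1)[-1] for c in comments}
    # Ids that at least one other known comment replies to
    has_reply = {p for p in parent_of.values() if p in by_id}
    chains = []
    for c in comments:
        if c["id"] in has_reply:
            continue  # not a leaf, so it only appears inside longer chains
        chain, cur = [], c["id"]
        while cur in by_id:  # stops at the submission id, which is not a comment
            chain.append(by_id[cur]["body"])
            cur = parent_of[cur]
        chains.append(chain[::-1])  # reverse so the chain runs root to leaf
    return chains


# Hypothetical flat input equivalent to ["A", ["B", ["C", "D"]]]
flat = [
    {"id": "a", "parent": "t3_x", "body": "A"},
    {"id": "b", "parent": "t1_a", "body": "B"},
    {"id": "c", "parent": "t1_b", "body": "C"},
    {"id": "d", "parent": "t1_b", "body": "D"},
]
chains = chains_from_flat(flat)
print(chains)
```

This trades recursion for dictionary lookups, which may be easier to debug when the nesting in the raw JSON is irregular.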
    

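    【Discussion】:

    • @VidurGupta Could you post your new input and the desired output?
    • @VidurGupta That is strange. When I run my code on those input samples I get the desired output, i.e. for ["A", ["B",["C", "D", "E"]]] I get [['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'B', 'E']]. Could you clarify? Thanks.
    • @VidurGupta Thanks for providing the input! In the data, some input lists have only one element, e.g. ['Copy that.']. What output do you want for single-element lists?
    • @VidurGupta No problem, thank you for the comment. However, some elements have three values rather than one or two, i.e. [["C", ["D", "E"], ["C", ["D", "E"], ["C", ["D", "E"]]. How would you like the output for these to be formatted?
    • @VidurGupta Thank you for the comment. Please see my recent edit, where I added a solution that better reflects your input data as a whole.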