【Question Title】: Convert nested JSON to CSV in Python
【Posted】: 2021-05-17 18:50:57
【Question】:

I am trying to convert a complex (nested) JSON document to CSV:

{
"caudal": [
{"ts": 1612746051248, "value": "0.0"}, 
{"ts": 1612745450856, "value": "0.0"}, 
{"ts": 1612744250898, "value": "0.0"}, 
{"ts": 1612743650861, "value": "0.0"}, 
{"ts": 1612743050821, "value": "0.0"} 
], 
"FreeHeap": [
{"ts": 1612746051248, "value": "247564"}, 
{"ts": 1612745450856, "value": "247564"}, 
{"ts": 1612744250898, "value": "247564"}, 
{"ts": 1612743650861, "value": "247564"}, 
{"ts": 1612743050821, "value": "247564"} 
], 
"MinimoFreeHeap": [
{"ts": 1612746051248, "value": "237440"}, 
{"ts": 1612745450856, "value": "237440"}, 
{"ts": 1612744250898, "value": "237440"}, 
{"ts": 1612743650861, "value": "237440"}, 
{"ts": 1612743050821, "value": "237440"} 
]
} 

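The JSON files my program has to process contain many more records, but I shortened this one to simplify the analysis. I tried the pandas library as follows: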

import pandas as pd

with open('read.json') as f_input:
    df = pd.read_json(f_input)

df.to_csv('out.csv', encoding='utf-8', index=False)

I get the following result:

caudal,FreeHeap,MinimoFreeHeap
"{'ts': 1612746051248, 'value': '0.0'}","{'ts': 1612746051248, 'value': '247564'}","{'ts': 1612746051248, 'value': '237440'}"
"{'ts': 1612745450856, 'value': '0.0'}","{'ts': 1612745450856, 'value': '247564'}","{'ts': 1612745450856, 'value': '237440'}"
"{'ts': 1612744250898, 'value': '0.0'}","{'ts': 1612744250898, 'value': '247564'}","{'ts': 1612744250898, 'value': '237440'}"
"{'ts': 1612743650861, 'value': '0.0'}","{'ts': 1612743650861, 'value': '247564'}","{'ts': 1612743650861, 'value': '237440'}"
"{'ts': 1612743050821, 'value': '0.0'}","{'ts': 1612743050821, 'value': '247564'}","{'ts': 1612743050821, 'value': '237440'}"

As you can see, each cell holds something like:

"{'ts': 1612743050821, 'value': '247564'}"

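As I understand it, that is another JSON object. Is there a simple way to add a column named timestamp (ts) and put only the value in the cell where this JSON object currently sits? I believe that would be the right approach: my goal is to convert the information in the JSON to CSV so that third parties (a database or an AI algorithm) can consume it more easily. If you can think of a more convenient approach or format, I am open to changing my original idea. I have to admit I am very new to this world.

I thought about walking through the JSON and converting it by hand, but it is hard to match up the measurements that share the same timestamp.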

【Comments】:

    Tags: python json pandas csv


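    【Solution 1】:

    Nicolas,

    You didn't say how you want the data laid out, so the code below converts it to a tabular format with one column each for machine (not sure that is the right name), ts, and value.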

    import pandas as pd
    import json
    
    with open('read.json') as f_input:
        data = json.load(f_input)
    
    df = pd.DataFrame.from_dict(data, orient='columns')
    
    # one record per (machine, reading) pair
    records = []
    
    for col in df.columns:
      for index, row in df[col].items():  # Series.iteritems() was removed in pandas 2.0
        records.append({'machine': col, 'ts': row['ts'], 'value': row['value']})
        
    # DataFrame.append() was removed in pandas 2.0, so build the frame in one step
    df_new = pd.DataFrame(records, columns=['machine', 'ts', 'value'])
    
    df_new.to_csv('out.csv', encoding='utf-8', index=False)
    

    If you want the timestamps as the index and one column per machine, change the last two lines to this:

    df_new = pd.DataFrame(records, columns=['machine', 'ts', 'value']).pivot(index='ts', columns='machine', values='value')
    
    df_new.to_csv('out.csv', encoding='utf-8')
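
To try the reshaping above without a read.json file, here is a minimal sketch on an inline sample. The `sample` dict and the helper names `to_long` and `to_wide` are made up for illustration. The sketch also assumes that dropping repeated (ts, machine) pairs is acceptable, since `pivot` raises a ValueError when such duplicates exist.

```python
import pandas as pd

# Hypothetical inline sample in the same shape as read.json
sample = {
    "caudal": [
        {"ts": 1612746051248, "value": "0.0"},
        {"ts": 1612745450856, "value": "0.0"},
    ],
    "FreeHeap": [
        {"ts": 1612746051248, "value": "247564"},
        {"ts": 1612745450856, "value": "247564"},
    ],
}


def to_long(data):
    """Flatten {name: [{'ts': ..., 'value': ...}, ...]} into one row per reading."""
    rows = [
        {"machine": name, "ts": reading["ts"], "value": reading["value"]}
        for name, readings in data.items()
        for reading in readings
    ]
    return pd.DataFrame(rows, columns=["machine", "ts", "value"])


def to_wide(long_df):
    """One row per timestamp, one column per machine.

    Assumption: if a (ts, machine) pair repeats, keeping the first is fine.
    """
    deduped = long_df.drop_duplicates(subset=["ts", "machine"])
    return deduped.pivot(index="ts", columns="machine", values="value")


long_df = to_long(sample)
wide_df = to_wide(long_df)
print(long_df)
print(wide_df)
```

Keeping the long table around is often handy for databases, while the wide table is usually easier to read by eye.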
    

    【Comments】:

      【Solution 2】:
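      • According to the timing analysis in this question, pd.DataFrame(df[col].values.tolist()) is the fastest way to normalize a column of single-level dicts, while this answer shows how to handle problematic columns (for example, ones that raise an error when you try .values.tolist()).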
      import pandas as pd
      
      # read the json file
      with open('read.json') as f_input:
          df = pd.read_json(f_input)
      
      # collect the normalized columns from df in a list
      normed_frames = []
      
      # iterate through each column, normalize it, and collect the result
      for col in df.columns:
          normed = pd.DataFrame(df[col].values.tolist())  # normalize the column from df
          normed['type'] = col  # add the original column name as a new column so the associated values can be identified
          normed_frames.append(normed)
      
      # concatenate once; DataFrame.append() was removed in pandas 2.0
      normed_df = pd.concat(normed_frames)
      
      # convert ts to a datetime dtype
      normed_df.ts = pd.to_datetime(normed_df.ts, unit='ms')
      
      # reset the index
      normed_df = normed_df.reset_index(drop=True)
      
      # save this long form to a csv
      normed_df.to_csv('long.csv', index=False)
      
      # display(normed_df)
                              ts   value            type
      0  2021-02-08 01:00:51.248     0.0          caudal
      1  2021-02-08 00:50:50.856     0.0          caudal
      2  2021-02-08 00:30:50.898     0.0          caudal
      3  2021-02-08 00:20:50.861     0.0          caudal
      4  2021-02-08 00:10:50.821     0.0          caudal
      5  2021-02-08 01:00:51.248  247564        FreeHeap
      6  2021-02-08 00:50:50.856  247564        FreeHeap
      7  2021-02-08 00:30:50.898  247564        FreeHeap
      8  2021-02-08 00:20:50.861  247564        FreeHeap
      9  2021-02-08 00:10:50.821  247564        FreeHeap
      10 2021-02-08 01:00:51.248  237440  MinimoFreeHeap
      11 2021-02-08 00:50:50.856  237440  MinimoFreeHeap
      12 2021-02-08 00:30:50.898  237440  MinimoFreeHeap
      13 2021-02-08 00:20:50.861  237440  MinimoFreeHeap
      14 2021-02-08 00:10:50.821  237440  MinimoFreeHeap
      
      • Use .pivot to align the data, with ts as the index.
      # pivot normed_df to a wide format
      dfp = normed_df.pivot(index='ts', columns='type', values='value')
      
      # display(dfp)
      type                    FreeHeap MinimoFreeHeap caudal
      ts                                                    
      2021-02-08 00:10:50.821   247564         237440    0.0
      2021-02-08 00:20:50.861   247564         237440    0.0
      2021-02-08 00:30:50.898   247564         237440    0.0
      2021-02-08 00:50:50.856   247564         237440    0.0
      2021-02-08 01:00:51.248   247564         237440    0.0
      
      # save this wide form to a csv
      dfp.reset_index().to_csv('wide.csv', index=False)
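
The wide table above still holds the readings as strings, because that is how they appear in the JSON. As a hedged alternative sketch (not part of the original answer), a dict of Series keyed by ts lets pandas align the timestamps itself and leaves NaN where a series has no reading; `pd.to_numeric` then converts the strings to numbers. The `sample` data, including the deliberately missing FreeHeap reading, is hypothetical.

```python
import pandas as pd

# Hypothetical sample; FreeHeap is missing the second timestamp on purpose
sample = {
    "caudal": [
        {"ts": 1612746051248, "value": "0.0"},
        {"ts": 1612745450856, "value": "0.0"},
    ],
    "FreeHeap": [
        {"ts": 1612746051248, "value": "247564"},
    ],
}

# One Series per key, indexed by ts; the DataFrame constructor aligns them
wide = pd.DataFrame(
    {
        name: pd.Series({reading["ts"]: reading["value"] for reading in readings})
        for name, readings in sample.items()
    }
)

# Convert the string readings to numbers; unparseable values become NaN
wide = wide.apply(pd.to_numeric, errors="coerce")

# Turn the millisecond epoch index into datetimes and sort chronologically
wide.index = pd.to_datetime(wide.index, unit="ms")
wide.index.name = "ts"
wide = wide.sort_index()

print(wide)
```

Whether NaN is the right placeholder for a missing reading depends on the consumer of the CSV; `to_csv` writes it as an empty field by default.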
      

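      【Comments】:

        【Solution 3】:

        I finally found a solution... There is a very interesting library called "cherrypicker". Working from its pandas example and DataFrames, I figured out how to make it work. The code is below: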

        import pandas as pd
        from cherrypicker import CherryPicker
        import json
        
        keys = {'FreeHeap', 'MinimoFreeHeap', 'caudal'} #In the future there will be more keys
        
        with open('read.json') as f_input:
            data = json.load(f_input)
        
        picker = CherryPicker(data)
        pos = 0
        for colum in keys:
            flat = picker[colum].flatten().get()
            df = pd.DataFrame(flat)
            df.columns = ['TimeStamp', colum]  #Rename
            if(pos == 0):
                fin = df
                print(fin)
                pos = 1
            else:
                del df['TimeStamp']            #Remove timestamp because it is repeated
                fin[colum] = df     
                print(fin)
        
        fin.to_csv('out.csv', encoding='utf-8', index=False)
        

        I hope this is useful to someone in the future. I am not sure it is the simplest way, but it works for me! Regards
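
One thing to watch: the loop above appears to attach each column by row index after dropping its TimeStamp, so it relies on every key having the same timestamps in the same order. If that may not hold, a dependency-free sketch using only the standard library can merge by ts instead. The `raw` sample and the `json_to_rows` helper are hypothetical names for illustration; the in-memory buffer keeps the example self-contained, and a file opened with `open('out.csv', 'w', newline='')` could be used in its place.

```python
import csv
import io
import json

# Hypothetical sample; FreeHeap is missing one timestamp on purpose
raw = """
{
  "caudal": [
    {"ts": 1612746051248, "value": "0.0"},
    {"ts": 1612745450856, "value": "0.0"}
  ],
  "FreeHeap": [
    {"ts": 1612746051248, "value": "247564"}
  ]
}
"""


def json_to_rows(data):
    """Return (header, rows) with one row per timestamp, merged across all keys."""
    names = list(data)
    by_ts = {}
    for name, readings in data.items():
        for reading in readings:
            # group every reading under its timestamp
            by_ts.setdefault(reading["ts"], {})[name] = reading["value"]
    header = ["ts"] + names
    rows = [
        [ts] + [by_ts[ts].get(name, "") for name in names]  # "" marks a missing reading
        for ts in sorted(by_ts)
    ]
    return header, rows


data = json.loads(raw)
header, rows = json_to_rows(data)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```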

        【Comments】:
