【Question title】: I need to un-nest JSON array elements and ensure correct mapping with the 'id' column
【Posted】: 2018-10-02 09:42:06
【Question description】:

The input DataFrame "df" looks like this (note the values in the 'id' column):

| id    | name                                                                                  |
|-------|---------------------------------------------------------------------------------------|
| a1xy  | [  {  "event": "sports",   "start": "100"},  {  "event": "lunch",  "start": "121" } ] |
| a7yz  | [  {  "event": "lunch",   "start": "109"},  {  "event": "movie",  "start": "97" } ]   |
| bx4y  | [  {  "event": "dinner",   "start": "78"},  {  "event": "sleep",  "start": "25" } ]   |

I want to flatten the JSON array elements so that my output looks like this:

| id    | name.event | name.start |
|-------|------------|------------|
| a1xy  | sports     | 100        |
| a1xy  | lunch      | 121        |
| a7yz  | lunch      | 109        |
| a7yz  | movie      | 97         |
| bx4y  | dinner     | 78         |
| bx4y  | sleep      | 25         |

The values in the 'id' column need to be mapped correctly to each flattened row. How can I do this in Python?

I tried:

k = df.name.map(json.loads).apply(pd.DataFrame).tolist()
final_df = pd.concat(k)

But with this I cannot map the values from the 'id' column onto the resulting rows.

【Question comments】:

Tags: python arrays json pandas


【Solution 1】:

Assuming your input is a list of JSON objects like the following:

# pandas >= 1.0; on older releases use: from pandas.io.json import json_normalize
from pandas import json_normalize

data = [{'id': 'a1xy', 'name': [{'event': 'sports', 'start': '100'}, {'event': 'lunch', 'start': '121'}]},
        {'id': 'a7yz', 'name': [{'event': 'lunch', 'start': '109'}, {'event': 'movie', 'start': '97'}]},
        {'id': 'bx4y', 'name': [{'event': 'dinner', 'start': '78'}, {'event': 'sleep', 'start': '25'}]}]

# One row per element of 'name'; 'id' is carried along as metadata
df = json_normalize(data, record_path='name', meta='id', record_prefix='name.')
print(df)
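
The question starts from a DataFrame whose 'name' column holds JSON text rather than from a list of dicts, so a conversion step is needed before `json_normalize` can be applied. The following is a minimal sketch of one way to do that, assuming pandas 1.0 or newer (where `pd.json_normalize` is available at the top level) and assuming each 'name' cell is a valid JSON string. The variable names `records` and `flat` are illustrative only.

```python
import json

import pandas as pd

# Sample frame mirroring the question: 'name' holds JSON text, not parsed lists.
df = pd.DataFrame({
    "id": ["a1xy", "a7yz", "bx4y"],
    "name": [
        '[{"event": "sports", "start": "100"}, {"event": "lunch", "start": "121"}]',
        '[{"event": "lunch", "start": "109"}, {"event": "movie", "start": "97"}]',
        '[{"event": "dinner", "start": "78"}, {"event": "sleep", "start": "25"}]',
    ],
})

# Parse each JSON string, then turn every row into a dict: {'id': ..., 'name': [...]}.
records = df.assign(name=df["name"].map(json.loads)).to_dict("records")

# One output row per array element; 'id' is repeated alongside each element.
flat = pd.json_normalize(records, record_path="name", meta="id", record_prefix="name.")

# Put 'id' first to match the layout requested in the question.
flat = flat[["id", "name.event", "name.start"]]
print(flat)
```

If the 'name' column already contains parsed lists of dicts, the `json.loads` step can be dropped and `df.to_dict("records")` passed directly.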

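【Comments】:

    【Solution 2】:

    You can use a flattening list comprehension that updates each dictionary with its row's id value, then pass the result to the DataFrame constructor: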

    df['name'] = df['name'].map(json.loads)
    
    df = pd.DataFrame([dict(y, id=i) for i, x in zip(df['id'],df['name']) for y in x])
    print (df)
        event    id start
    0  sports  a1xy   100
    1   lunch  a1xy   121
    2   lunch  a7yz   109
    3   movie  a7yz    97
    4  dinner  bx4y    78
    5   sleep  bx4y    25
    

    But if the input is JSON, it is better to use json_normalize.

    Timings (addtoArray is the helper function defined in Solution 3):

    df=pd.DataFrame([
    ['a1xy',[{  "event": "sports",   "start": "100"}, {  "event": "lunch",  "start": "121" } ]],
    ['a7yz',[{  "event": "lunch",   "start": "109"},  {  "event": "movie",  "start": "97" }  ]],
    ['bx4y',[{  "event": "dinner",   "start": "78"},  {  "event": "sleep",  "start": "25" }  ]]],
    columns=['id','name']) 
    print (df)
    
    #3k rows
    df = pd.concat([df] * 1000, ignore_index=True)
    
    In [276]: %%timeit
         ...: pd.DataFrame([dict(y, id=i) for i, x in zip(df['id'],df['name']) for y in x])
    9.49 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [277]: %%timeit
         ...: finalArray=[]
         ...: df.apply(lambda x: addtoArray(x,finalArray),axis=1)
         ...: pd.DataFrame(finalArray,columns=['col1','event','start'])
         ...: 
    1.81 s ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    The list comprehension solution is roughly 190x faster (9.49 ms vs. 1.81 s per loop).

    【Comments】:
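
The timings above use IPython's `%%timeit` magic. As a rough sketch of how a similar measurement could be reproduced in a plain script, the block below uses the standard-library `timeit` module. Note that it compares the list comprehension against `json_normalize` rather than against the `apply`-based function, which is a comparison the thread itself does not report; the helper names `with_comprehension` and `with_json_normalize` are made up for this example, and `pd.json_normalize` assumes pandas 1.0 or newer. Absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

base = pd.DataFrame(
    [
        ["a1xy", [{"event": "sports", "start": "100"}, {"event": "lunch", "start": "121"}]],
        ["a7yz", [{"event": "lunch", "start": "109"}, {"event": "movie", "start": "97"}]],
        ["bx4y", [{"event": "dinner", "start": "78"}, {"event": "sleep", "start": "25"}]],
    ],
    columns=["id", "name"],
)

# 3k rows, the same size used in the timings above.
df = pd.concat([base] * 1000, ignore_index=True)


def with_comprehension(frame):
    # Copy each event dict, adding the id of the row it came from.
    return pd.DataFrame(
        [dict(y, id=i) for i, x in zip(frame["id"], frame["name"]) for y in x]
    )


def with_json_normalize(frame):
    # Convert rows to dicts, then let pandas un-nest the 'name' lists.
    return pd.json_normalize(frame.to_dict("records"), record_path="name", meta="id")


for func in (with_comprehension, with_json_normalize):
    # Best of 3 repeats, 5 calls each, reported per call.
    seconds = min(timeit.repeat(lambda: func(df), number=5, repeat=3)) / 5
    print(f"{func.__name__}: {seconds * 1000:.2f} ms per call")
```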

    • How can I "programmatically" add the elements of the 'id' column into the 'name' column? I would like to use json_normalize.
    • @Symphony - What does the JSON look like?
    • @Symphony - Like [{'id': 'a1xy', 'name': [{'event': 'sports', 'start': '100'},{'event': 'lunch', 'start': '121'}]}, {'id': 'a7yz', 'name': [{'event':'lunch', 'start': '109'},{'event': 'movie', 'start': '97'}]}, {'id': 'bx4y', 'name': [{'event': 'dinner', 'start': '78'},{'event': 'sleep', 'start': '25'}]}] ?
    • [ { "event": "sports", "start": "100"}, { "event": "lunch", "start": "121" } ]
    • [ { "id": "a1xy", "event": "sports", "start": "100"}, { "id": "a1xy", "event": "lunch", "start": "121" } ]
    【Solution 3】:

    You can also call an external helper function from within apply:

    import numpy as np
    import pandas as pd
    data=pd.DataFrame([
    ['a1xy',[{  "event": "sports",   "start": "100"}, {  "event": "lunch",  "start": "121" } ]],
    ['a7yz',[{  "event": "lunch",   "start": "109"},  {  "event": "movie",  "start": "97" }  ]],
    ['bx4y',[{  "event": "dinner",   "start": "78"},  {  "event": "sleep",  "start": "25" }  ]]],columns=['id','name']) 
    
    def addtoArray(x,finalArray):
        finalArray.extend(np.insert(pd.DataFrame(x['name']).values,0,x['id'],axis=1).tolist())
    
    finalArray=[]
    data.apply(lambda x: addtoArray(x,finalArray),axis=1)
    finalArray=pd.DataFrame(finalArray,columns=['col1','event','start'])
    print(finalArray)
    
       col1   event start
    0  a1xy  sports   100
    1  a1xy   lunch   121
    2  a7yz   lunch   109
    3  a7yz   movie    97
    4  bx4y  dinner    78
    5  bx4y   sleep    25
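
For comparison, the same reshaping can be sketched without a helper function that appends to an outer list. The block below uses `DataFrame.explode`, which was added in pandas 0.25 and so postdates this thread; it is an alternative technique, not the approach used in the answer above. The names `exploded`, `fields`, and `result` are illustrative, and the sketch assumes every list element is a dict with the same keys.

```python
import pandas as pd

data = pd.DataFrame(
    [
        ["a1xy", [{"event": "sports", "start": "100"}, {"event": "lunch", "start": "121"}]],
        ["a7yz", [{"event": "lunch", "start": "109"}, {"event": "movie", "start": "97"}]],
        ["bx4y", [{"event": "dinner", "start": "78"}, {"event": "sleep", "start": "25"}]],
    ],
    columns=["id", "name"],
)

# One row per list element; 'id' is repeated for each element of its list.
exploded = data.explode("name").reset_index(drop=True)

# Expand each dict into its own columns ('event', 'start').
fields = pd.DataFrame(exploded["name"].tolist())

# Attach the expanded columns next to 'id'; both frames share the index 0..5.
result = pd.concat([exploded[["id"]], fields], axis=1)
print(result)
```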
    

    【Comments】:
