【Question title】: Extracting dict to dataframe from dataframe column containing paths
【Posted】: 2020-07-01 23:04:37
【Question】:

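I am trying to automate the formatting of a large number of JSON files collected from sensors. I created an initial dataframe that holds the path information for each file along with the labels for the sensor data. I want to loop over each JSON file, extract the sensor readings into a dataframe, and then join the result back to the original dataframe. The data is available at https://github.com/MJLongstreth/stackoverflow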

Here is what I have so far.

# Import necessary packages
import os
import pandas as pd
import json

data_files = []
for dirpath, subdirs, files in os.walk('.'):
    for x in files:
        if x.endswith(".json"):
            data_files.append(os.path.join(dirpath, x))

# Delete variable no longer needed    
del dirpath, files, x, subdirs

# Read file paths into a dataframe
df = pd.DataFrame(data_files)

# Rename column to path
df.columns = ['path']

# Split path to extract labels, sensor type, date, filename and then join file path
df = pd.DataFrame(df.apply(lambda x: x.str.split('/'))['path'].to_list(),
                  columns=['delete', 'folder', 'label', 'sensor_type',
                           'collection_date', 'file']
                  ).join(df).drop(['delete', 'folder'], axis=1)

# Initialize empty list to store data from json files                                                                                                   
data = []

# Loop over data files paths and add json file dictionary to list
for file in data_files:
    x = pd.read_json(file,
                     lines=True)
    data.append(x)

# Add data to dataframe
df['data'] = data

# Delete variable no longer needed 
del data, data_files, x, file

# Split DF into dataframes by sensor type
acc_data = df[df['sensor_type'] == 'acc']
gyro_data = df[df['sensor_type'] == 'gyro']
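
The `split('/')` step above assumes POSIX-style separators, so it would not split the paths that `os.walk` produces on Windows. A minimal sketch of a separator-independent alternative using `pathlib` follows. The helper name `labels_from_path` and the example paths are hypothetical, and the `label/sensor_type/collection_date/file` layout is assumed from the column names used above.

```python
from pathlib import PurePosixPath, PureWindowsPath

import pandas as pd

# Column names mirror the ones used in the question; the directory layout
# (.../label/sensor_type/collection_date/file) is assumed from them.
LABEL_COLUMNS = ["label", "sensor_type", "collection_date", "file"]


def labels_from_path(path):
    """Return the last four path components as a dict of labels."""
    parts = path.parts[-4:]
    return dict(zip(LABEL_COLUMNS, parts), path=str(path))


# Hypothetical example paths; both separator styles yield the same labels.
paths = [
    PurePosixPath("./data/walking/acc/2020-06-30/file_1.json"),
    PureWindowsPath(r".\data\walking\gyro\2020-06-30\file_2.json"),
]
labels_df = pd.DataFrame([labels_from_path(p) for p in paths])
```

With real files, `Path('.').rglob('*.json')` yields path objects that could be passed to the same helper.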

This is what I want to do from there, but it only handles one of the JSON files:

# Unpack first level of dictionary
df_1 = acc_data['data'].iloc[0].apply(pd.Series)

temp_1 = []

for index, row in df_1.iterrows():
    temp_1.append(row.apply(pd.Series))
    
temp_2 = []

for i in temp_1:
    for index, row in i.iterrows():
        #row = row.drop('Timestamp')
        row = row.apply(pd.Series)
        temp_2.append(row)
    
temp_3 = []
    
for i in temp_2:
    y = i.stack().apply(pd.Series).mean()
    temp_3.append(y)
    
temp_4 = []

for i in temp_3:
    x = pd.DataFrame(i).transpose()
    temp_4.append(x)
    
empty_df = pd.DataFrame()

for i in temp_4:
    empty_df = empty_df.append(i, ignore_index=True)

I started trying to combine my for loops, but the following froze my computer:

test = acc_data['data'].to_list()

temp = []
temp_2 = []
temp_3 = []
temp_4 = []

for i in test:    
    for index, row in i.iterrows():
        temp.append(row.apply(pd.Series))
        for i in temp:
            for index, row in i.iterrows():
                #row = row.drop('Timestamp')
                row = row.apply(pd.Series)
                temp_2.append(row)
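
A likely explanation for the freeze, offered as a sketch with plain counters rather than the real sensor data: because the `for i in temp:` loop is nested inside the row loop, every new row triggers a rescan of everything already in `temp`, so the amount of work grows roughly with the square of the row count. The inner loops also reuse the names `i`, `index`, and `row`. The function names below are hypothetical.

```python
def count_appends_nested(n_rows):
    """Mimic the combined loop: rescan all of `temp` after every append."""
    temp, appended = [], 0
    for row in range(n_rows):
        temp.append(row)
        for _ in temp:  # revisits every earlier entry again
            appended += 1
    return appended


def count_appends_sequential(n_rows):
    """Mimic the two-pass version: each entry of `temp` is visited once."""
    temp = list(range(n_rows))
    return sum(1 for _ in temp)


nested = count_appends_nested(1000)          # 500500 inner iterations
sequential = count_appends_sequential(1000)  # 1000 inner iterations
```

Dedenting the second loop so that it runs once after `temp` is complete, as in the single-file version, would keep the work linear.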

Any suggestions on a more efficient way to accomplish what I am trying to do would be greatly appreciated. Thank you.

【Discussion】:

    Tags: python json pandas dataframe dictionary


    【Solution 1】:

    I was able to find a solution to the problem above. Posting the code here in case it is useful to others.

    # Import necessary packages
    import os
    import pandas as pd
    import json
    import sys
    import timeit
    
    # Start timer to evaluate script efficiency
    start = timeit.default_timer()
    
    # Initialize empty list to store json file paths
    data_files = []
    
    # Search working directory for json files and append path to data files list
    for dirpath, subdirs, files in os.walk('.'):
        for x in files:
            if x.endswith(".json"):
                data_files.append(os.path.join(dirpath, x))
        
    # Delete variable no longer needed           
    del dirpath, files, subdirs, x
    
    # Loop over the files and extract the dictionary contents of each one
    # into a dataframe
    for i in range(len(data_files)):
    
        # Each json file holds one dictionary per line; read them into a list
        with open(data_files[i], 'r') as f:
            data = [json.loads(line) for line in f]
    
        # Take the last top-level key from the first record of the file
        item = list(data[0].keys())[-1]
    
        # Retrieve the dictionary stored under that key in every record
        x = [record[item] for record in data]
    
        # Take the last key of the nested dictionary; it is used below as
        # the key of the array data
        item = list(x[0].keys())[-1]
        
        # Collect one dataframe per record: the array data is expanded into
        # rows and every row keeps the record's 'Timestamp'
        frames = []
        for z in x:
            temp_df = pd.DataFrame(z[item])
            temp_df['Timestamp'] = z['Timestamp']
            frames.append(temp_df)
    
        # Concatenate once (DataFrame.append was removed in pandas 2.0)
        df = pd.concat(frames, ignore_index=True)
        
        # Create column in dataframe indicating the source file
        df['source'] = data_files[i]
        
        # Create file name for export from original file name, replacing
        # .json with .csv
        file_name = os.path.basename(data_files[i]).replace('.json', '.csv')
    
        # Export the dataframe as a csv, creating the output folder if needed
        os.makedirs('./model_data', exist_ok=True)
        df.to_csv(os.path.join('./model_data', file_name))
        
    # End timer
    stop = timeit.default_timer()
    
    # Calculate total time
    total_time = stop - start
    
    # Output running time in a nice format.
    mins, secs = divmod(total_time, 60)
    hours, mins = divmod(mins, 60)
    
    sys.stdout.write("Total running time: %d:%02d:%02d.\n" % (hours, mins, secs))
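
The per-file step of the script above can also be wrapped in a function, which makes it easier to test on a small file before running it over the whole directory. The following is only a sketch under stated assumptions: the function name `flatten_sensor_file`, the sample records, and the key names `acc` and `data` are hypothetical, and the record layout (one top-level key whose value holds a `Timestamp` plus one array of readings) is inferred from the script rather than confirmed against the real files.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def flatten_sensor_file(path):
    """Flatten one JSON-lines sensor file into a single DataFrame.

    Assumed record layout (hypothetical, inferred from the script above):
    {"<sensor>": {"Timestamp": <number>, "<readings>": [[...], ...]}}
    """
    frames = []
    with open(path, "r") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Each record is assumed to hold a single top-level key.
            payload = next(iter(record.values()))
            # The first key other than 'Timestamp' is treated as the readings.
            readings_key = next(k for k in payload if k != "Timestamp")
            frame = pd.DataFrame(payload[readings_key])
            frame["Timestamp"] = payload["Timestamp"]
            frames.append(frame)
    # Concatenate once instead of growing a DataFrame inside the loop.
    result = pd.concat(frames, ignore_index=True)
    result["source"] = str(path)
    return result


# Tiny made-up file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json"
    sample.write_text(
        '{"acc": {"Timestamp": 1, "data": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}}\n'
        '{"acc": {"Timestamp": 2, "data": [[0.7, 0.8, 0.9]]}}\n'
    )
    flat = flatten_sensor_file(sample)
```

Collecting the per-record frames in a list and calling `pd.concat` once avoids copying the accumulated data on every iteration, which is what repeated appends to a DataFrame do.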
    

    【Discussion】:
