JSON object inside Pandas DataFrame: answers

【Question title】: JSON object inside Pandas DataFrame
【Posted】: 2018-01-23 01:43:15
【Question description】:

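I have a JSON object in a pandas dataframe column that I want to split out into other columns. In the dataframe, the JSON object looks like a string containing an array of dictionaries. The array can be of variable length, including zero, and the column can even be null. I've written some code, shown below, that does what I want. The column names are built from two components: the first is a key in the dictionary, and the second is a substring of the value stored under "key" in that dictionary.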

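This code works fine, but it is very slow when run on a large dataframe. Can anyone offer a faster (and probably simpler) way to do this? Also, if you see anything that isn't sensible, efficient, or pythonic, feel free to pick holes in what I've done. I'm still a relative beginner. Thanks heaps.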

# Import libraries 
import pandas as pd
from IPython.display import display # Used to display df's nicely in jupyter notebook.
import json

# Set some display options
pd.set_option('max_colwidth',150)

# Create the example dataframe
print("Original df:")
df = pd.DataFrame.from_dict({'ColA': {0: 123, 1: 234, 2: 345, 3: 456, 4: 567},\
 'ColB': {0: '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',\
  1: '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',\
  2: '[{"key":"keyValue=4","valA":"48","valC":"58"}]',\
  3: '[]',\
  4: None}})
display(df)

# Create a temporary dataframe to append results to, record by record
dfTemp = pd.DataFrame()

# Step through all rows in the dataframe
for i in range(df.shape[0]):
    # Check whether record is null, or doesn't contain any real data
    if pd.notnull(df.iloc[i,df.columns.get_loc("ColB")]) and len(df.iloc[i,df.columns.get_loc("ColB")]) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        x = pd.read_json(df.iloc[i,df.columns.get_loc("ColB")])
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda x: x.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # Give the single record the same index number as the parent dataframe (for the merge to work)
        y.index = [df.index[i]]
        # Append this dataframe on sequentially for each row as we go through the loop
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        dfTemp = pd.concat([dfTemp, y])

# Merge the new dataframe back onto the original one as extra columns, with index matching original dataframe
df = pd.merge(df,dfTemp, how = 'left', left_index = True, right_index = True)

print("Processed df:")
display(df)

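【Question discussion】:

Just one small thing. You could replace the loop with for i, col_b in enumerate(df.iloc[:,df.columns.get_loc("ColB")]): and change the references to that entry accordingly, to improve readability.
Thanks! That certainly makes it more concise and readable.

Tags: python json pandas dataframe

【Solution 1】: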

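First, a general piece of advice about pandas: if you find yourself iterating over the rows of a dataframe, you are most likely doing it wrong.

With this in mind, we can rewrite your current procedure using the pandas apply method (this will likely speed it up to begin with, as it means far fewer index lookups on the df):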

from io import StringIO

def do_the_thing(cell):
    # Check whether record is null, or doesn't contain any real data (e.g. '[]')
    if pd.notnull(cell) and len(cell) > 2:
        # Convert the json structure into a dataframe, one cell at a time in the relevant column
        # (wrapped in StringIO because newer pandas versions deprecate literal JSON strings here)
        x = pd.read_json(StringIO(cell))
        # The last bit of this string (after the last =) will be used as a key for the column labels
        x['key'] = x['key'].apply(lambda s: s.split("=")[-1])
        # Set this new key to be the index
        y = x.set_index('key')
        # Stack the rows up via a multi-level column index
        y = y.stack().to_frame().T
        # Flatten out the multi-level column index
        y.columns = ['{1}_{0}'.format(*c) for c in y.columns]
        # We don't need to re-index or to append to a temp df:
        # apply keeps each result on the index of the row it came from
        return y.iloc[0]
    else:
        # Empty result for null or empty records (explicit dtype avoids a warning on older pandas)
        return pd.Series(dtype=object)

df2 = df.merge(df.ColB.apply(do_the_thing), how = 'left', left_index = True, right_index = True)

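Note that this returns exactly the same result as before; we haven't changed the logic. The apply method sorts out the indexes, so we are able to merge without any re-indexing.

I believe this answers your question in terms of speeding it up and being a bit more idiomatic.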
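
As a further option that is not part of the original answer, the per-row DataFrame construction can be avoided altogether by parsing each cell with the standard json module and building plain dicts. This is only a minimal sketch, assuming the same df layout as in the question; flatten_cell, extra and result are hypothetical names chosen for illustration. Unlike the read_json version, the values stay as strings, because json.loads applies no type inference.

```python
import json

import pandas as pd


def flatten_cell(cell):
    """Turn one JSON-array string into a flat dict such as {'valA_1': '8', ...}."""
    # Missing cells (None/NaN) contribute no columns.
    if not isinstance(cell, str):
        return {}
    flat = {}
    for record in json.loads(cell):
        # The text after the last "=" becomes the column suffix.
        suffix = record["key"].split("=")[-1]
        for name, value in record.items():
            if name != "key":
                flat[f"{name}_{suffix}"] = value
    return flat


# Same example data as in the question.
df = pd.DataFrame({
    "ColA": [123, 234, 345, 456, 567],
    "ColB": [
        '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',
        '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',
        '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
        "[]",
        None,
    ],
})

# Build all new columns in one go, reusing the original index so join() aligns the rows.
extra = pd.DataFrame([flatten_cell(c) for c in df["ColB"]], index=df.index)
result = df.join(extra)
```

The design choice here is to do the parsing in plain Python and hand pandas a single list of dicts, so only one DataFrame is constructed no matter how many rows there are. Whether that is actually faster on your data is something to measure rather than assume.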

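I do think you should consider what you want to do with this data structure, and how you could better structure what you are doing.

Given that ColB can be of arbitrary length, you will end up with a dataframe with an arbitrary number of columns. Whenever you come to access those values, for whatever purpose, that is going to cause you pain.

Are all the entries in ColB important? Could you get away with keeping only the first one? Do you need to know the index of a certain valA value?

These are questions you should ask yourself, then decide on a structure that lets you do whatever analysis you need without having to check a bunch of arbitrary things.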

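【Discussion】:

Thank you very much for the comprehensive reply, much appreciated! Your code is simpler, better, and easier to reuse. I implemented your suggestions and cut the execution time by 20%. Thanks for the other suggestions as well. I agree that my overall approach isn't great. One possibility is to create a new dataframe from the column, with a new column specifying the "key" value. So instead of adding a new set of columns for each key value, I would add a new set of rows. I'll try that next, if I can figure out how. :-)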
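
For what it's worth, the rows-instead-of-columns idea described in the comment above could look roughly like the sketch below. It is not from the original thread: the names long_df and source_index are made up for illustration, it parses with json.loads rather than pd.read_json, and it drops rows whose ColB is empty or null (a left merge back onto df on the index would restore them if needed).

```python
import json

import pandas as pd

# Same example data as in the question.
df = pd.DataFrame({
    "ColA": [123, 234, 345, 456, 567],
    "ColB": [
        '[{"key":"keyValue=1","valA":"8","valB":"18"},{"key":"keyValue=2","valA":"9","valB":"19"}]',
        '[{"key":"keyValue=2","valA":"28","valB":"38"},{"key":"keyValue=3","valA":"29","valB":"39"}]',
        '[{"key":"keyValue=4","valA":"48","valC":"58"}]',
        "[]",
        None,
    ],
})

rows = []
for idx, col_a, cell in zip(df.index, df["ColA"], df["ColB"]):
    # Null cells produce no rows; '[]' parses to an empty list and also produces none.
    if not isinstance(cell, str):
        continue
    for record in json.loads(cell):
        # One output row per dictionary in the JSON array.
        row = {"source_index": idx, "ColA": col_a}
        row.update(record)
        # Keep only the text after the last "=" as the key value.
        row["key"] = record["key"].split("=")[-1]
        rows.append(row)

# Fixed set of columns (key, valA, valB, valC); the number of rows varies instead.
long_df = pd.DataFrame(rows)
```

With this layout, a question such as "all valA values for key 2" becomes an ordinary filter, for example long_df[long_df["key"] == "2"], instead of a search through column names.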