【Question title】: Python - Pandas - How to drop null values from to_json after dataframe merge
【Posted】: 2018-02-21 13:57:23
【Question】:

I am building a process that outer-joins two csv files and exports the result as a JSON object.

import pandas

# read the source csv files
firstcsv = pandas.read_csv('file1.csv',  names = ['main_index','attr_one','attr_two'])
secondcsv = pandas.read_csv('file2.csv',  names = ['main_index','attr_three','attr_four'])

# merge them
output = firstcsv.merge(secondcsv, on='main_index', how='outer')

jsonresult = output.to_json(orient='records')
print(jsonresult)

Now, the two csv files look like this:

file1.csv:
1, aurelion, sol
2, lee, sin
3, cute, teemo

file2.csv:
1, midlane, mage
2, jungler, melee

I would like the resulting JSON output to look like this:

[{"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"},
{"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"},
{"main_index":3,"attr_one":"cute","attr_two":"teemo"}]

Instead, for the row with main_index = 3 I get:

{"main_index":3,"attr_one":"cute","attr_two":"teemo","attr_three":null,"attr_four":null}]

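So null values are automatically added to the output. I would like to remove them. I looked around but could not find a proper way to do it.

Hope someone can help me!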

【Comments】:

    Tags: python json csv null output


    【Solution 1】:

    Since we are working with a DataFrame, pandas "fills in" the missing values with NaN, i.e.

    >>> print(output)
          main_index   attr_one attr_two attr_three attr_four
    0           1   aurelion      sol    midlane      mage
    1           2        lee      sin    jungler     melee
    2           3       cute    teemo        NaN       NaN
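
To reproduce the frame above without files on disk, here is a minimal sketch. The io.StringIO buffers are stand-ins for file1.csv and file2.csv, and skipinitialspace=True is an addition not present in the question: it strips the space after each comma, so values come out as "aurelion" rather than " aurelion".

```python
import io
import pandas

# Inline stand-ins for file1.csv and file2.csv from the question
file1 = io.StringIO("1, aurelion, sol\n2, lee, sin\n3, cute, teemo\n")
file2 = io.StringIO("1, midlane, mage\n2, jungler, melee\n")

# skipinitialspace=True is an addition: it drops the blank after each comma
firstcsv = pandas.read_csv(file1, names=['main_index', 'attr_one', 'attr_two'],
                           skipinitialspace=True)
secondcsv = pandas.read_csv(file2, names=['main_index', 'attr_three', 'attr_four'],
                            skipinitialspace=True)

# Outer join keeps main_index 3 even though file2 has no matching row
output = firstcsv.merge(secondcsv, on='main_index', how='outer')

# True marks the cells that pandas filled with NaN
missing = output.isna()
print(missing[['attr_three', 'attr_four']])
```

Only the row with main_index 3 should be flagged as missing, and those are the cells that later show up as null in to_json.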
    

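    I don't see any option in the pandas.to_json documentation to skip null values: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html

    So the approach I came up with involves rebuilding the JSON string. This is probably not very performant for large datasets with millions of rows (but there are fewer than 200 champions in League, so it should not be a big problem!)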

    from collections import OrderedDict
    import json

    import pandas
    
    jsonresult = output.to_json(orient='records')
    # read the json string to get a list of dictionaries
    rows = json.loads(jsonresult)
    
    # new_rows = [
    #     # rebuild the dictionary for each row, only including non-null values
    #     {key: val for key, val in row.items() if pandas.notnull(val)}
    #     for row in rows
    # ]
    
    # to maintain order use Ordered Dict
    new_rows = [
        OrderedDict([
            (key, row[key]) for key in output.columns
            if (key in row) and pandas.notnull(row[key])
        ])
        for row in rows
    ]
    
    new_json_output = json.dumps(new_rows)
    

    You will find that new_json_output has dropped all keys with NaN values and kept the column order:

    >>> print(new_json_output)
    [{"main_index": 1, "attr_one": " aurelion", "attr_two": " sol", "attr_three": " midlane", "attr_four": " mage"},
     {"main_index": 2, "attr_one": " lee", "attr_two": " sin", "attr_three": " jungler", "attr_four": " melee"},
     {"main_index": 3, "attr_one": " cute", "attr_two": " teemo"}]
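
If the columns have been put into a custom order first, the same filtering can be sketched without OrderedDict. This assumes Python 3.7 or later, where plain dicts keep insertion order, and it uses reindex(columns=...) to set the order. The frame and the column order below are hypothetical stand-ins for the merged data.

```python
import json
import pandas

# Hypothetical stand-in for the merged frame from the question
output = pandas.DataFrame({
    'main_index': [1, 2, 3],
    'attr_one': ['aurelion', 'lee', 'cute'],
    'attr_two': ['sol', 'sin', 'teemo'],
    'attr_three': ['midlane', 'jungler', None],
    'attr_four': ['mage', 'melee', None],
})

# Illustrative custom column order
custom_order = ['attr_four', 'attr_three', 'main_index', 'attr_one', 'attr_two']
ordered = output.reindex(columns=custom_order)

# to_json writes keys in column order; json.loads turns JSON null into None
rows = json.loads(ordered.to_json(orient='records'))

# Keep only the keys whose value is not None; dict order follows the columns
new_rows = [
    {key: val for key, val in row.items() if val is not None}
    for row in rows
]
new_json_output = json.dumps(new_rows)
print(new_json_output)
```

Because the filter runs on the already-parsed rows, a key is dropped only where its value is missing, and the remaining keys stay in the custom order.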
    

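    【Comments】:

    • This works, but I lose the order of the elements (assuming I have specified a custom order with the reindex_axis method). I guess I need to use some OrderedDict to keep the ordering...
    • I only found it last night... but thank you very much for your help anyway!
    【Solution 2】: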

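    I was trying to achieve the same thing and found the following solution, which I think should be fairly fast (although I have not tested that). It is a bit too late to answer the original question, but it may be useful to someone.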

    import pandas as pd

    # Data
    df = pd.DataFrame([
        {"main_index":1,"attr_one":"aurelion","attr_two":"sol","attr_three":"midlane","attr_four":"mage"},
        {"main_index":2,"attr_one":"lee","attr_two":"sin","attr_three":"jungler","attr_four":"melee"},
        {"main_index":3,"attr_one":"cute","attr_two":"teemo"}
    ])
    

    This gives a DataFrame with missing values.

    >>> print(df)
      attr_four  attr_one attr_three attr_two  main_index
    0      mage  aurelion    midlane      sol           1
    1     melee       lee    jungler      sin           2
    2       NaN      cute        NaN    teemo           3
    

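    To convert it to JSON, you can filter out the nulls and apply to_json() to each column of the transposed DataFrame (that is, each row of the original). Then join the resulting JSON strings with commas and wrap them in square brackets.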

    # To json    
    json_df = df.T.apply(lambda row: row[~row.isnull()].to_json())
    json_wrapped = "[%s]" % ",".join(json_df)
    

    Then

    >>> print(json_wrapped)
    [{"attr_four":"mage","attr_one":"aurelion","attr_three":"midlane","attr_two":"sol","main_index":1},{"attr_four":"melee","attr_one":"lee","attr_three":"jungler","attr_two":"sin","main_index":2},{"attr_one":"cute","attr_two":"teemo","main_index":3}]
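
A variant of the same idea, as a sketch only: apply along axis=1 so the transpose is not needed, and use dropna() in place of the boolean mask. The explicit columns= list is an assumption added here to pin the key order.

```python
import json
import pandas as pd

# Same data as above; columns= fixes the key order explicitly
df = pd.DataFrame(
    [
        {"main_index": 1, "attr_one": "aurelion", "attr_two": "sol",
         "attr_three": "midlane", "attr_four": "mage"},
        {"main_index": 2, "attr_one": "lee", "attr_two": "sin",
         "attr_three": "jungler", "attr_four": "melee"},
        {"main_index": 3, "attr_one": "cute", "attr_two": "teemo"},
    ],
    columns=["main_index", "attr_one", "attr_two", "attr_three", "attr_four"],
)

# axis=1 passes each row as a Series; dropna() removes its NaN entries
json_rows = df.apply(lambda row: row.dropna().to_json(), axis=1)

# Join the per-row JSON objects into one JSON array
json_wrapped = "[%s]" % ",".join(json_rows)

# Parse it back to confirm the string is valid JSON
parsed = json.loads(json_wrapped)
print(json_wrapped)
```

Parsing the result back with json.loads is a cheap check that the hand-assembled string is still valid JSON.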
    

    【Comments】:
