How to move a Pandas multi-index DataFrame to an Xarray DataArray

【Question title】: How to move Pandas multi-index dataframe to Xarray DataArray
【Posted】: 2019-08-10 14:20:03
【Question】:

I am importing a CSV file into a Pandas DataFrame. The CSV file looks like:

Time,    Status, Variable, freq_1, freq_2, freq_3, .....
1/1/2000,  Hi,      A,      0.1,    3.3,    8.1, ....
1/1/2000,  Hi,      B,      2.4,    1.2,    1.3, ....
1/1/2000,  Lo,      A,      4.5,    6.9,    6.4, ....
1/1/2000,  Lo,      B,      7.1,    8.8,    2.3, ....
2/1/2000,  Hi,      A,      0.1,    3.3,    8.1, ....
2/1/2000,  Hi,      B,      2.4,    1.2,    1.3, ....
2/1/2000,  Lo,      A,      4.5,    6.9,    6.4, ....
2/1/2000,  Lo,      B,      7.1,    8.8,    2.3, ....
....

I read it into a DataFrame with a multi-index, using Time, Status and Variable as the index levels.

I now want to import the DataFrame into Xarray using either Pandas to_xarray or Xarray from_dataframe. However, both methods seem to choke on the index and raise this error:

TypeError: Could not convert tuple of form (dims, data[, attrs, encoding]): (0, DatetimeIndex(['2018-01-12 00:15:00', '2018-01-12 00:45:00',
               '2018-01-12 01:15:00', '2018-01-12 01:45:00',
               '2018-01-12 02:15:00', '2018-01-12 02:45:00',
               '2018-01-12 03:15:00', '2018-01-12 03:45:00',
               '2018-01-12 04:15:00', '2018-01-12 04:45:00',
               ...
               '2019-11-01 16:15:00', '2019-11-01 17:15:00',
               '2019-11-01 17:45:00', '2019-11-01 18:15:00',
               '2019-11-01 18:45:00', '2019-11-01 19:15:00',
               '2019-11-01 20:45:00', '2019-11-01 21:15:00',
               '2019-11-01 21:45:00', '2019-11-01 22:15:00'],
              dtype='datetime64[ns]', name=0, length=3100, freq=None)) to Variable.

I have also tried using the Xarray.DataArray constructor:

Mytime = PD.to_datetime(df[0],infer_datetime_format=True)
Myfreq = np.array([ 0,1,2,3...39])
OutDataArray = Xarray.DataArray(df[np.arange(3,43)], coords=[('time', Mytime ), ('freq', Myfreq ), ('Status', df[1]), ('variable', df[2])])

But this gives the error:

ValueError: coords is not dict-like, but it has 4 items, which does not match the 2 dimensions of the data

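So, if the DataFrame is two-dimensional but one of its dimensions (the rows) actually consists of several dimensions specified by the DataFrame's multi-index, how do I import the Pandas DataFrame into Xarray?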

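As requested, here is a sample script that reproduces the problem. Note that you need to set a filename for the CSV file that holds the sample data: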

import numpy as np
import pandas as PD

# create some data
dt = PD.date_range(start='01/01/2000 00:00:00', end='02/01/2000 00:00:00', freq='30T')
ldt = len(dt)
# np.str was an alias for the builtin str and was removed in NumPy 1.24
vr1 = PD.Series(np.empty(ldt, dtype=str))
vr2 = PD.Series(np.empty(ldt, dtype=str))
vr3 = PD.Series(np.empty(ldt, dtype=str))
vr1.values[:] = 'apple'
vr2.values[:] = 'orange'
vr3.values[:] = 'peach'

# combine the data to mimic my file format
a = PD.Series([1,2,3,4], index=[7,2,8,9])
b = PD.Series([5,6,7,8], index=[7,2,8,9])
df1 = PD.DataFrame({'Time': dt,'Fruit':vr1, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df2 = PD.DataFrame({'Time': dt,'Fruit':vr2, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df3 = PD.DataFrame({'Time': dt,'Fruit':vr3, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df_unsorted = PD.concat([df1, df2, df3])
df = df_unsorted.sort_values(by=['Time', 'Fruit'])

# write the data to a csv file
filename = 'Give a file path/name here'
df.to_csv(filename, index=False)

#import the csv file and convert to an xarray
df2 = PD.read_csv(filename,  sep=',', skiprows=1, header=None, skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
da = df2.to_xarray()

【Comments on the question】:

Can you provide something reproducible? to_xarray usually works, so I think more detail is needed.

Tags: python pandas dataframe python-xarray

【Solution 1】:

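The error appears to come from the columns and index levels being left unnamed in the DataFrame built from the CSV file. Replace the last two lines of your code example with: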

df2 = PD.read_csv(filename,  sep=',', skiprows=1, header=None, skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
df2.columns = ['N1', 'N2', 'N3']
df2.index.names = ['time', 'fruit']
ds = df2.to_xarray()

This results in a successful conversion to an xarray Dataset.

print(ds)

<xarray.Dataset>
Dimensions:  (fruit: 3, time: 1489)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T00:30:00 ... 2000-02-01
  * fruit    (fruit) object 'apple' 'orange' 'peach'
Data variables:
    N1       (time, fruit) float64 0.114 0.3726 0.5072 ... 0.2065 0.9082 0.7945
    N2       (time, fruit) float64 0.7534 0.1107 0.8866 ... 0.4509 0.5218 0.1472
    N3       (time, fruit) float64 0.156 0.6498 0.3521 ... 0.3742 0.5899 0.607
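The same fix can be tried without the CSV round trip. The following is a minimal sketch, assuming a small made-up frame (2 times by 3 fruits) in place of the real file, and assuming xarray is installed. Because the question title asks for a DataArray rather than a Dataset, the last step uses Dataset.to_array to stack the data variables along a new dimension; the dimension name "column" is chosen here only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data: 2 times x 3 fruits, 3 value columns.
times = pd.date_range("2000-01-01", periods=2, freq="D")
fruits = ["apple", "orange", "peach"]
index = pd.MultiIndex.from_product([times, fruits])  # index levels start out unnamed

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((len(index), 3)), index=index)  # integer column labels 0, 1, 2

# Name the index levels and the columns before converting, as in the answer.
df.index.names = ["time", "fruit"]
df.columns = ["N1", "N2", "N3"]

# Each named index level becomes a dimension; each column becomes a data variable.
ds = df.to_xarray()

# Stack N1, N2, N3 along a new leading dimension to get a single DataArray.
# "column" is an illustrative name, not something from the original post.
da = ds.to_array(dim="column")
print(da.dims, da.shape)
```

Whether a single DataArray or a Dataset is the better container depends on whether the value columns are really samples along one axis (as the freq_* columns in the question suggest) or unrelated quantities.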

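Update: You can skip setting the column and index names manually by removing the skiprows=1 and header=None arguments from PD.read_csv(), so that the column names are taken from the CSV header. Your last two lines would then look like: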

df2 = PD.read_csv(filename,  sep=',', skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
ds = df2.to_xarray()
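The header-based variant can be checked on a few in-memory rows. This is a sketch under stated assumptions: the CSV text below is made up to mimic the question's file, io.StringIO stands in for the file path, and parse_dates names the Time column explicitly instead of using parse_dates=True as in the answer.

```python
import io
import pandas as pd

# Made-up sample in the same layout as the question's file, header row included.
csv_text = """Time,Fruit,N1,N2,N3
2000-01-01 00:00:00,apple,0.1,0.2,0.3
2000-01-01 00:00:00,orange,0.4,0.5,0.6
2000-01-01 00:30:00,apple,0.7,0.8,0.9
2000-01-01 00:30:00,orange,1.0,1.1,1.2
"""

# With the header row kept, pandas names both the index levels and the columns.
df = pd.read_csv(
    io.StringIO(csv_text),
    skipinitialspace=True,
    index_col=["Time", "Fruit"],
    parse_dates=["Time"],  # ask pandas to parse only the Time level as dates
)

ds = df.to_xarray()
print(ds)
```

Note that the dimension names then come straight from the header, so they are "Time" and "Fruit" here rather than the lowercase names set by hand in the first version of the answer.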

【Discussion】:

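So Xarray can't handle a Pandas DataFrame with default column labels (i.e. [0,1,2,3,...]) from a CSV file with no header row?
It looks that way. But you can make this simpler and avoid setting the column and index names manually by using the header from the CSV directly. I have updated my answer.
Unfortunately, my CSV file's header is not suitable for naming the columns.
@Dan If we use this approach, how can we customize an xarray.Dataset built from the DataFrame? Say the coordinates include additional variables that are not in dims, and the data variable N1 has only time and no fruit.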
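For the case raised in the comments where the CSV header exists but is not suitable for naming, one option is to skip the header row and supply names at read time through the names argument of read_csv. This is only a sketch: the CSV text and the replacement names are invented for illustration, and io.StringIO stands in for the real file.

```python
import io
import pandas as pd

# Made-up file whose header row is not usable as column names.
csv_text = """col 1 (raw),col 2 (raw),a?,b?,c?
2000-01-01 00:00:00,apple,0.1,0.2,0.3
2000-01-01 00:00:00,orange,0.4,0.5,0.6
2000-01-01 00:30:00,apple,0.7,0.8,0.9
2000-01-01 00:30:00,orange,1.0,1.1,1.2
"""

# Illustrative replacement names; pick whatever suits the real data.
names = ["time", "fruit", "N1", "N2", "N3"]

df = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=1,        # drop the unusable header row
    header=None,       # the remaining rows are all data
    names=names,       # supply the names directly
    index_col=["time", "fruit"],
    parse_dates=["time"],
)

ds = df.to_xarray()
print(ds)
```

This has the same effect as assigning df2.columns and df2.index.names after reading, but keeps the naming in one place.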