How to use xarray to group by time and then run a bin function on the groups?

【Question title】: How to use xarray to group by time and then run a bin function on the groups?
【Posted】: 2021-05-15 15:32:20
【Question】:

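I have a multidimensional netCDF dataset of "mean direction of total swell" (mdts). The dimensions are time (hourly), latitude, and longitude. I simply want to group the hourly data by day and then, for each day and each latitude/longitude grid cell, determine which of 16 predefined direction bins contains the most hours (at most 24). The direction value associated with that bin is then assigned to the grid cell as its direction for that day. I am applying a custom function to the groupby command, and that is where the error occurs. I think I don't understand what gets passed to the function.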

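Note: each netCDF file holds a single calendar month for the years 1979-2019. I therefore use groupby rather than resample, because resample adds the other 11 months, which are not in the file. I also first set all times to 00:00 so that groupby can group by day.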

Note: my actual code loops over multiple netCDF files. I have simplified it to a single file here. My simplified code:

import numpy as np
import xarray as xr
        
ifile = 'mean_direction_total_swell_Nov_1979_2019_hourly.nc'
        
# min, max, and center values of angle direction bins
min = [348.75,  11.25,  33.75,  56.25,  78.75, 101.25, 123.75, 146.25, 168.75, 191.25, 213.75, 236.25, 258.75, 281.25, 303.75, 326.25]
max = [ 11.25,  33.75,  56.25,  78.75, 101.25, 123.75, 146.25, 168.75, 191.25, 213.75, 236.25, 258.75, 281.25, 303.75, 326.25, 348.75]
dir = [   0.0,   22.5,   45.0,   67.5,   90.0,  112.5,  135.0,  157.5,  180.0,  202.5,  225.0,  247.5,  270.0,  292.5,  315.0,  337.5]
    
# custom function that I think is causing the problem    
def bins(x):
    bins_n = np.zeros([16], dtype=int)
        
    # North bin requires 'or' statement
    if(x >= min[0] or x < max[0]): bins_n[0] = bins_n[0] + 1
        
    # other bins require 'and' statement
    for i in range(1,16,1): # bins
        if(x >= min[i] and x < max[i]):
            bins_n[i] = bins_n[i] + 1
            break
    slot = np.argmax(bins_n)
        
    return dir[slot]
    
   
idatanc = xr.open_dataset(ifile)              
idata = idatanc['mdts']                          
    
idata.coords['time'] = idata.time.dt.floor('1D') # setting all hourly values to 0000 
idata_dy = idata.groupby("time").apply(bins)

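What it returns. Note: this traceback comes from the program that loops over multiple netCDF files, so it may not correspond exactly to the code above. The error is the same.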

Traceback (most recent call last):

  File "<ipython-input-216-82adffe45690>", line 9, in <module>
    idata_dy = idata.groupby("time").apply(bins)

  File "C:\Users\TWHawk\Anaconda3\envs\tim_python36\lib\site-packages\xarray\core\groupby.py", line 815, in apply
    return self.map(func, shortcut=shortcut, args=args, **kwargs)

  File "C:\Users\TWHawk\Anaconda3\envs\tim_python36\lib\site-packages\xarray\core\groupby.py", line 800, in map
    return self._combine(applied, shortcut=shortcut)

  File "C:\Users\TWHawk\Anaconda3\envs\tim_python36\lib\site-packages\xarray\core\groupby.py", line 819, in _combine
    applied_example, applied = peek_at(applied)

  File "C:\Users\TWHawk\Anaconda3\envs\tim_python36\lib\site-packages\xarray\core\utils.py", line 183, in peek_at
    peek = next(gen)

  File "C:\Users\TWHawk\Anaconda3\envs\tim_python36\lib\site-packages\xarray\core\groupby.py", line 799, in <genexpr>
    applied = (maybe_wrap_array(arr, func(arr, *args, **kwargs)) for arr in grouped)

  File "<ipython-input-215-3d060f71ca15>", line 6, in bins
    if(x >= min[0] or x < max[0]): bins_n[0] = bins_n[0] + 1

  File "C:\Users\TWHawk\Anaconda3\envs\tim_python36\lib\site-packages\xarray\core\common.py", line 119, in __bool__
    return bool(self.values)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

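【Comments】:

We can't properly test your code without a sample of your data. However, it looks to me like the problem is the or operator. Try np.logical_or instead, e.g. np.logical_or(x >= min[0], x < max[0]). If that works for you, do the same for the and operator a few lines below.
@Ralubrusto Thanks! I'm editing my code based on your idea. In the meantime, here is a link to the actual data file if you want to take a look: dropbox.com/s/tgbry1z6jgot816/…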
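
To illustrate the np.logical_or suggestion, here is a minimal sketch of why the original comparison raises the ValueError and what an elementwise version looks like. The sample values and the names lo, hi, north_mask, and north_hours are made up for illustration. Note that np.logical_or on its own still returns an array, so it cannot go straight into an if statement; the hours in the bin have to be counted instead.

```python
import numpy as np

# Hypothetical hourly directions (degrees) for one grid cell.
x = np.array([350.0, 5.0, 10.0, 100.0, 355.0])

lo, hi = 348.75, 11.25  # edges of the North bin, which wraps through 0

try:
    if x >= lo or x < hi:  # `or` needs a single bool, but gets a 5-element array
        pass
except ValueError as err:
    message = str(err)  # the same "truth value ... is ambiguous" error

# Elementwise alternative: a boolean mask with one entry per hour.
north_mask = np.logical_or(x >= lo, x < hi)
north_hours = int(north_mask.sum())  # number of hours falling in the North bin
```

The function passed to groupby receives a whole group (all hours of one day across the grid), not a single value, which is why every comparison inside it has to be elementwise.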

Tags: python pandas-groupby python-xarray

【Solution 1】:

I haven't checked the results all the way through, but I think the code below does what you need:

import numpy as np
import xarray as xr
from scipy import stats

def func(x, axis):
    mode, count = np.apply_along_axis(stats.mode, axis, x)
    return mode.squeeze()

infile = 'mean_direction_total_swell_Nov_1979_2019_hourly.nc'

ds = xr.open_dataset(infile)

# make sure range is 0 <= x < 360
ds['mdts'] = np.mod(ds['mdts'], 360)

# bin the data in 16 directions (17 actually, North appears as the first and
# last bin)
step = 360 / 16
centers = np.r_[np.r_[0: 360: step], 0]
edges = np.r_[0, np.r_[step / 2: 360: step], 360]

ds['mdts_binned_idx'] = (ds['mdts'].dims, np.digitize(ds['mdts'], edges))

ds['mdts_binned'] = (ds['mdts'].dims, centers[ds['mdts_binned_idx'] - 1])

# apply stats.mode to get the modal (most common) value in each day
ds2 = xr.Dataset()
ds2['mdts_mode_1d'] = ds['mdts_binned'].resample(time='1D').reduce(func)
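
To make the edges/centers construction above easier to follow, here is a small NumPy-only sketch that runs the same np.digitize lookup on a handful of made-up directions. The sample array is hypothetical; centers and edges are built exactly as in the code above.

```python
import numpy as np

step = 360 / 16
centers = np.r_[np.r_[0:360:step], 0]            # 17 centers; 0 (North) is first and last
edges = np.r_[0, np.r_[step / 2:360:step], 360]  # 18 edges: 0, 11.25, ..., 348.75, 360

# Hypothetical sample directions in degrees, already wrapped into [0, 360).
sample = np.array([0.0, 11.24, 11.25, 100.0, 348.74, 348.75, 359.9])

idx = np.digitize(sample, edges)  # 1-based index i with edges[i-1] <= x < edges[i]
binned = centers[idx - 1]         # center of the bin each sample falls into
# binned is [0.0, 0.0, 22.5, 90.0, 337.5, 0.0, 0.0]
```

Both halves of the North bin (0 to 11.25 and 348.75 to 360) map to the center 0, so the result has 16 distinct directions. One thing that may be worth checking on real data: np.digitize should place NaN past the last edge, in which case centers[idx - 1] would go out of bounds, so missing values likely need to be masked first.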

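【Comments】:

Thanks! This works! I did modify my own code slightly to create 16 bins and to convert the time coordinate to a date-only format (without hours), so that I could use the groupby command as described in the original post.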
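
The groupby variant mentioned in the comment above was not posted. The following is only a sketch of what it might look like, run on synthetic data. The helper modal_direction, the synthetic array, and the arithmetic binning (used here instead of np.digitize) are all assumptions for illustration, and the sketch assumes the data contain no missing values.

```python
import numpy as np
import pandas as pd
import xarray as xr

STEP = 22.5  # width of each of the 16 direction bins, in degrees

def modal_direction(x, axis):
    """Center (degrees) of the bin holding the most values along `axis`."""
    # Shift by half a bin so North (348.75..11.25) becomes bin 0, then wrap.
    idx = np.floor(((x + STEP / 2) % 360) / STEP).astype(int)
    # Count hours per bin; the new last axis has length 16.
    counts = np.stack([(idx == k).sum(axis=axis) for k in range(16)], axis=-1)
    return counts.argmax(axis=-1) * STEP  # ties go to the lowest bin index

# Synthetic stand-in for the 'mdts' variable: 2 days of hourly data on a 2x3 grid.
time = pd.date_range("1979-11-01", periods=48, freq="h")
values = np.empty((48, 2, 3))
values[:20] = 350.0    # day 1: 20 hours in the North bin
values[20:24] = 200.0  # day 1: 4 hours elsewhere
values[24:42] = 100.0  # day 2: 18 hours in the bin centered on 90
values[42:] = 5.0      # day 2: 6 hours in the North bin
da = xr.DataArray(values, dims=("time", "latitude", "longitude"),
                  coords={"time": time}, name="mdts")

# Truncate timestamps to midnight so each day forms one group.
da = da.assign_coords(time=("time", da["time"].dt.floor("D").values))
daily = da.groupby("time").reduce(modal_direction, dim="time")
```

Here daily has one time step per calendar day present in the data (0.0 everywhere on the first day, 90.0 on the second), and no extra days are created for months missing from the file, which was the stated reason for preferring groupby over resample.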