【Question title】: How to use apply_ufunc with numpy.digitize for each image along the time dimension of an xarray.DataArray?
【Posted】: 2019-12-16 13:47:47
【Question】:

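For clarity, I have substantially rewritten my earlier question. Following Ryan's suggestion on a separate channel, numpy.digitize looks like the right tool for my goal.

I have an xarray.DataArray with dimensions x, y, and time. I am trying to work out which values to pass to the input_core_dims and output_core_dims arguments of apply_ufunc so that numpy.digitize is applied to each image in the time series.

Intuitively, I want the output dimensions to be ['time', 'x', 'y']. I think the input core dimensions should be x and y, since I want to broadcast numpy.digitize along the time dimension. However, this does not work. I do get the correct result when I apply numpy.digitize to the first numpy array in my time series: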

[84]

blues
<xarray.DataArray 'reflectance' (time: 44, y: 1082, x: 1084)>
dask.array<shape=(44, 1082, 1084), dtype=uint16, chunksize=(44, 1082, 1084)>
Coordinates:
    band     int64 1
  * y        (y) float64 9.705e+05 9.705e+05 9.705e+05 ... 9.673e+05 9.672e+05
  * x        (x) float64 4.889e+05 4.889e+05 4.889e+05 ... 4.922e+05 4.922e+05
  * time     (time) datetime64[ns] 2018-10-12 2018-10-16 ... 2019-05-26
Attributes:
    transform:   (3.0, 0.0, 488907.0, 0.0, -3.0, 970494.0)
    crs:         +init=epsg:32630
    res:         (3.0, 3.0)
    is_tiled:    1
    nodatavals:  (1.0, 1.0, 1.0, 1.0)
    scales:      (1.0, 1.0, 1.0, 1.0)
    offsets:     (0.0, 0.0, 0.0, 0.0)

[79]
#correct result
np.digitize(np.array(blues[0]), bin_arr)
array([[14, 15, 15, ..., 16, 17, 16],
       [14, 13, 14, ..., 16, 16, 15],
       [15, 14, 15, ..., 16, 16, 15],
       ...,
       [16, 18, 18, ..., 15, 16, 15],
       [17, 18, 18, ..., 16, 17, 16],
       [17, 17, 17, ..., 17, 18, 17]])

But my understanding of apply_ufunc is incorrect. Changing input_core_dims to [['x','y']] or [['time']] does not produce the correct digitized result:

bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['time']], dask="parallelized", output_dtypes=[blues.dtype])

#wrong values, correct shape
np.array(result)[0]

array([[14, 16, 15, ..., 48, 18, 15],
       [15, 16, 16, ..., 49, 18, 15],
       [15, 16, 16, ..., 49, 18, 14],
       ...,
       [16, 21, 17, ..., 50, 19, 15],
       [17, 21, 17, ..., 50, 19, 16],
       [16, 21, 18, ..., 50, 20, 17]])
bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['x','y']], dask="parallelized", output_dtypes=[blues.dtype])


#wrong values, correct shape
np.array(result)[0]

array([[14, 14, 15, ..., 16, 17, 17],
       [15, 13, 14, ..., 18, 18, 17],
       [15, 14, 15, ..., 18, 18, 17],
       ...,
       [16, 16, 16, ..., 15, 16, 17],
       [17, 16, 16, ..., 16, 17, 18],
       [16, 15, 15, ..., 15, 16, 17]])

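Each of these results has the correct shape but wrong values, which suggests that digitize is being applied along the wrong axis and the result is then reshaped to the shape of the input.

It is also odd that the result of apply_ufunc drops the input core dimension when displayed as an xarray object. Internally, though, the dimension is still there when you convert the result to a numpy array: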

[85]

result
<xarray.DataArray 'reflectance' (y: 1082, x: 1084)>
dask.array<shape=(1082, 1084), dtype=uint16, chunksize=(1082, 1084)>
Coordinates:
    band     int64 1
  * y        (y) float64 9.705e+05 9.705e+05 9.705e+05 ... 9.673e+05 9.672e+05
  * x        (x) float64 4.889e+05 4.889e+05 4.889e+05 ... 4.922e+05 4.922e+05

[87]
# the shape of the xarray and numpy array do not match after apply_ufunc
np.array(result).shape
(1082, 1084, 44) 

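In addition, when I try to fix this by setting output_core_dims to [['time', 'x', 'y']], I get an error. It looks like a dimension cannot be both an input core dimension and an output core dimension: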

[67]

bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['time']], output_core_dims=[['time','x','y']], dask="parallelized", output_dtypes=[blues.dtype])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in 
      5 bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
      6 blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
----> 7 result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['time']], output_core_dims=[['time','x','y']], dask="parallelized", output_dtypes=[blues.dtype])

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, *args)
    967                                      join=join,
    968                                      exclude_dims=exclude_dims,
--> 969                                      keep_attrs=keep_attrs)
    970     elif any(isinstance(a, Variable) for a in args):
    971         return variables_vfunc(*args)

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in apply_dataarray_vfunc(func, signature, join, exclude_dims, keep_attrs, *args)
    215 
    216     data_vars = [getattr(a, 'variable', a) for a in args]
--> 217     result_var = func(*data_vars)
    218 
    219     if signature.num_outputs > 1:

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, output_sizes, keep_attrs, *args)
    539                   if isinstance(arg, Variable)
    540                   else arg
--> 541                   for arg, core_dims in zip(args, signature.input_core_dims)]
    542 
    543     if any(isinstance(array, dask_array_type) for array in input_data):

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in (.0)
    539                   if isinstance(arg, Variable)
    540                   else arg
--> 541                   for arg, core_dims in zip(args, signature.input_core_dims)]
    542 
    543     if any(isinstance(array, dask_array_type) for array in input_data):

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in broadcast_compat_data(variable, broadcast_dims, core_dims)
    493                          'dimensions %r on an input variable: these are core '
    494                          'dimensions on other input or output variables'
--> 495                          % unexpected_dims)
    496 
    497     # for consistency with numpy, keep broadcast dimensions to the left

ValueError: operand to apply_ufunc encountered unexpected dimensions ['y', 'x'] on an input variable: these are core dimensions on other input or output variables

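Any help is greatly appreciated. I would like to understand how I am misusing the input_core_dims and output_core_dims arguments.

【Comments】:

  • Could you provide a toy example with fake data that we can copy and paste into a notebook and experiment with? Please also provide the expected output.

Tags: image-processing time-series python-xarray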


【Solution 1】:

You want to apply digitize pointwise. This is the simplest use case for apply_ufunc. No special arguments are needed.
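As a rough illustration of why core dimensions get in the way here: apply_ufunc moves any dimension listed in input_core_dims to the last axis before calling the function, which matches the (1082, 1084, 44) shape seen in the question. The sketch below uses small random data and a hypothetical helper, digitize_and_record, that only records the shape it receives. Listing the same dimension in output_core_dims keeps it on the result, and transpose restores the original order.

```python
import numpy as np
import xarray as xr

# Small stand-in for the image time series (sizes are arbitrary).
data = xr.DataArray(np.random.randn(4, 3, 2), dims=['time', 'y', 'x'])
bins = np.linspace(-4, 4, 50)

seen_shapes = []

def digitize_and_record(arr):
    # Hypothetical helper: remember the shape apply_ufunc hands over.
    seen_shapes.append(arr.shape)
    return np.digitize(arr, bins)

out = xr.apply_ufunc(digitize_and_record, data,
                     input_core_dims=[['time']],
                     output_core_dims=[['time']])

print(seen_shapes)  # time has been moved to the last axis
print(out.dims)

# Put the dimensions back in their original order.
restored = out.transpose('time', 'y', 'x')
```

Because digitize is elementwise, the values are the same either way; only the axis order changes, so leaving out the core-dimension arguments is the simpler route.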

NumPy version

import numpy as np
import xarray as xr

ny, nx = 100, 100
nt = 44
data = xr.DataArray(np.random.randn(nt,ny,nx),
                        dims=['time', 'y', 'x'],
                        name='blue reflectance')

rmin, rmax, nbins = -4, 4, 50
bins = np.linspace(rmin, rmax, nbins)

data_digitized = xr.apply_ufunc(np.digitize, data, bins)

This returns a DataArray like this:

<xarray.DataArray 'blue reflectance' (time: 44, y: 100, x: 100)>
array([[[34, 17, ..., 27, 15],
         ....
        [21, 24, ..., 23, 29]]])
Dimensions without coordinates: time, y, x

The values are bin indices, following the convention described in the numpy.digitize documentation.
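A minimal sketch of that convention, using made-up edges and small random data rather than the reflectance values from the question: with the default right=False, index i means bins[i-1] <= x < bins[i], values below the first edge map to 0, and values at or above the last edge map to len(bins). The second half checks that the pointwise apply_ufunc call agrees with digitizing a single image directly.

```python
import numpy as np
import xarray as xr

# Illustrative edges only; not taken from the question.
edges = np.array([0.0, 1.0, 2.0])
indices = np.digitize([-1.0, 0.5, 1.5, 3.0], edges)
print(indices)  # [0 1 2 3]

# Pointwise application over a small (time, y, x) array.
data = xr.DataArray(np.random.randn(5, 4, 3), dims=['time', 'y', 'x'])
bins = np.linspace(-4, 4, 50)
digitized = xr.apply_ufunc(np.digitize, data, bins)

# Digitizing the first image on its own gives the same values.
first_image = np.digitize(data.values[0], bins)
print(np.array_equal(digitized.values[0], first_image))
```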

Dask version

To make this operate lazily on dask arrays, you have two options:

# create chunked dask version of data
data_chunked = data.chunk({'time': 1})

# use dask's version of digitize
import dask.array as da
xr.apply_ufunc(da.digitize, data_chunked, bins, dask='allowed')

# use xarray's built-in `parallelized` option on the numpy function
# (I needed to define a wrapper function to make this work,
# but I don't fully understand why.)
def wrap_digitize(data):
    return np.digitize(data, bins)
xr.apply_ufunc(wrap_digitize, data_chunked,
               dask='parallelized', output_dtypes=['i8'])
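The wrapper above closes over bins. Another way to bind it, sketched here on plain NumPy-backed data, is the kwargs argument of apply_ufunc, which forwards keyword arguments to the function. Whether this removes the need for the wrapper under dask='parallelized' on a given xarray version is an assumption not verified here.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(np.random.randn(6, 5, 4), dims=['time', 'y', 'x'])
bins = np.linspace(-4, 4, 50)

# Pass bins as a keyword argument instead of a positional operand.
via_kwargs = xr.apply_ufunc(np.digitize, data, kwargs={'bins': bins})

# Same computation with bins as a positional operand, for comparison.
via_positional = xr.apply_ufunc(np.digitize, data, bins)

print(via_kwargs.equals(via_positional))
```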

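【Comments】:

    【Solution 2】:

    This solution no longer applies given the way the question has been edited!

    You may want to consider the new xhistogram package.

    Xhistogram makes it easier to calculate flexible, complex histograms with multi-dimensional data. It integrates (optionally) with Dask, in order to scale up to very large datasets, and with Xarray, in order to consume and produce labelled, annotated data structures. It is useful for a wide range of scientific tasks.

    It was designed to solve the exact problem you are facing.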

    from xhistogram.xarray import histogram 
    import numpy as np
    import xarray as xr
    
    # create example image timeseries
    ny, nx = 100, 100
    nt = 44
    data_arr = xr.DataArray(np.random.randn(nt,ny,nx),
                            dims=['time', 'y', 'x'],
                            name='blue reflectance')
    
    # calculate histogram over spatial dimensions
    rmin, rmax, nbins = -4, 4, 50
    bin_arr = np.linspace(rmin, rmax, nbins)
    histogram(data_arr, bins=[bin_arr], dim=['x','y'])
    

    The output looks like this:

    <xarray.DataArray 'histogram_blue reflectance' (time: 44, blue reflectance_bin: 49)>
    array([[0, 0, 3, ..., 1, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 3, 0, 0],
           ...,
           [0, 0, 1, ..., 1, 0, 0],
           [0, 1, 3, ..., 0, 1, 1],
           [0, 0, 3, ..., 2, 0, 1]])
    Coordinates:
      * blue reflectance_bin  (blue reflectance_bin) float64 -3.918 -3.755 ... 3.918
    Dimensions without coordinates: time
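For a sense of what that (time, bin) result contains, here is a NumPy-only sketch that builds the same kind of per-image histogram by looping over the time axis with np.histogram. It is a cross-check under the same made-up data shape, not how xhistogram is implemented.

```python
import numpy as np

nt, ny, nx = 44, 100, 100
images = np.random.randn(nt, ny, nx)

# 50 edges define 49 bins, matching the size of the bin dimension above.
bin_edges = np.linspace(-4, 4, 50)

# One histogram per image, stacked along a leading time axis.
counts = np.stack([np.histogram(img, bins=bin_edges)[0] for img in images])

print(counts.shape)  # (44, 49)
```

Each row sums to at most ny * nx, because values outside the range of bin_edges are not counted.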
    

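    【Comments】:

    • Thanks for the suggestion. I tried it but ran into two issues that I am not sure how to resolve: github.com/xgcm/xhistogram/issues If people have suggestions for how to accomplish this with apply_ufunc, I would still be interested to hear them. I have many custom functions that take 3D or 4D ndarrays as input and output, and I would like to incorporate them into a workflow using xarray.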