Python: Netcdf: Is there a method to get the overall average from one variable where another variable overlaps with a unique value?

【Posted】: 2020-07-12 11:42:20
【Question】:

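I have a netCDF file containing a 3D int32 variable named tag (shape: time, lat, lon) and a 3D float64 variable named p (shape: time, lat, lon). Both variables have the same shape. The tag values are integers that start at 0 and increase monotonically up to an unknown maximum. The 0 values are not needed, so I want an overall (space and time) average of p for each tag value from 1 to the maximum tag value n.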

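Example (array positions given as (time, lat, lon)): The first integer tag value is 1. It occurs at, say, (0,45,45) and (1,45,46). The p values at these tag = 1 positions are 2 and 4, so the average should be 3. The next integer tag value is 2. It occurs at, say, (2,100,99), (2,101,99), and (3,101,98), where the p values are 3, 8, and 1, so the average should be 4. The last integer value is n. It occurs at (360,200,100), (361,200,100), (361,201,100), and (361,202,100), where the p values are 1, 1, 5, and 9, so the average should be 4. Written to a text file, the results should look like this: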

3
4
.
.
4
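
To pin down what "average per tag" means, the example above can be spelled out as a brute-force loop on a toy array. This is only a sketch: the (4, 3, 3) shape, the cell positions, and the helper name `mean_per_tag_bruteforce` are made up for illustration, and looping over every tag value would likely be too slow for the real data.

```python
import numpy as np

# Toy stand-ins for tag and p with a hypothetical (time, lat, lon) shape.
tag = np.zeros((4, 3, 3), dtype=np.int32)
p = np.zeros((4, 3, 3), dtype=np.float64)

# (time, lat, lon) cell -> (tag value, p value). The positions are made up;
# the tag/p pairs follow the first two tags of the example above.
cells = {
    (0, 1, 1): (1, 2.0),
    (1, 1, 2): (1, 4.0),
    (2, 2, 1): (2, 3.0),
    (2, 0, 1): (2, 8.0),
    (3, 0, 0): (2, 1.0),
}
for index, (tag_value, p_value) in cells.items():
    tag[index] = tag_value
    p[index] = p_value


def mean_per_tag_bruteforce(tag, p):
    """Return {tag value: mean of p over all cells with that tag}, skipping tag 0."""
    result = {}
    for value in range(1, int(tag.max()) + 1):
        mask = tag == value  # boolean array marking the cells of this tag
        if mask.any():
            result[value] = float(p[mask].mean())
    return result


means = mean_per_tag_bruteforce(tag, p)
for value in sorted(means):
    print(means[value])  # one average per line, in tag order
```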

The following Python code reads the netCDF files and variables:

import datetime as dt  # Python standard library datetime module
import numpy as np
from netCDF4 import Dataset  # http://code.google.com/p/netcdf4-python/


def ncdump(nc_fid, verb=True):
    '''
    ncdump outputs dimensions, variables and their attribute information.
    The information is similar to that of NCAR's ncdump utility.
    ncdump requires a valid instance of Dataset.

    Parameters
    ----------
    nc_fid : netCDF4.Dataset
        A netCDF4 dataset object
    verb : Boolean
        whether or not nc_attrs, nc_dims, and nc_vars are printed

    Returns
    -------
    nc_attrs : list
        A Python list of the NetCDF file global attributes
    nc_dims : list
        A Python list of the NetCDF file dimensions
    nc_vars : list
        A Python list of the NetCDF file variables
    '''
    def print_ncattr(key):
        """
        Prints the NetCDF file attributes for a given key

        Parameters
        ----------
        key : unicode
            a valid netCDF4.Dataset.variables key
        """
        try:
            print "\t\ttype:", repr(nc_fid.variables[key].dtype)
            for ncattr in nc_fid.variables[key].ncattrs():
                print '\t\t%s:' % ncattr,\
                      repr(nc_fid.variables[key].getncattr(ncattr))
        except KeyError:
            print "\t\tWARNING: %s does not contain variable attributes" % key

    # NetCDF global attributes
    nc_attrs = nc_fid.ncattrs()
    if verb:
        print "NetCDF Global Attributes:"
        for nc_attr in nc_attrs:
            print '\t%s:' % nc_attr, repr(nc_fid.getncattr(nc_attr))
    nc_dims = [dim for dim in nc_fid.dimensions]  # list of nc dimensions
    # Dimension shape information.
    if verb:
        print "NetCDF dimension information:"
        for dim in nc_dims:
            print "\tName:", dim 
            print "\t\tsize:", len(nc_fid.dimensions[dim])
            print_ncattr(dim)
    # Variable information.
    nc_vars = [var for var in nc_fid.variables]  # list of nc variables
    if verb:
        print "NetCDF variable information:"
        for var in nc_vars:
            if var not in nc_dims:
                print '\tName:', var
                print "\t\tdimensions:", nc_fid.variables[var].dimensions
                print "\t\tsize:", nc_fid.variables[var].size
                print_ncattr(var)
    return nc_attrs, nc_dims, nc_vars

nc_f = './tag.nc'  # Your filename
nc_fid = Dataset(nc_f, 'r')  # Dataset opens the file read-only and
                             # returns a netCDF4.Dataset instance
nc_attrs, nc_dims, nc_vars = ncdump(nc_fid)
# Extract data from NetCDF file
lats = nc_fid.variables['lat'][:]  # extract/copy the data
lons = nc_fid.variables['lon'][:]
time = nc_fid.variables['time'][:]
tag = nc_fid.variables['tag'][:]  # shape is time, lat, lon as shown above

nc_p = '../p/p.nc'  # Your filename
nc_fid = Dataset(nc_p, 'r')  # Dataset opens the file read-only and
                             # returns a netCDF4.Dataset instance
nc_attrs, nc_dims, nc_vars = ncdump(nc_fid)

p = nc_fid.variables['p'][:]  # shape is time, lat, lon as shown above

This code returns:

NetCDF Global Attributes:
NetCDF dimension information:
        Name: time
                size: 365
                type: dtype('float64')
                axis: u'T'
                calendar: u'standard'
                standard_name: u'time'
                units: u'hours since 1800-01-01 00:00'
        Name: lat
                size: 287
                type: dtype('float64')
                long_name: u'latitude'
                units: u'degrees_north'
                standard_name: u'latitude'
                axis: u'Y'
        Name: lon
                size: 612
                type: dtype('float64')
                long_name: u'longitude'
                units: u'degrees_east'
                standard_name: u'longitude'
                axis: u'X'
NetCDF variable information:
        Name: tag
                dimensions: (u'time', u'lat', u'lon')
                size: 64110060
                type: dtype('int32')

I have been experimenting with the pandas groupby function, but I have not found anything that fits my example.

Tags: python pandas numpy netcdf

【Solution 1】:

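I found a fast and efficient solution. I checked the results, and they are correct.

I open the data with xarray and convert it to a DataFrame. After that, pandas groupby can do the calculation.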

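import numpy as np
import pandas as pd
import xarray as xr
import netCDF4

# Open both files with xarray and combine them by their shared coordinates
dt = xr.open_mfdataset(['../tag.nc', '../p/p.nc'], combine='by_coords')

# Convert to a DataFrame: one row per (time, lat, lon) cell, with tag and p as columns
dtdf = dt.to_dataframe()

# Mean of p for each distinct tag value
dm = {'p': ['mean']}
mean = dtdf.groupby('tag').agg(dm)
# Flatten the two-level column names, e.g. ('p', 'mean') -> 'p_mean'
mean.columns = ['_'.join(col) for col in mean.columns.values]
# Keep tags from 1 onward, which drops the unwanted tag 0
p_mean = mean.loc[1:, 'p_mean']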
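
The ncdump output in the question reports 64110060 cells per variable, so the DataFrame route builds one row per cell. If memory becomes a concern, the same per-tag means can probably be obtained directly in NumPy with `np.bincount`. The following is a minimal sketch, not the method used above. It assumes tag holds non-negative integers and that neither array has masked or NaN cells. The helper name `mean_per_tag`, the toy arrays, and the output filename are hypothetical.

```python
import os
import tempfile

import numpy as np


def mean_per_tag(tag, p):
    """Mean of p for every tag value >= 1 that occurs at least once.

    Assumes tag contains non-negative integers and p has no missing values.
    Returns (tag_values, means) as two 1D arrays in increasing tag order.
    """
    tag_flat = np.asarray(tag).ravel()
    p_flat = np.asarray(p, dtype=np.float64).ravel()
    # Sum of p per tag value, and number of cells per tag value
    sums = np.bincount(tag_flat, weights=p_flat, minlength=1)
    counts = np.bincount(tag_flat, minlength=1)
    present = counts > 0
    present[0] = False  # tag 0 is not wanted
    values = np.nonzero(present)[0]
    return values, sums[values] / counts[values]


# Toy data with a hypothetical (time, lat, lon) shape, reusing the p values
# of tags 1 and 2 from the question's example.
tag = np.zeros((4, 3, 3), dtype=np.int32)
p = np.zeros((4, 3, 3), dtype=np.float64)
tag[0, 1, 1], p[0, 1, 1] = 1, 2.0
tag[1, 1, 2], p[1, 1, 2] = 1, 4.0
tag[2, 2, 1], p[2, 2, 1] = 2, 3.0
tag[2, 0, 1], p[2, 0, 1] = 2, 8.0
tag[3, 0, 0], p[3, 0, 0] = 2, 1.0

values, means = mean_per_tag(tag, p)

# Write one average per line, as the question asks (hypothetical filename)
out_path = os.path.join(tempfile.gettempdir(), "p_mean_by_tag.txt")
np.savetxt(out_path, means, fmt="%g")
```

With the real files, `tag` and `p` would be the arrays read in the question. If netCDF4 returns masked arrays, the masked cells would need to be filtered out before calling the helper.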
