Faster way to iterate through an xarray and a DataFrame (answers)

【Question title】: Faster way to iterate through an xarray and a DataFrame
【Posted】: 2021-05-27 04:28:28
【Question】:

I am new to Python and do not know all of its aspects.

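I want to loop through a DataFrame (2D) and assign some of its values to an xarray (3D). The coordinates of my xarray are company tickers (1), financial variables (2), and daily dates (3). Each company's DataFrame has columns for some of the same financial variables as the xarray, and its index consists of quarterly dates.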

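My goal is to take the already generated DataFrame for each company, look up the value in a given variable's column and a given date's row, and assign it to the corresponding position in the xarray.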

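Since some dates will not appear in the DataFrame's index (it has only 4 dates per calendar year), I want to assign to that position in the xarray either 0 or, if it is not 0, the value at the previous date in the xarray. I tried doing this with nested for loops, but it takes around 20 seconds just to go through all the dates for a single variable.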

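My list of dates has about 8000 dates, my list of variables about 30 variables, and my list of companies about 800 companies. Looping over all of them with nested for loops would take days. Is there a faster way to assign these values to the xarray? My guess is something like iterrows() or iteritems(), but for xarray. Below is sample code from my program, with shorter lists of companies and variables: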

import pandas as pd
from datetime import datetime, date, timedelta
import numpy as np
import xarray as xr
import time

start_time = time.time()

# We create the df. This is an auxiliary, made-up df. It's a shorter version of the real df.
# The real df I want to use is much larger and comes from an external method.
cols = ['cashAndCashEquivalents', 'shortTermInvestments', 'cashAndShortTermInvestments', 'totalAssets',
        'totalLiabilities', 'totalStockholdersEquity', 'netIncome', 'freeCashFlow']
rows = []
for year in range(1989, 2020):
    for month, day in zip([3, 6, 9, 12], [31, 30, 30, 31]):
        rows.append(date(year, month, day))
a = np.random.randint(100, size=(len(rows), len(cols)))
df = pd.DataFrame(data=a, columns=cols)
df.insert(column='date', value=rows, loc=0)
# Convert the dates to pandas timestamps so that the values can be looked up later
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# Coordinates for the xarray:
companies = ['AAPL']  # This is actually longer (around 800 companies), but for the sake of the question, it is limited to just one company.
variables = ['totalAssets', 'totalLiabilities', 'totalStockholdersEquity']  # Same as with the companies (around 30 variables).
first_date = date(1998, 3, 25)
last_date = date.today() + timedelta(-300)
dates = pd.date_range(start=first_date, end=last_date).tolist()

# We create a zero xarray, so that we can later fill it up with values:
z = np.zeros((len(companies), len(variables), len(dates)))
ds = xr.DataArray(z, coords=[companies, variables, dates],
                  dims=['companies', 'variables', 'dates'])

# We assign values from the df to the ds
for company in companies:
    for variable in variables:
        first_value_found = False
        for date in dates:
            # Dates in the df are quarterly dates and dates in the ds are daily dates.
            # We start off by looking for a certain date in the df. If we don't find it, we give it the value 0 in the ds.
            # If we do find it, we assign it the value found in the df and record that the first value has been found.
            # Once the first value has been found, a date missing from the df gets the value of the previous date instead of 0.
            if not first_value_found:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                    first_value_found = True
                except KeyError:  # date is not in the df index
                    ds.loc[company, variable, date] = 0
            else:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                except KeyError:  # date is not in the df index
                    ds.loc[company, variable, date] = ds.loc[company, variable, date + timedelta(-1)]

print("My program took", time.time() - start_time, "to run")

The main problem is the for loops: I have tested these loops in a separate file, and they seem to be the most time-consuming part.

【Comments】:

Tags: python pandas performance loops python-xarray

【Solution 1】:

One possible strategy is to loop over the dates that actually appear in the DataFrame's index instead of over every possible date:

avail_dates = df.index
for date in avail_dates:
    # Copy the data

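That should already cut down the number of iterations considerably. You still have to make sure all the gaps get filled, so you would do something like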

    da.loc[company, variables, date:] = df.loc[date, variables]
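As a rough illustration of that idea, here is a minimal sketch on made-up data (three quarterly rows, two variables). The sample values, the name quarter_end, and the reshape to (n_variables, 1) are my own assumptions; the reshape is there so that the row broadcasts across all remaining dates.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up quarterly data for one company (illustrative values only).
variables = ["totalAssets", "totalLiabilities"]
quarter_ends = pd.to_datetime(["2020-03-31", "2020-06-30", "2020-09-30"])
df = pd.DataFrame([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],
                  index=quarter_ends, columns=variables)

# Daily target array, initialised to 0 as in the question.
companies = ["AAPL"]
dates = pd.date_range("2020-03-25", "2020-10-05")
da = xr.DataArray(np.zeros((len(companies), len(variables), len(dates))),
                  coords=[companies, variables, dates],
                  dims=["companies", "variables", "dates"])

company = "AAPL"
# Loop only over the dates present in the DataFrame, in ascending order.
for quarter_end in df.index.sort_values():
    # Shape (n_variables, 1), so the row broadcasts along the dates axis.
    values = df.loc[quarter_end, variables].to_numpy()[:, np.newaxis]
    # The label slice "quarter_end:" overwrites this date and every later one;
    # later iterations then overwrite the tail with newer values.
    da.loc[company, variables, quarter_end:] = values
```

With this layout the loop runs once per quarterly row rather than once per daily date, and days before the first quarterly date keep their initial 0.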

Yes, you can index both a DataArray and a DataFrame with a list. (Also, I would not use ds as the variable name for an xarray DataArray; I would reserve it for a Dataset.)

What you probably want to use, though, is pandas.DataFrame.reindex().

If I understand what you are trying to do, this should more or less do the trick (untested):

complete_df = df.reindex(dates, method='pad', fill_value=0)
da.loc[company, variables, :] = complete_df.loc[:, variables].T
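For reference, a small self-contained sketch of the reindex approach on made-up data follows. It is a sketch under assumptions, not a drop-in replacement: the sample values are invented, fillna(0) is used in place of fill_value=0 only to make the zero-fill step explicit, and the DataFrame is converted with to_numpy() before assignment so the shape (variables, dates) is unambiguous.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up quarterly data for one company (illustrative values only).
variables = ["totalAssets", "totalLiabilities"]
quarter_ends = pd.to_datetime(["2020-03-31", "2020-06-30", "2020-09-30"])
df = pd.DataFrame([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],
                  index=quarter_ends, columns=variables)

# Daily target array, initialised to 0 as in the question.
companies = ["AAPL"]
dates = pd.date_range("2020-03-25", "2020-10-05")
da = xr.DataArray(np.zeros((len(companies), len(variables), len(dates))),
                  coords=[companies, variables, dates],
                  dims=["companies", "variables", "dates"])

# Forward-fill the quarterly rows onto the daily calendar, then set the
# days that precede the first quarterly row to 0.
complete_df = df[variables].sort_index().reindex(dates, method="pad").fillna(0)

# complete_df is (dates, variables); transpose to (variables, dates)
# to match the slice of the DataArray being assigned.
da.loc["AAPL", variables, :] = complete_df.to_numpy().T
```

One difference from the loop in the question is worth checking against real data: if the DataFrame contains rows dated before the first entry of dates (as the sample df in the question does), the forward fill carries the last earlier value into the start of the range, whereas the loop writes 0 until the first matching date. Restricting the DataFrame first, for example with df.loc[dates[0]:], should reproduce the loop's behaviour.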

【Comments】: