【Posted】: 2021-05-27 04:28:28
【Question】:
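I am new to Python and do not know all of its features yet.
I want to iterate over a DataFrame (2D) and assign some of its values to an xarray (3D).
The coordinates of my xarray are company tickers (1), financial variables (2), and daily dates (3).
Each company's DataFrame has columns that are some of the same financial variables as in the xarray, and its index consists of quarterly dates.
My goal is to take an already-generated DataFrame for each company, look up the value in a given variable's column and a given date's row, and assign it to the corresponding position in the xarray.
Since some dates will not appear in the DataFrame's index (there are only 4 dates per calendar year), I want to assign to that position in the xarray either 0 or, if it is not 0, the value of the previous date in the xarray.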
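I tried doing this with nested for loops, but it takes about 20 seconds just to iterate over all the dates for a single variable.
My list of dates has about 8,000 dates, the list of variables has about 30 variables, and the list of companies has about 800 companies.
If I looped over all of them, the nested for loops would take days to finish (roughly 20 s x 30 x 800 = 480,000 s, about 5.5 days).
Is there a faster way to assign these values to the xarray? My guess is something like iterrows() or iteritems(), but for xarray.
Below is sample code from my program, with shorter lists of companies and variables: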
import pandas as pd
from datetime import datetime, date, timedelta
import numpy as np
import xarray as xr
import time
start_time = time.time()
# We create the df. This is an auxiliary made-up df. It's a shorter version of the real df.
# The real df I want to use is much larger and comes from an external method.
cols = ['cashAndCashEquivalents', 'shortTermInvestments', 'cashAndShortTermInvestments', 'totalAssets',
        'totalLiabilities', 'totalStockholdersEquity', 'netIncome', 'freeCashFlow']
rows = []
for year in range(1989, 2020):
    for month, day in zip([3, 6, 9, 12], [31, 30, 30, 31]):
        rows.append(date(year, month, day))
a = np.random.randint(100, size=(len(rows), len(cols)))
df = pd.DataFrame(data=a, columns=cols)
df.insert(column='date', value=rows, loc=0)
# This is just to set the date format so that I can later look up the values
for item, i in zip(df.iloc[:, 0], range(len(df.iloc[:, 0]))):
    df.iloc[i, 0] = datetime.strptime(str(item), '%Y-%m-%d')
df.set_index('date', inplace=True)
# Coordinates for the xarray:
companies = ['AAPL'] # This is actually longer (around 800 companies), but for the sake of the question, it is limited to just one company.
variables = ['totalAssets', 'totalLiabilities', 'totalStockholdersEquity'] # Same as with the companies (around 30 variables).
first_date = date(1998, 3, 25)
last_date = date.today() + timedelta(-300)
dates = pd.date_range(start=first_date, end=last_date).tolist()
# We create a zero xarray, so that we can later fill it up with values:
z = np.zeros((len(companies), len(variables), len(dates)))
ds = xr.DataArray(z, coords=[companies, variables, dates],
                  dims=['companies', 'variables', 'dates'])
# We assign values from the df to the ds
for company in companies:
    for variable in variables:
        first_value_found = False
        for date in dates:
            # Dates in the df are quarterly dates and dates in the ds are daily dates.
            # We start off by looking for a certain date in the df. If we don't find it, we give it the value 0 in the ds.
            # If we do find it, we assign it the value found in the df and record that the first value has been found.
            # Once the first value has been found, a date missing from the df gets the value of the previous date instead of 0.
            if not first_value_found:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                    first_value_found = True
                except KeyError:  # date is not in the df index
                    ds.loc[company, variable, date] = 0
            else:
                try:
                    ds.loc[company, variable, date] = df.loc[date, variable]
                except KeyError:  # date is not in the df index
                    ds.loc[company, variable, date] = ds.loc[company, variable, date + timedelta(-1)]
print("My program took", time.time() - start_time, "to run")
The main problem is the for loops: I have tested them in a separate file, and they seem to be the most time-consuming part.
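For reference, the fill rule described in the question (0 until the first quarterly value is found, then carry the last value forward) can probably be expressed without a per-date loop. The following is only a minimal sketch under assumptions, not a tested solution for the real data: the helper `fill_company`, the `frames` dictionary mapping each ticker to its DataFrame, and the small example values are all hypothetical. It uses pandas `reindex`, `ffill` and `fillna`, then stacks the per-company arrays into one `xr.DataArray`.

```python
import numpy as np
import pandas as pd
import xarray as xr


def fill_company(df, variables, dates):
    """Hypothetical helper: return a (variables x dates) float array for one company."""
    daily = (
        df[variables]
        .reindex(dates)  # keep exact date matches only; all other days become NaN
        .ffill()         # carry the last quarterly value forward
        .fillna(0)       # days before the first match stay 0
    )
    return daily.to_numpy(dtype=float).T


# Small made-up example (hypothetical values, not real financial data).
quarter_ends = pd.to_datetime(["2020-03-31", "2020-06-30"])
df = pd.DataFrame(
    {"totalAssets": [10, 20], "totalLiabilities": [1, 2]},
    index=quarter_ends,
)
companies = ["AAPL"]
variables = ["totalAssets", "totalLiabilities"]
dates = pd.date_range("2020-03-29", "2020-07-02")

# Assumption: one DataFrame per ticker, collected in a dict.
frames = {"AAPL": df}

# One vectorized pass per company instead of one assignment per date.
data = np.stack([fill_company(frames[c], variables, dates) for c in companies])
ds = xr.DataArray(
    data,
    coords=[companies, variables, dates],
    dims=["companies", "variables", "dates"],
)
```

Reindexing first and forward-filling afterwards is meant to mirror the loop: a quarterly value dated before the first daily date is not pulled in, so the leading days stay 0 just as they do in the loop version.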
【Discussion】:
Tags: python pandas performance loops python-xarray