【Posted】: 2014-05-10 10:40:27
【Question】:
I have the following problem. Suppose this is my CSV:
id f1 f2 f3
1 4 5 5
1 3 1 0
1 7 4 4
1 4 3 1
1 1 4 6
2 2 6 0
..........
So I have rows that can be grouped by id. I want to create a CSV like the one below as output.
f1 f2 f3 f1_n f2_n f3_n f1_n_n f2_n_n f3_n_n f1_t f2_t f3_t
4 5 5 3 1 0 7 4 4 1 4 6
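So I want to be able to choose how many rows to convert into columns (always starting from the first row of an id). In this case I grab 3 rows. I then skip one or more rows (only one in this case) and take the final columns from the last row of the same id group. For certain reasons, I want to use data frames.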
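To make the target layout concrete, here is a minimal sketch that builds the expected row for id 1 from the sample data above. It assumes the CSV columns are named id, f1, f2 and f3; the names `flatten_group` and `n_rows` are hypothetical and used only for illustration.

```python
import pandas as pd

# Sample rows from the question (column names assumed to be id, f1, f2, f3).
sample = pd.DataFrame(
    [[1, 4, 5, 5], [1, 3, 1, 0], [1, 7, 4, 4], [1, 4, 3, 1], [1, 1, 4, 6], [2, 2, 6, 0]],
    columns=["id", "f1", "f2", "f3"],
)
features = ["f1", "f2", "f3"]


def flatten_group(group, n_rows):
    """Return the first n_rows feature rows, then the last feature row, as one flat list."""
    values = group[features].to_numpy()
    head = values[:n_rows].ravel().tolist()  # first n_rows rows, laid out row by row
    target = values[-1].tolist()             # last row of the group
    return head + target


row = flatten_group(sample[sample["id"] == 1], n_rows=3)
print(row)  # [4, 5, 5, 3, 1, 0, 7, 4, 4, 1, 4, 6]
```

The fourth row of id 1 (4 3 1) is the skipped one: it is neither among the first three rows nor the last row.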
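After struggling for 3-4 hours, I found the solution given below. But my solution is slow. I have about 700,000 rows and perhaps about 70,000 id groups. With model=3, the code below takes almost an hour on my 4 GB, 4-core Lenovo. I need to run it with model = 10 or maybe 15. I am still new to Python, and I am sure some changes could speed this up. Can someone explain in depth how I can improve the code?
Thanks a lot.
model: the number of rows to grab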
import pandas as pd

# train data frame from reading the csv
train = pd.read_csv(filename)
# Get groups of rows with same id
csv_by_id = train.groupby('id')
# Lists rather than sets, so that the column order is deterministic
modelTarget = ['f1_t', 'f2_t', 'f3_t']
# modelFeatures is a list of features I am interested in the csv.
# The csv actually has hundreds
modelFeatures = ['f1', 'f2', 'f3']
coreFeatures = list(modelFeatures)  # cloning
selectedFeatures = list(modelFeatures)  # cloning
newFeatures = list(selectedFeatures)  # cloning
finalFeatures = list(selectedFeatures)  # cloning
# Now create the column list depending on the number of rows I will grab from
for x in range(2, model + 1):
    newFeatures = [s + '_n' for s in newFeatures]
    finalFeatures = finalFeatures + newFeatures
# This is the final column list for my one row in the final data frame
selectedFeatures = finalFeatures + list(modelTarget)
# Empty dataframe which I want to populate
model_data = pd.DataFrame(columns=selectedFeatures)
for id_group in csv_by_id:
    # id_group is a tuple with first element as the id itself and second one a dataframe with the rows of a group
    group_data = id_group[1]
    # hmm - can this be better? I am picking up the rows which I need from the first row onwards
    df = group_data[coreFeatures][0:model]
    # initialize a list
    tmp = []
    # now keep adding the column values into the list
    for index, row in df.iterrows():
        tmp = tmp + list(row)
    # Wow, this one below surely should have something better.
    # So I am picking up the feature column values from the last row of the group of rows for a particular id
    targetValues = group_data[coreFeatures][len(group_data.index)-1:len(group_data.index)].values
    # Think this can be done easier too? Basically adding the values to the tmp list again
    tmp = tmp + list(targetValues.flatten())
    # converting the list to a dict.
    tmpDict = dict(zip(selectedFeatures, tmp))
    # then the dict to a dataframe.
    tmpDf = pd.DataFrame(tmpDict, index=[1])
    # I just could not find a better way of adding a dict or list directly into a dataframe.
    # And I went through lots and lots of blogs on this topic, including some on StackOverflow.
    # finally I add the frame to my main frame
    # (pd.concat here, because DataFrame.append was removed in pandas 2.0)
    model_data = pd.concat([model_data, tmpDf])
# and write it
model_data.to_csv(wd + 'model_data' + str(model) + '.csv', index=False)
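One likely source of the slowness is that the loop above grows the result frame one row at a time, which copies the whole frame on every iteration, and builds a small DataFrame per group. The following is a minimal sketch of one possible alternative, not a definitive implementation: it numbers the rows inside each id group with `groupby(...).cumcount()` and then loops over row positions (only `model` iterations) instead of over groups. The helper name `build_model_data` is hypothetical, and the column names id, f1, f2, f3 are taken from the sample data.

```python
import pandas as pd


def build_model_data(train, features, model):
    """Hypothetical helper: one row per id from the first `model` rows plus the last row."""
    # Position of every row inside its id group: 0, 1, 2, ...
    df = train[["id"] + features].assign(pos=train.groupby("id").cumcount())

    parts = []
    for p in range(model):  # loops over row positions, not over the id groups
        rows = df.loc[df["pos"] == p].set_index("id")[features]
        parts.append(rows.add_suffix("_n" * p))  # f1, then f1_n, then f1_n_n, ...

    # The last row of every group supplies the *_t target columns.
    target = df.groupby("id").tail(1).set_index("id")[features]
    parts.append(target.add_suffix("_t"))

    # Align all pieces on id; groups with fewer than `model` rows get NaN.
    return pd.concat(parts, axis=1)


sample = pd.DataFrame(
    [[1, 4, 5, 5], [1, 3, 1, 0], [1, 7, 4, 4], [1, 4, 3, 1], [1, 1, 4, 6], [2, 2, 6, 0]],
    columns=["id", "f1", "f2", "f3"],
)
out = build_model_data(sample, ["f1", "f2", "f3"], model=3)
print(out)
```

The result is indexed by id, so writing it with `to_csv(..., index=False)` reproduces the id-less layout from the question. Note one behavioural difference: for a group with fewer than `model` rows this sketch leaves the missing positions as NaN, whereas the loop above would shift the remaining values into the wrong columns.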
【Comments】:
Tags: python pandas dataframe data-processing