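【Posted】: 2020-09-19 13:43:27
【Problem description】:
- Turn the row values of the option_labels column into column headers.
- For each user_id, if an option_labels value is present, the newly created column for that label gets the matching option_values value; otherwise it gets 0.
Sample data (data.csv):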
user_id country option_values option_labels
abc456 Germany 256gb SSD
abc123 Brazil i5 intel
xyz456 France 128gb SSD
xyz123 Turkey i7 intel
abc123 Brazil 2gb nvidia
abc456 Germany 32gb RAM
xyz123 Turkey 4gb nvidia
xyz456 France 16gb RAM
Expected output:
user_id country option_values option_labels intel nvidia SSD RAM
abc456 Germany 256gb SSD 0 0 256gb 0
abc123 Brazil i5 intel i5 0 0 0
xyz456 France 128gb SSD 0 0 128gb 0
xyz123 Turkey i7 intel i7 0 0 0
abc123 Brazil 2gb nvidia 0 2gb 0 0
abc456 Germany 32gb RAM 0 0 0 32gb
xyz123 Turkey 4gb nvidia 0 4gb 0 0
xyz456 France 16gb RAM 0 0 0 16gb
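For reference, the row-level reshaping described above can be sketched as follows. This is only a sketch under assumptions: the sample rows are rebuilt inline instead of being read from data.csv, the helper name `spread_labels` is hypothetical, and the new columns come out in order of first appearance (SSD, intel, nvidia, RAM) rather than in the order shown in the expected output. It loops over the distinct labels only, not over the rows.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, rebuilt inline
# (assumption: the real data.csv has the same four columns).
data_sample = pd.DataFrame(
    {
        "user_id": ["abc456", "abc123", "xyz456", "xyz123",
                    "abc123", "abc456", "xyz123", "xyz456"],
        "country": ["Germany", "Brazil", "France", "Turkey",
                    "Brazil", "Germany", "Turkey", "France"],
        "option_values": ["256gb", "i5", "128gb", "i7",
                          "2gb", "32gb", "4gb", "16gb"],
        "option_labels": ["SSD", "intel", "SSD", "intel",
                          "nvidia", "RAM", "nvidia", "RAM"],
    }
)


def spread_labels(df):
    """Hypothetical helper: add one column per distinct label.

    Each new column holds the row's option value where the row's label
    matches the column name, and 0 everywhere else.
    """
    out = df.copy()
    for label in df["option_labels"].unique():
        matches = (df["option_labels"] == label).to_numpy()
        values = df["option_values"].to_numpy(dtype=object)
        # np.where picks the value on matching rows and 0 on the others.
        out[label] = np.where(matches, values, 0)
    return out


wide = spread_labels(data_sample)
print(wide)
```

If the column order matters, the result can be reindexed afterwards, for example with `wide[["user_id", "country", "option_values", "option_labels", "intel", "nvidia", "SSD", "RAM"]]`.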
I have implemented this with the sample code below:
import pandas as pd

data_sample = pd.read_csv("data.csv")
feature_list = data_sample["option_labels"].unique().tolist()
user_list = data_sample["user_id"].unique().tolist()
country_list = data_sample["country"].unique().tolist()
opt_val_list = data_sample["option_values"].unique().tolist()

def filterd_id(check_id):
    # All rows that belong to a single user.
    single_id_data = data_sample[data_sample["user_id"] == check_id]
    return single_id_data

def finding_features(single_id_data):
    # Map each label this user has to the matching option value.
    user_features = dict(zip(single_id_data["option_labels"], single_id_data["option_values"]))
    return user_features

def check_feature(feature_list, user_features):
    # The user's value for each label, or 0 when the label is absent.
    feature_prs_not = []
    for i in feature_list:
        if i in user_features:
            result = user_features[i]
        else:
            result = 0
        feature_prs_not.append(result)
    return feature_prs_not

user_id = []
country = []
feature_rows = []
for i in user_list:
    check_id = i
    user_id.append(i)
    single_id_data = filterd_id(check_id)
    c = single_id_data["country"].unique().tolist()
    country.append(c)
    user_features = finding_features(single_id_data)
    feature_prst_not = check_feature(feature_list, user_features)
    feature_rows.append(feature_prst_not)

# Build the frame once after the loop; DataFrame.append was removed in pandas 2.0.
df_feature = pd.DataFrame(feature_rows, columns=feature_list)
df_user_id = pd.DataFrame(user_id, columns=["all_user_id"])
df_country = pd.DataFrame(country, columns=["country_name"])
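On my real data, which has nearly 100k ids, this takes far too long to run (around 8-9 hours). I am still learning Python, and I am now trying to optimize the script to reduce its run time.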
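One direction for cutting the run time is to replace the per-user filtering loop with a single pivot. The sketch below is an assumption-laden illustration, not a measured fix: it rebuilds the sample rows inline, uses `pivot_table` with `aggfunc="first"` (so if a user repeats a label, only the first value is kept), and produces one row per user and country, which is the shape the loop above builds. No timing on the 100k-id data is claimed here.

```python
import pandas as pd

# Sample rows from the question, rebuilt inline
# (assumption: the real data.csv has the same four columns).
data_sample = pd.DataFrame(
    {
        "user_id": ["abc456", "abc123", "xyz456", "xyz123",
                    "abc123", "abc456", "xyz123", "xyz456"],
        "country": ["Germany", "Brazil", "France", "Turkey",
                    "Brazil", "Germany", "Turkey", "France"],
        "option_values": ["256gb", "i5", "128gb", "i7",
                          "2gb", "32gb", "4gb", "16gb"],
        "option_labels": ["SSD", "intel", "SSD", "intel",
                          "nvidia", "RAM", "nvidia", "RAM"],
    }
)

# One row per (user_id, country) and one column per label.
# "first" keeps the first value when a user has the same label twice.
per_user = data_sample.pivot_table(
    index=["user_id", "country"],
    columns="option_labels",
    values="option_values",
    aggfunc="first",
)

# Labels a user does not have come back as NaN; cast to object so that
# the integer 0 can sit next to the string values, then fill the gaps.
per_user = per_user.astype(object).fillna(0).reset_index()
print(per_user)
```

The pivot sorts both the users and the label columns, so the output order differs from the order in which the loop visits them.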
【Discussion】:
Tags: python-3.x jupyterhub