【Question title】: I'm working on a Python problem to optimize a script
【Posted】: 2020-09-19 13:43:27
【Question】:
  1. Turn the row values of the option_labels column into column headers.
  2. If a given user_id has an option_labels entry, fill the new column with the matching option_values value; otherwise fill it with 0.

Sample data (data.csv):

 user_id       country        option_values        option_labels

 abc456         Germany        256gb                  SSD
 abc123         Brazil         i5                    intel 
 xyz456         France         128gb                  SSD
 xyz123         Turkey         i7                    intel 
 abc123         Brazil         2gb                   nvidia
 abc456         Germany        32gb                   RAM
 xyz123         Turkey         4gb                   nvidia
 xyz456         France         16gb                   RAM

Expected output:

 user_id       country        option_values     option_labels     intel         nvidia       SSD        RAM 

 abc456         Germany        256gb             SSD                0              0        256gb        0
 abc123         Brazil         i5                intel              i5             0          0          0
 xyz456         France         128gb             SSD                0              0        128gb        0
 xyz123         Turkey         i7                intel              i7             0          0          0
 abc123         Brazil         2gb               nvidia             0              2gb        0          0  
 abc456         Germany        32gb              RAM                0              0          0          32gb
 xyz123         Turkey         4gb               nvidia             0              4gb        0          0
 xyz456         France         16gb              RAM                0              0          0          16gb

I have implemented this with the sample code below:

 import pandas as pd
 import numpy as np

 data_sample = pd.read_csv("data.csv")
 feature_list = data_sample["option_labels"].unique().tolist()
 user_list = data_sample["user_id"].unique().tolist()
 country_list = data_sample["country"].unique().tolist()
 opt_val_list = data_sample["option_values"].unique().tolist()

 def filterd_id(check_id):
     single_id_data= data_sample[data_sample['user_id'] == check_id]
     return single_id_data

 def finding_features(single_id_data):
     user_features = single_id_data["option_labels"].unique().tolist()
     return user_features

 def check_feature(feature_list, user_features): 
     feature_prs_not = []
     for i in feature_list:
         if(i in user_features):
             # Note: this assigns the full list of unique option values, not this user's own value for label i
             result = opt_val_list
         else:
             result = 0 
         feature_prs_not.append(result)          
     return feature_prs_not 

 user_id = []
 country = []
 feature_frames = []  # one single-row frame per user, concatenated after the loop

 for i in user_list:
     check_id = i
     user_id.append(i)
     # Filter the full table once per user; this repeated scan is the slow part
     single_id_data = filterd_id(check_id)
     c = single_id_data["country"].unique().tolist()
     country.append(c)
     user_features = finding_features(single_id_data)
     feature_prst_not = check_feature(feature_list, user_features)
     df = pd.DataFrame([feature_prst_not], columns=feature_list)
     feature_frames.append(df)
 # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
 df_feature = pd.concat(feature_frames)
 df_user_id = pd.DataFrame(user_id, columns=['all_user_id'])
 df_country = pd.DataFrame(country, columns=['country_name'])

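On my real data, with nearly 100k ids, this takes a very long time to run (about 8-9 hours). I am still learning Python and am now trying to optimize the script to reduce its run time.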

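【Comments on the question】:

    Tags: python-3.x jupyterhub


    【Solution 1】:

    If you want it to be faster, you need to vectorize. I believe this code produces the same output as yours: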

    import numpy as np
    
    for val in df['option_labels'].unique():
        df[val] = np.where(df['option_labels']==val, df['option_values'], 0)
    

    This is how I reproduced your data:

    import pandas as pd
    from io import StringIO
    
    # Trailing spaces after the quoted fields are removed so labels parse as "intel", not "intel "
    df = pd.read_csv(StringIO('''
    "user_id","country","option_values","option_labels"
    "abc456","Germany","256gb","SSD"
    "abc123","Brazil","i5","intel"
    "xyz456","France","128gb","SSD"
    "xyz123","Turkey","i7","intel"
    "abc123","Brazil","2gb","nvidia"
    "abc456","Germany","32gb","RAM"
    "xyz123","Turkey","4gb","nvidia"
    "xyz456","France","16gb","RAM"'''))
    
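The per-label assignment can also be written without an explicit Python loop. The following is only a sketch of one alternative, not the answerer's method: it assumes the row index is unique (the default after `read_csv`) and that no label collides with an existing column name. The names `label_cols` and `result` are illustrative and do not appear in the thread.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "user_id": ["abc456", "abc123", "xyz456", "xyz123",
                "abc123", "abc456", "xyz123", "xyz456"],
    "country": ["Germany", "Brazil", "France", "Turkey",
                "Brazil", "Germany", "Turkey", "France"],
    "option_values": ["256gb", "i5", "128gb", "i7",
                      "2gb", "32gb", "4gb", "16gb"],
    "option_labels": ["SSD", "intel", "SSD", "intel",
                      "nvidia", "RAM", "nvidia", "RAM"],
})

# One column per label, aligned on the original row index.
# Cells are NaN wherever the row carries a different label.
label_cols = df.pivot(columns="option_labels", values="option_values")

# Cast to object first so that string values and the integer 0 can share a column.
label_cols = label_cols.astype(object).fillna(0)

# Attach the new label columns to the original rows.
result = df.join(label_cols)
print(result)
```

Unlike the loop, `pivot` orders the new columns by sorted label rather than by first appearance, so the column order may differ from the expected output in the question.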

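    【Comments】:

    • Thanks for the quick response, it works on my data. I had written a lot of code for what these two lines do; lesson learned!
    • Hi, is there a way to group by id so that there is one row per user_id instead of several, to avoid memory errors, with all of a user_id's option_values on the same row?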
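For the follow-up about one row per user_id, one possible approach is to group the rows and unstack the labels into columns. This is a minimal sketch and not something confirmed in the thread: it assumes that if a user has the same label more than once, keeping the first value is acceptable. The name `wide` is illustrative.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "user_id": ["abc456", "abc123", "xyz456", "xyz123",
                "abc123", "abc456", "xyz123", "xyz456"],
    "country": ["Germany", "Brazil", "France", "Turkey",
                "Brazil", "Germany", "Turkey", "France"],
    "option_values": ["256gb", "i5", "128gb", "i7",
                      "2gb", "32gb", "4gb", "16gb"],
    "option_labels": ["SSD", "intel", "SSD", "intel",
                      "nvidia", "RAM", "nvidia", "RAM"],
})

wide = (
    df.groupby(["user_id", "country", "option_labels"])["option_values"]
      .first()                   # assumption: keep the first value on duplicates
      .unstack("option_labels")  # labels become columns, one row per user/country
      .astype(object)            # allow strings and the integer 0 in one column
      .fillna(0)                 # labels the user does not have become 0
      .reset_index()
)
wide.columns.name = None  # drop the leftover "option_labels" axis name
print(wide)
```

On the sample data this yields four rows, one per user_id, instead of eight.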