【Posted】: 2017-03-10 16:34:15
【Question】:
I have a Spark DataFrame that I want to split into training, validation, and test sets in a 0.60 / 0.20 / 0.20 ratio.
I used the following code:
def data_split(x):
    global data_map_var
    d_map = data_map_var.value
    data_row = x.asDict()
    import random
    rand = random.uniform(0.0, 1.0)
    ret_list = ()
    if rand <= 0.6:
        ret_list = (data_row['TRANS'], d_map[data_row['ITEM']], data_row['Ratings'], 'train')
    elif rand <= 0.8:
        ret_list = (data_row['TRANS'], d_map[data_row['ITEM']], data_row['Ratings'], 'test')
    else:
        ret_list = (data_row['TRANS'], d_map[data_row['ITEM']], data_row['Ratings'], 'validation')
    return ret_list

split_sdf = ratings_sdf.map(data_split)
train_sdf = split_sdf.filter(lambda x: x[-1] == 'train').map(lambda x: (x[0], x[1], x[2]))
test_sdf = split_sdf.filter(lambda x: x[-1] == 'test').map(lambda x: (x[0], x[1], x[2]))
validation_sdf = split_sdf.filter(lambda x: x[-1] == 'validation').map(lambda x: (x[0], x[1], x[2]))

print "Total Records in Original Ratings RDD is {}".format(split_sdf.count())
print "Total Records in training data RDD is {}".format(train_sdf.count())
print "Total Records in validation data RDD is {}".format(validation_sdf.count())
print "Total Records in test data RDD is {}".format(test_sdf.count())
#help(ratings_sdf)
Total Records in Original Ratings RDD is 300001
Total Records in training data RDD is 180321
Total Records in validation data RDD is 59763
Total Records in test data RDD is 59837
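My original DataFrame is ratings_sdf, which I pass through the data_split mapper function to do the split.
If you add up the training, validation, and test counts, the total does not equal the count of the split (original ratings) data: 180321 + 59763 + 59837 = 299921, not 300001. The numbers also change every time the code is run.
Where did the remaining records go, and why are the totals not equal?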
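One likely explanation, offered as a sketch and not a confirmed diagnosis: Spark evaluates RDDs lazily, and split_sdf is not cached, so each count() re-runs data_split with fresh random draws. The three filters then see three different labelings of the same rows, and their counts do not form a partition. The pure-Python simulation below imitates that behavior without Spark; label_row and count_label are hypothetical helper names introduced only for this illustration.

```python
import random

LABELS = ('train', 'test', 'validation')

def label_row(row, rng):
    """Randomly assign a row to a split, mirroring data_split in the question."""
    r = rng.uniform(0.0, 1.0)
    if r <= 0.6:
        return (row, 'train')
    elif r <= 0.8:
        return (row, 'test')
    return (row, 'validation')

def count_label(rows, label, rng):
    """Simulate one action on an uncached RDD: the map step is re-run
    (with new random draws) before the filter and the count."""
    return sum(1 for row in rows if label_row(row, rng)[1] == label)

rows = range(300001)
rng = random.Random()  # unseeded, like random.uniform inside the mapper

# Each count relabels every row independently, so the three counts
# need not sum to len(rows) and will differ from run to run.
lazy_total = sum(count_label(rows, lab, rng) for lab in LABELS)

# Labeling once and reusing the result yields a true partition.
labeled = [label_row(row, rng) for row in rows]
fixed_counts = {lab: sum(1 for _, l in labeled if l == lab) for lab in LABELS}
fixed_total = sum(fixed_counts.values())

print("recomputed per count:", lazy_total)
print("labeled once:", fixed_total)
```

In PySpark itself, two options seem worth trying under this assumption: call split_sdf.cache() before the filters (though cached partitions can still be evicted and recomputed), or skip the custom mapper and use the built-in ratings_sdf.randomSplit([0.6, 0.2, 0.2], seed=...), which is designed to return non-overlapping splits.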
【Discussion】:
Tags: python apache-spark pyspark