【Posted】: 2016-05-05 08:20:32
【Question】:
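I am running a machine learning algorithm in Julia on a machine with limited spare memory. I have noticed a fairly large bottleneck in code I am using from a repository: (randomly) splitting the array seems to take longer than reading the file from disk, which suggests the code is inefficient. Any tips for speeding up this function would be greatly appreciated. The original function can be found here. Since it is a short function, I have also posted it below.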
# Split a list of ratings into a training and test set, with at most
# target_percentage * length(ratings) in the test set. The property we want to
# preserve is: any user in some rating in the original set of ratings is also
# in the training set and any item in some rating in the original set of ratings
# is also in the training set. We preserve this property by iterating through
# the ratings in random order, adding a rating to the test set only if we
# haven't already hit target_percentage and we've already seen both the user
# and the item in some other ratings.
function split_ratings(ratings::Array{Rating,1},
                       target_percentage=0.10)
    seen_users = Set()
    seen_items = Set()
    training_set = (Rating)[]
    test_set = (Rating)[]
    shuffled = shuffle(ratings)
    for rating in shuffled
        if in(rating.user, seen_users) && in(rating.item, seen_items) && length(test_set) < target_percentage * length(shuffled)
            push!(test_set, rating)
        else
            push!(training_set, rating)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    return training_set, test_set
end
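As mentioned, any way to push the data through faster would be greatly appreciated. I should also note that I do not really need to keep the ability to remove duplicates, although it would be a nice feature. If this is already implemented in a Julia library, I would be grateful to hear about it. Bonus points for any solution that takes advantage of Julia's parallel capabilities!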
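One possible source of the slowdown is that `Set()` creates a `Set{Any}`, so every membership test and insertion works on boxed values. The following is a minimal sketch of a typed variant, not the repository's implementation. It assumes Julia 1.x and a hypothetical `Rating` struct with integer `user` and `item` fields; the real type in the repository may differ, and the function name `split_ratings_typed` is made up for illustration.

```julia
using Random  # shuffle lives in the Random stdlib on Julia 1.x

# Hypothetical stand-in for the repository's Rating type. The concrete
# field types are an assumption made for this sketch.
struct Rating
    user::Int
    item::Int
    value::Float64
end

function split_ratings_typed(ratings::Vector{Rating}, target_percentage=0.10)
    # Concretely typed sets, unlike Set(), which is a Set{Any}
    seen_users = Set{Int}()
    seen_items = Set{Int}()
    training_set = Rating[]
    test_set = Rating[]
    # Reserve space up front to reduce reallocations while pushing
    sizehint!(training_set, length(ratings))
    # The cap does not change during the loop, so compute it once
    max_test = target_percentage * length(ratings)
    for rating in shuffle(ratings)
        if rating.user in seen_users && rating.item in seen_items &&
           length(test_set) < max_test
            push!(test_set, rating)
        else
            push!(training_set, rating)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    return training_set, test_set
end
```

Calling `split_ratings_typed(ratings)` returns the same `(training_set, test_set)` pair as the original. If memory is tight and the input order does not need to be preserved, `shuffle!(ratings)` would avoid the extra copy that `shuffle` makes. The loop itself depends on the sets built up by earlier iterations, so it does not parallelize in an obvious way; whether the typed sets remove the bottleneck would need to be measured, for example with `@time`.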
【Comments】:
-
Won't automatically routing singleton ratings (those unique to an item or a user) to the training set [possibly] bias the learning algorithm?
-
I did not implement the original algorithm, and in any case I left that part out of the code I ended up using.
Tags: arrays optimization machine-learning julia