【Question title】: What is an efficient way to split an array into a training and testing set in Julia?
【Posted】: 2016-05-05 08:20:32
【Question description】:

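I am running a machine learning algorithm in Julia on a machine with limited spare memory. I have noticed a rather large bottleneck in the code I am using from a repository: (randomly) splitting the array seems to take longer than reading the file from disk, which points to an inefficiency in the code. Any tips for speeding up this function would be greatly appreciated. The original function comes from that repository; since it is short, I have also pasted it below.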

# Split a list of ratings into a training and test set, with at most
# target_percentage * length(ratings) in the test set. The property we want to
# preserve is: any user in some rating in the original set of ratings is also
# in the training set and any item in some rating in the original set of ratings
# is also in the training set. We preserve this property by iterating through
# the ratings in random order, adding an item to the test set only if we
# haven't already hit target_percentage and we've already seen both the user
# and the item in some other ratings.
function split_ratings(ratings::Array{Rating,1},
                       target_percentage=0.10)
    seen_users = Set()
    seen_items = Set()
    training_set = (Rating)[]
    test_set = (Rating)[]
    shuffled = shuffle(ratings)
    for rating in shuffled
        if in(rating.user, seen_users) && in(rating.item, seen_items) && length(test_set) < target_percentage * length(shuffled)
            push!(test_set, rating)
        else
            push!(training_set, rating)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    return training_set, test_set
end

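As stated above, any way to speed this up would be appreciated. I should also note that I do not really need to retain the ability to remove duplicates, although it would be a nice feature. If this is already implemented in a Julia library, I would be grateful for a pointer. Bonus points for any solution that takes advantage of Julia's parallel capabilities!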

【Question discussion】:

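  • Won't automatically routing singleton ratings (ones unique to an item or a user) to the training set [possibly] bias the learning algorithm?
  • I did not implement the original algorithm; in any case, I left that part out of the code I ended up using.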

Tags: arrays optimization machine-learning julia


【Solution 1】:

In terms of memory, this is the most efficient code I could come up with.

using Random  # shuffle! lives in the Random standard library on Julia 0.7 and later

function splitratings(ratings::Array{Rating,1}, target_percentage=0.10)
  N = length(ratings)
  splitindex = round(Int, target_percentage * N)
  shuffle!(ratings) # Shuffles in place, which avoids allocating another array (the caller's array is reordered).
  # view (named sub before Julia 0.5) makes subarrays instead of copying the original array.
  return view(ratings, splitindex+1:N), view(ratings, 1:splitindex) # (training set, test set)
end

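However, Julia's extremely slow file IO is now the bottleneck. This function takes about 20 seconds on an array of 170 million elements, so I would call it fairly efficient.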
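Note that the split above no longer guarantees the property from the question, namely that every user and item in the test set also appears in the training set. A minimal sketch that keeps that property while still avoiding copies might look like the following. It assumes Julia 1.x, and the `Rating` struct (with integer `user` and `item` fields) and the function name `split_ratings_typed!` are hypothetical stand-ins, not taken from the original repository. The differences from the question's version are the concretely typed sets and the use of index vectors plus views instead of pushing `Rating` values into new arrays.

```julia
using Random

# Hypothetical stand-in for the question's Rating type; field names and types are assumptions.
struct Rating
    user::Int
    item::Int
    value::Float64
end

# Shuffle in place, then walk the ratings once. A rating goes to the test set
# only if its user and item were both seen earlier and the test set is not full.
function split_ratings_typed!(ratings::Vector{Rating}, target_percentage=0.10)
    shuffle!(ratings)
    # floor keeps the test set at most target_percentage * length(ratings)
    max_test = floor(Int, target_percentage * length(ratings))
    seen_users = Set{Int}()   # concretely typed, unlike the untyped Set() in the question
    seen_items = Set{Int}()
    train_idx = Int[]
    test_idx = Int[]
    sizehint!(train_idx, length(ratings))
    sizehint!(test_idx, max_test)
    for (i, r) in enumerate(ratings)
        if length(test_idx) < max_test && r.user in seen_users && r.item in seen_items
            push!(test_idx, i)
        else
            push!(train_idx, i)
        end
        push!(seen_users, r.user)
        push!(seen_items, r.item)
    end
    # Views over index vectors reference the shuffled array instead of copying it.
    return view(ratings, train_idx), view(ratings, test_idx)
end
```

The two index vectors cost one `Int` per rating, so this uses more memory than the plain range-based split; whether that trade is worthwhile depends on how much the property matters for the learning algorithm.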

【Discussion】:

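  • This duplicates the element at splitindex in both the test and training sets; use splitindex+1:N. Also note that iround was deprecated in version 0.4.
  • Thanks for the pointer! Duplicating the element at the split index is not a terrible problem, but I will quickly fix it in my post.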
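The off-by-one discussed in these comments can be checked on the index ranges alone, without any rating data. The sketch below uses an arbitrary `N = 10` purely for illustration and compares a training range that starts at `splitindex` with one that starts at `splitindex + 1`.

```julia
N = 10
splitindex = round(Int, 0.10 * N)

test_range = 1:splitindex
overlapping_train = splitindex:N        # shares index `splitindex` with the test range
disjoint_train = (splitindex + 1):N     # the corrected range

# The overlapping variant has exactly one index in common with the test range.
@assert length(intersect(test_range, overlapping_train)) == 1

# The corrected ranges are disjoint and together cover every index once.
@assert isempty(intersect(test_range, disjoint_train))
@assert length(test_range) + length(disjoint_train) == N
```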