What is an efficient way to split an array into a training and testing set in Julia? Answers

【Question title】: What is an efficient way to split an array into a training and testing set in Julia?
【Posted】: 2016-05-05 08:20:32
【Question description】:

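I am running a machine learning algorithm in Julia on a machine with limited spare memory. I have noticed a rather large bottleneck in the code I am using from the repository: (randomly) splitting the array takes even longer than reading the file from disk, which suggests the splitting code is inefficient. Any tricks to speed up this function would be greatly appreciated. The original function can be found here. Since it is a short function, I have also pasted it below.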

# Split a list of ratings into a training and test set, with at most
# target_percentage * length(ratings) in the test set. The property we want to
# preserve is: any user in some rating in the original set of ratings is also
# in the training set and any item in some rating in the original set of ratings
# is also in the training set. We preserve this property by iterating through
# the ratings in random order, adding a rating to the test set only if we
# haven't already hit target_percentage and we've already seen both the user
# and the item in some other ratings.
function split_ratings(ratings::Array{Rating,1},
                       target_percentage=0.10)
    seen_users = Set()
    seen_items = Set()
    training_set = (Rating)[]
    test_set = (Rating)[]
    shuffled = shuffle(ratings)
    for rating in shuffled
        if in(rating.user, seen_users) && in(rating.item, seen_items) && length(test_set) < target_percentage * length(shuffled)
            push!(test_set, rating)
        else
            push!(training_set, rating)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    return training_set, test_set
end

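As mentioned above, any way to push the data through faster would be greatly appreciated. I should also note that I do not really need to keep the property described in the comment (only ratings whose user and item were already seen go to the test set), although it would be a nice feature. If this is already implemented in a Julia library, I would be grateful to know about it. Bonus points for any solution that leverages Julia's parallelism!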

【Question comments】:

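Take a look at github.com/JuliaML/MLDataUtils.jl
Wouldn't automatically routing singleton ratings (those unique to an item or a user) to the training set [possibly] bias the learning algorithm?
I did not implement the original algorithm, and in any case I dropped that part in the code I ended up using.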
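
For readers who do want to keep the seen-user/seen-item property, here is a minimal sketch of the question's function with concretely typed sets. It is not the repository's code: the `Rating` struct below is a hypothetical stand-in (the real field types are not shown in the thread), and `split_ratings_typed` is an illustrative name. `Set()` creates a `Set{Any}`, so typing the sets as `Set{Int}` may reduce allocation overhead, though whether that removes the bottleneck is untested here.

```julia
using Random

# Hypothetical stand-in for the repository's Rating type (field types assumed).
struct Rating
    user::Int
    item::Int
    value::Float64
end

# Same logic as split_ratings in the question, with typed sets and vectors.
function split_ratings_typed(ratings::Vector{Rating}, target_percentage=0.10)
    seen_users = Set{Int}()   # Set() would be Set{Any}; Set{Int} stores plain integers
    seen_items = Set{Int}()
    max_test = target_percentage * length(ratings)  # computed once, not per iteration
    training_set = Rating[]
    test_set = Rating[]
    sizehint!(training_set, length(ratings))        # reserve capacity up front
    for rating in shuffle(ratings)                  # shuffle copies; the input is untouched
        if rating.user in seen_users && rating.item in seen_items && length(test_set) < max_test
            push!(test_set, rating)
        else
            push!(training_set, rating)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    return training_set, test_set
end

ratings = [Rating(i % 5, i % 7, 1.0) for i in 1:200]
training, test = split_ratings_typed(ratings, 0.10)
```

Because the first rating seen for any user or item always goes to the training set, every user and item in the test set also appears in the training set.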

Tags: arrays optimization machine-learning julia

【Solution 1】:

In terms of memory, this is the most efficient code I could come up with.

using Random  # shuffle! lives in the Random standard library since Julia 0.7

function splitratings(ratings::Vector{Rating}, target_percentage=0.10)
  N = length(ratings)
  splitindex = round(Int, target_percentage * N)
  shuffle!(ratings)  # Shuffles in place, which avoids allocating another array
  # view (named sub in Julia 0.4) makes SubArrays instead of copying the original array
  return view(ratings, splitindex+1:N), view(ratings, 1:splitindex)  # (training, test)
end

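However, Julia's extremely slow file IO is now the bottleneck. This function takes about 20 seconds to run on an array of 170 million elements, so I would call it fairly efficient.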
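
To try the idea in isolation, here is a self-contained sketch of the same shuffle-in-place-then-view approach. The `Rating` struct is a hypothetical stand-in (the thread does not show the real definition), and `split_views` with its `rng` keyword is an illustrative addition for reproducibility, not part of the answer. It assumes Julia 1.3 or later for `Random.default_rng`.

```julia
using Random

# Hypothetical stand-in for the repository's Rating type (field types assumed).
struct Rating
    user::Int
    item::Int
    value::Float64
end

# Shuffle in place, then return (training, test) as views into the same array.
function split_views(ratings::Vector{Rating}, target_percentage=0.10;
                     rng::AbstractRNG=Random.default_rng())
    n = length(ratings)
    splitindex = round(Int, target_percentage * n)
    shuffle!(rng, ratings)   # mutates the caller's array; no second array is allocated
    return view(ratings, splitindex+1:n), view(ratings, 1:splitindex)
end

ratings = [Rating(i % 7, i % 11, 1.0) for i in 1:100]
training, test = split_views(ratings, 0.10; rng=MersenneTwister(42))
println(length(training), " training ratings, ", length(test), " test ratings")
```

The trade-off is that the caller's array is reordered and both results alias it, so a later `shuffle!` or write to `ratings` changes the training and test sets as well.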

【Discussion】:

This puts the element at splitindex into both the test set and the training set; use splitindex+1:N. Also note that iround is deprecated in version 0.4.
Thanks for the pointers! Duplicating the element at the split index is not a terrible problem, but I will quickly fix it in my post.
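
The off-by-one issue raised in this discussion can be checked directly on index ranges. The values of `n` and `splitindex` below are arbitrary illustrative numbers, not taken from the thread.

```julia
n = 10
splitindex = 3

test_range  = 1:splitindex
overlapping = splitindex:n       # shares index 3 with the test range
disjoint    = splitindex+1:n     # parses as (splitindex + 1):n

println(intersect(test_range, overlapping))        # the shared index range
println(isempty(intersect(test_range, disjoint)))  # no shared indices
println(length(test_range) + length(disjoint) == n)
```

With the disjoint ranges, the two parts cover every index exactly once, so no rating can end up in both sets.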