【Title】: R: How to split data into training and testing sets while preserving proportions & distributions of variables?
【Posted】: 2020-08-15 09:14:24
【Question】:

Reproducible example:

library(caTools) # for sample.split()
set.seed(123)
# Create example data frame
example_df <- data.frame(personID = stringi::stri_rand_strings(1000, 5),
                         sex = sample(1:2, 1000, replace = TRUE),
                         age = round(rnorm(1000, mean = 50, sd = 15), 0))

# Example of random splitting:
# sample.split() returns a logical vector (TRUE = training row)
in_train <- sample.split(example_df$personID)
training_set <- example_df[in_train, ]
test_set <- example_df[!in_train, ]

# Evaluation of variables in the test and training data sets:
  # Should be close to 1 (in this case it's about 1.2, which is too high)
  (sum(training_set$sex == 1) / sum(training_set$sex == 2)) / (sum(test_set$sex == 1) / sum(test_set$sex == 2)) 
  [1] 1.219139
  # Should be close to 1 across the distribution (it's quite good, this is actually what I would expect)
  summary(training_set$age) / summary(test_set$age)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.7143  0.9756  1.0000  1.0032  1.0169  1.0000 

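While sample.split divides age appropriately (the distributions match), the ratio of males to females in the sex variable differs noticeably between the two sets. Which function can I use to automatically split data into several (in this case two) sets while preserving the proportions and distributions of the variables?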

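【Comments】:

  • A good question in general. However, it is not fully reproducible. Where does sample.split come from? What is the rogue > doing in the data.frame creation? I strongly recommend taking a look at the reprex package.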

Tags: r testing


【Solution 1】:

The caret package will build balanced sets for you. See the package vignette, which covers the basics. For example:

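library(caret)
library(mlbench) # provides the Sonar data set
data(Sonar)

inTrain <- createDataPartition(
  y = Sonar$Class,
  ## the outcome data are needed
  p = .75,
  ## the proportion of data in the
  ## training set
  list = FALSE
)
## inTrain holds the row indices of the training set
training <- Sonar[inTrain, ]
testing <- Sonar[-inTrain, ]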
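
To apply the same idea to the data frame from the question, one possible approach (a sketch, not part of the original answer) is to stratify on a factor that combines sex with binned age. createDataPartition samples within the levels of a factor outcome, so both variables should end up roughly balanced. The names age_bin and strata, and the choice of quartile bins, are assumptions made for illustration. personID is left out for brevity.

```r
library(caret)

set.seed(123)
# Data shaped like the question's example (personID omitted for brevity)
example_df <- data.frame(
  sex = sample(1:2, 1000, replace = TRUE),
  age = round(rnorm(1000, mean = 50, sd = 15), 0)
)

# Assumption: bin age into quartiles, then cross the bins with sex
age_bin <- cut(example_df$age,
               breaks = quantile(example_df$age, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE)
strata <- interaction(example_df$sex, age_bin, drop = TRUE)

# Sampling is done within each level of the combined factor
in_train <- createDataPartition(y = strata, p = 0.75, list = FALSE)
training_set <- example_df[in_train, ]
test_set <- example_df[-in_train, ]

# Compare the sex proportions and age summaries of the two sets
prop.table(table(training_set$sex))
prop.table(table(test_set$sex))
summary(training_set$age)
summary(test_set$age)
```

Finer age bins track the age distribution more closely, but very small strata leave little room for a clean 75/25 split, so the number of bins is a trade-off.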

【Comments】:
