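【Question title】: Scaling not all numeric columns in Train and Test data sets of a mixed data frame
【Posted】: 2021-09-27 07:32:11
【Question】:

The following code scales the training and test sets. Because Col6 and Col7 must not be scaled, they are removed from the original data before the training and test sets are scaled: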

library(tidyverse)

Data_Frame <- data.frame(Col1 = c("A1", "A1", "A1", "A2", "A2", "A2", "A3", "A3", "A3"),
                         Col2 = c("2011-03-11", "2014-08-21", "2016-01-17", "2017-06-30", "2018-07-11", "2018-11-28", "2019-09-04", "2020-02-29", "2020-07-12"),
                         Col3 = c("2018-10-22", "2019-05-24", "2020-12-25", "2018-10-12", "2019-09-24", "2020-12-19", "2018-10-22", "2019-06-14", "2020-12-20"),
                         Col4 = c(4, 12, 2, 1, 4, 4, 75, 4, 44),
                         Col5 = c(7.81, 6.45, 3, 1, 3, 2, 5, 1, 2),
                         Col6 = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
                         Col7 = c(2, 2, 2, 2, 2, 2, 2, 2, 2),
                         Col8 = c(7.77, 6, 8.4, -11.23, 3.5, 7.2, 15, 100, 22.22))

# randomly split data in r
sample_size = floor(0.8*nrow(Data_Frame))
set.seed(777)
picked = sample(seq_len(nrow(Data_Frame)),size = sample_size)
Train_Set = Data_Frame[picked,]
Test_Set = Data_Frame[-picked,]

# Remove columns Col6 and Col7, which will not be scaled
Train <- Train_Set %>% dplyr::select(- c(Col6, Col7))
Test <- Test_Set %>% dplyr::select(- c(Col6, Col7))

# Scale Train, collect mean and sd to scale in Test
Train_Scale <- Train %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)
num_cols <- names(which(sapply(Train,is.numeric)))
scale_params <- attributes(scale(Train[,num_cols]))[c("scaled:center","scaled:scale")]

# Scale Test with the scales of Train
Test_Scale <- Test
Test_Scale[,num_cols] = scale(Test_Scale[,num_cols],center=scale_params[[1]],scale=scale_params[[2]]) 

I tried:

varnames <- c('Col6', 'Col7')
index <- names(Train_Set) %in% varnames
Train_Scale_Check <- Train_Set[, !index] %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector) 

This works, but it drops Col6 and Col7 from the data frame.

Also,

Train_Scale_Check <- Train_Set %>% dplyr::mutate_if(is.numeric, !index, ~scale(.) %>% as.vector)

throws the following error:

Error: expecting a one sided formula, a function, or a function name.
Run `rlang::last_error()` to see where the error occurred.

rlang::last_error()
<error/rlang_error>
expecting a one sided formula, a function, or a function name.
Backtrace:
 1. dplyr::mutate_if(...)
 2. dplyr:::manip_if(...)
 3. dplyr:::as_fun_list(.funs, .env, ..., .caller = .caller)
 4. dplyr:::map(...)
 5. base::lapply(.x, .f, ...)
 6. dplyr:::FUN(X[[i]], ...)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
expecting a one sided formula, a function, or a function name.
Backtrace:
    x
 1. \-dplyr::mutate_if(...)
 2.   \-dplyr:::manip_if(...)
 3.     \-dplyr:::as_fun_list(.funs, .env, ..., .caller = .caller)
 4.       \-dplyr:::map(...)
 5.         \-base::lapply(.x, .f, ...)
 6.           \-dplyr:::FUN(X[[i]], ...)

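Is there a simple way to keep Col6 and Col7 in the Train_Set and Test_Set data sets without scaling them? The long-winded way would be to extract Col6 and Col7 into a separate data frame, run the code at the top, and finally cbind the Col6 and Col7 data frame back on.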

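【Comments】:

  • You don't seem to define index anywhere.
  • Sorry, I forgot to add the two lines of code above Train_Scale_Check. I have updated the question now.
  • This thread does exactly what I want, except that all the variables have to be listed explicitly, instead of scaling only the numeric variables while skipping the factor variables and the ones that should stay unscaled: community.rstudio.com/t/…
  • Are you after Train_Set %>% mutate(across(where(is.numeric) & -c(Col6, Col7), scale))?
  • Thanks, that works too and keeps the unscaled columns.

Tags: r dataframe dplyr standardized data-preprocessing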


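【Solution 1】:

The following solved the problem (thanks to @27 φ 9 for the suggestion).

Scale the training set on the desired columns only (skipping Col6 and Col7):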

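library(dplyr)

# Columns to keep in the data frame but leave unscaled
varnames <- c('Col6', 'Col7')
# Standardize every other numeric column; as.vector() drops the matrix that scale() returns
Train_Scale <- Train_Set %>% mutate(across(where(is.numeric) & !all_of(varnames), ~ scale(.x) %>% as.vector()))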
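
As a quick sanity check of the column selection above, here is a minimal sketch on invented data. It assumes dplyr 1.0.0 or later (for across() and where()); the toy data frame `df` and the names `keep_raw` and `scaled` are made up for illustration and are not part of the original answer.

```r
library(dplyr)

# Toy data: one character column, two numeric columns to scale,
# and one numeric column (Col6) that must stay untouched.
df <- data.frame(
  Col1 = c("A1", "A2", "A3", "A4"),
  Col4 = c(4, 12, 2, 1),
  Col5 = c(7.81, 6.45, 3, 1),
  Col6 = c(1, 1, 1, 1)
)

keep_raw <- c("Col6")  # hypothetical name: columns to leave unscaled

scaled <- df %>%
  mutate(across(where(is.numeric) & !all_of(keep_raw),
                ~ as.vector(scale(.x))))

# Col6 is unchanged; Col4 now has mean 0 and standard deviation 1
# (up to floating-point error).
stopifnot(identical(scaled$Col6, df$Col6))
stopifnot(abs(mean(scaled$Col4)) < 1e-12,
          abs(sd(scaled$Col4) - 1) < 1e-12)
```

Note that scaling a constant column such as Col6 would divide by a standard deviation of 0 and produce NaN, which is one more reason to exclude such columns.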

Collect the scaling parameters (mean and standard deviation) from the training set:

num_cols <- names(which(sapply(subset(Train_Set, select=-c(Col6, Col7)), is.numeric)))
scale_params <- attributes(scale(Train_Set[,num_cols]))[c("scaled:center","scaled:scale")]
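
To make it clearer what `scale_params` holds, the sketch below shows that the two attributes returned by scale() are simply the column means and sample standard deviations. It uses base R only; the toy data frame `train` and the names `params`, `centers` and `scales` are invented for illustration.

```r
# Toy training data (hypothetical values)
train <- data.frame(Col4 = c(4, 12, 2, 1), Col5 = c(7.81, 6.45, 3, 1))

# scale() records what it subtracted and what it divided by as attributes
params <- attributes(scale(train))[c("scaled:center", "scaled:scale")]

# The same numbers computed by hand
centers <- sapply(train, mean)
scales  <- sapply(train, sd)

stopifnot(isTRUE(all.equal(params[["scaled:center"]], centers)))
stopifnot(isTRUE(all.equal(params[["scaled:scale"]], scales)))
```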

Apply the training-set scales to the test data:

Test_Scale <- Test_Set
Test_Scale[,num_cols] = scale(Test_Scale[,num_cols],center=scale_params[[1]],scale=scale_params[[2]])
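
The steps above could also be bundled into two small helpers, so that exactly the same centers and scales are applied to both sets. This is only a sketch in base R under stated assumptions: `fit_scaler()` and `apply_scaler()` are hypothetical names, not functions from any package, and the toy data are invented.

```r
# Learn centers and scales from the training data for all numeric columns
# except those listed in `exclude`.
fit_scaler <- function(train, exclude = character()) {
  is_num <- vapply(train, is.numeric, logical(1))
  cols <- setdiff(names(train)[is_num], exclude)
  list(cols   = cols,
       center = vapply(train[cols], mean, numeric(1)),
       scale  = vapply(train[cols], sd, numeric(1)))
}

# Apply previously learned parameters to a data frame with the same columns.
apply_scaler <- function(data, scaler) {
  data[scaler$cols] <- scale(data[scaler$cols],
                             center = scaler$center,
                             scale  = scaler$scale)
  data
}

train <- data.frame(Col1 = c("A1", "A2", "A3"), Col4 = c(1, 2, 3), Col6 = c(1, 1, 1))
test  <- data.frame(Col1 = "A4", Col4 = 4, Col6 = 1)

scaler       <- fit_scaler(train, exclude = "Col6")
train_scaled <- apply_scaler(train, scaler)
test_scaled  <- apply_scaler(test, scaler)

# Col4 has training mean 2 and sd 1, so the test value 4 maps to (4 - 2) / 1 = 2
stopifnot(isTRUE(all.equal(test_scaled$Col4, 2)))
stopifnot(identical(test_scaled$Col6, test$Col6))
```

Fitting on the training set only and reusing the stored parameters keeps information from the test set out of the scaling step.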

【Comments】:
