It's important to understand what scale() is doing to your data. I've pulled an example from https://stackoverflow.com/a/20256272/11167644 to explain:
set.seed(1)
x <- runif(6)
x
#> [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819 0.8983897
(x - mean(x)) / sd(x)
#> [1] -0.8717643 -0.5287394 0.1170895 1.1960620 -1.0771210 1.1644732
scale(x)[1:6]
#> [1] -0.8717643 -0.5287394 0.1170895 1.1960620 -1.0771210 1.1644732
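As a small sketch building on the same x (the names z and x_back are only illustrative), it may also help to see that scale() returns a one-column matrix and stores the centering and scaling values it used as attributes, so the original values can be recovered:

```r
set.seed(1)
x <- runif(6)
z <- scale(x)

# scale() returns a matrix (here 6 rows, 1 column), not a plain vector
dim(z)

# the values used for centering and scaling are kept as attributes
center <- attr(z, "scaled:center")  # the same value as mean(x)
spread <- attr(z, "scaled:scale")   # the same value as sd(x)

# undo the transformation: multiply by the sd, then add the mean back
x_back <- as.numeric(z) * spread + center
all.equal(x_back, x)
```

This is also why the example above indexes the result with [1:6]: it drops the matrix structure and the attributes so only the numbers are printed.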
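Your data is being scaled and centered around zero: each column has its mean subtracted and is then divided by its standard deviation. We can verify this further by looking at the summary() of the unscaled and scaled datasets: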
data("USArrests")
df <- USArrests
summary(df)
#>      Murder          Assault         UrbanPop          Rape      
#>  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
#>  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
#>  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
#>  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
#>  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
#>  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
summary(scale(df))
#>      Murder           Assault           UrbanPop             Rape        
#>  Min.   :-1.6044   Min.   :-1.5090   Min.   :-2.31714   Min.   :-1.4874  
#>  1st Qu.:-0.8525   1st Qu.:-0.7411   1st Qu.:-0.76271   1st Qu.:-0.6574  
#>  Median :-0.1235   Median :-0.1411   Median : 0.03178   Median :-0.1209  
#>  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
#>  3rd Qu.: 0.7949   3rd Qu.: 0.9388   3rd Qu.: 0.84354   3rd Qu.: 0.5277  
#>  Max.   : 2.2069   Max.   : 1.9948   Max.   : 1.75892   Max.   : 2.6444
Note again that every column now has a mean of zero - this is why the values in each scaled column sum to zero.
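If you want to check this directly rather than reading it off summary(), something along these lines should work (a minimal sketch; the name scaled is just illustrative, and the comparisons use all.equal() because the results are zero only up to floating-point error):

```r
data("USArrests")
scaled <- scale(USArrests)

# column means are zero up to floating-point error ...
all.equal(unname(colMeans(scaled)), rep(0, ncol(scaled)))

# ... and so are the column sums, since sum = mean * number of rows
all.equal(unname(colSums(scaled)), rep(0, ncol(scaled)))

# each column also ends up with a standard deviation of 1
col_sds <- apply(scaled, 2, sd)
col_sds
```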
Finally, we can visually compare the scaled and unscaled data with some histograms:
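library(tidyverse)
df %>%
  select(Murder) %>%
  # scale() returns a one-column matrix; as.numeric() flattens it to a plain vector
  mutate(Scaled_Murder = as.numeric(scale(Murder))) %>%
  # stack both columns into name/value pairs so they can share one plot
  pivot_longer(everything()) %>%
  ggplot(aes(value, fill = name)) +
  geom_histogram(alpha = 0.75, position = "identity", bins = 20)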
Created on 2021-03-02 by the reprex package (v0.3.0)