R: Assigning Data to their Percentiles

【Question title】: R: Assigning Data to their Percentiles
【Posted】: 2022-01-26 20:45:07
【Question】:

I am working with the R programming language. Suppose I have the following data frame:

var_1 = rnorm(100,10,10)
var_2 = rnorm(100,10,10)
var_3 = rnorm(100,10,10)

d = data.frame(var_1, var_2, var_3)

head(d)


      var_1     var_2      var_3
1 14.251923 14.877801  22.636207
2  7.325137  8.513718  21.021522
3  3.400001 -3.400397  11.274797
4 16.400597  8.623980   9.366115
5  7.065583 13.155570  17.891432
6 21.297912  4.341385 -11.337330

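My question: for every element of every variable, I want to replace the element with the percentile it belongs to (e.g. 5th, 10th, 15th, and so on).

For example: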

a = quantile(d$var_1, c(0.05, 0.10, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1))
b = quantile(d$var_2, c(0.05, 0.10, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1))
c = quantile(d$var_3, c(0.05, 0.10, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1))

new = data.frame(a,b,c)

              a           b          c
5%   -0.8806901 -7.40560488 -4.7353920
10%   0.3595086 -3.77910527 -0.6874766
15%   1.1201300 -2.91946322  0.9584040
20%   3.0581928  0.05127097  2.1457693
25%   5.0901641  1.91719913  4.6997966
30%   7.0056228  2.56215345  6.2691894
35%   7.6089831  3.58688942  7.1900823
40%   8.9853805  5.00957881  7.8488446
45%   9.9264540  5.73653135  8.6135093
50%  10.2235212  7.43425669  9.6063344
55%  11.5707533  8.54160196 10.9239040
60%  13.2422940  9.65006232 11.7036647
65%  15.1076889 11.07081528 13.2440004
70%  16.5354881 12.38804922 15.2585324
75%  17.9336020 13.16121940 17.6656208
80%  19.5312682 15.31472178 18.4820207
85%  21.9264905 17.99689941 19.3347983
90%  24.4511364 20.47478783 22.0647173
95%  26.6820271 25.27082341 24.4473033
100% 41.4419744 39.75848302 34.5105183

Now, whenever a value falls within one of these percentile ranges, I want to make the following replacement:

If d$var_1 < -0.8806901, then d$var_1 == as.factor("5th percentile")
If d$var_1 > -0.8806901 and d$var_1 < 0.3595086, then d$var_1 == as.factor("10th percentile")

...

If d$var_1 > 15.1076889 and d$var_1 < 16.5354881, then d$var_1 == as.factor("70th percentile")

and so on

If d$var_2 < -7.40560488, then d$var_2 == as.factor("5th percentile")

and so on

If d$var_3 < -4.7353920, then d$var_3 == as.factor("5th percentile")

and so on

Can someone show me how to do this?

【Comments】:

Tags: r data-manipulation percentage quantile

【Solution 1】:

This may be what you want:

library(dplyr)  # provides ntile()
apply(d, 2, function(x) paste0( ntile(x, n = 20L) / 20 * 100, "th percentile" ))

Output

       var_1              var_2              var_3             
  [1,] "60th percentile"  "100th percentile" "25th percentile" 
  [2,] "80th percentile"  "60th percentile"  "100th percentile"
  [3,] "45th percentile"  "90th percentile"  "75th percentile" 
  [4,] "70th percentile"  "85th percentile"  "35th percentile" 
  [5,] "30th percentile"  "5th percentile"   "55th percentile" 
  ...
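
To see what ntile() contributes to the call above, a small sketch on a made-up four-element vector may help (the vector and the two-bucket split are assumptions chosen for illustration, not part of the answer). ntile() ranks the values and splits the ranks into n buckets of roughly equal size, so the label is derived from each value's rank and not from an explicit comparison against quantile values.

```r
library(dplyr)

# Hypothetical toy vector, not taken from the question
x <- c(5, 1, 9, 3)

# Two rank-based buckets: the two smallest values get 1, the two largest get 2
bucket <- ntile(x, n = 2L)
bucket
# 2 1 2 1

# Same labelling scheme as in the answer, with 2 buckets instead of 20
labels <- paste0(bucket / 2 * 100, "th percentile")
labels
```

With n = 20L, as in the answer, each bucket covers about 5% of the observations, which is why the bucket number divided by 20 and multiplied by 100 reads as a percentile label.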

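Addendum: to replace only selected columns (here var_1 and var_3) with data.table:

library(data.table)
library(dplyr)  # provides ntile()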
cols = c("var_1", "var_3")
setDT(d)[, (cols) := lapply(.SD, function(x) paste0( ntile(x, n = 20L) / 20 * 100, "th percentile")), .SDcols = cols]
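
If data.table is not otherwise needed, the same column-wise replacement can probably be done in base R as well. The following is only a sketch: the seed, the regenerated data frame and the helper name to_percentile_label are assumptions added for illustration, while the labelling expression is the one from the answer.

```r
library(dplyr)  # only needed for ntile()

set.seed(1)  # arbitrary seed so the sketch is reproducible
d <- data.frame(var_1 = rnorm(100, 10, 10),
                var_2 = rnorm(100, 10, 10),
                var_3 = rnorm(100, 10, 10))

# Hypothetical helper wrapping the labelling expression from the answer
to_percentile_label <- function(x) {
  paste0(ntile(x, n = 20L) / 20 * 100, "th percentile")
}

cols <- c("var_1", "var_3")
# Assigning into d[cols] replaces only those columns and keeps d a data frame
d[cols] <- lapply(d[cols], to_percentile_label)

str(d)
```

Here var_2 stays numeric, while var_1 and var_3 become character columns holding the labels.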

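【Comments】:

@PeaceWang: Thank you very much for your answer! I made a small change to your code: final_data = data.frame(apply(d, 2, function(x) paste0( ntile(x, n = 20L) / 20 * 100, "th percentile" )))
My only question is: can this be done directly from the "new" data frame I created?
For example, what if the quantiles are unevenly spaced? E.g.: a = quantile(d$var_1, c(0.05, 0.16, 0.23, 0.61, 0.82, 0.98)); b = quantile(d$var_2, c(0.09, 0.26, 0.33, 0.55, 0.91)); c = quantile(d$var_3, c(0.19, 0.51, 0.87))
ntile(x, n = 20L) provides a rough ranking, which avoids the explicit comparisons. For unevenly spaced quantiles, you would need to check whether the correspondence with ntile still holds.
Thanks for your reply! Another question: suppose I only want to replace columns 1 and 3. Is something like this possible? final_data = data.frame(apply(d[, c(1,3)], 2, function(x) paste0( ntile(x, n = 20L) / 20 * 100, "th percentile" )))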
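
For the unevenly spaced quantiles raised in the comments, one possible approach (not the answerer's method) is base R's cut() with the quantile values as break points. This is a minimal sketch: the seed, the regenerated data and the helper name label_by_quantile are assumptions, and so is the choice to label values above the last cut-off as "100th percentile".

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
d <- data.frame(var_1 = rnorm(100, 10, 10),
                var_2 = rnorm(100, 10, 10),
                var_3 = rnorm(100, 10, 10))

# Hypothetical helper: label each value by the upper bound of its quantile bin.
# Assumes probs is increasing, does not contain 1, and yields distinct cut-offs.
label_by_quantile <- function(x, probs) {
  breaks <- c(-Inf, quantile(x, probs, names = FALSE), Inf)
  # Assumption: values above the last cut-off are labelled "100th percentile"
  labels <- paste0(round(c(probs, 1) * 100), "th percentile")
  cut(x, breaks = breaks, labels = labels)  # bins are (lower, upper]
}

# Uneven probabilities from the comment, one vector per column
probs <- list(var_1 = c(0.05, 0.16, 0.23, 0.61, 0.82, 0.98),
              var_2 = c(0.09, 0.26, 0.33, 0.55, 0.91),
              var_3 = c(0.19, 0.51, 0.87))

# Map() pairs each column of d with its own probability vector
final_data <- as.data.frame(Map(label_by_quantile, d, probs))
table(final_data$var_1)
```

Each column comes back as a factor, which matches the as.factor() labels the question asks for. To restrict this to some columns, pass only those columns and their probability vectors to Map().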

【Solution 2】:

You can also use purrr here (thanks to @PeaceWang for the function).

library(tidyverse)

output <- purrr::map(d, function(x)
  paste0(ntile(x, n = 20L) / 20 * 100, "th percentile")) %>%
  as.data.frame()

Output

head(output, 10)

               var_1            var_2            var_3
1    40th percentile  15th percentile  85th percentile
2     5th percentile  60th percentile  70th percentile
3    65th percentile  60th percentile  65th percentile
4    60th percentile  10th percentile  75th percentile
5    15th percentile  40th percentile   5th percentile
6    10th percentile  35th percentile  85th percentile
7    30th percentile  45th percentile  95th percentile
8    85th percentile  25th percentile  45th percentile
9    75th percentile  90th percentile  80th percentile
10   65th percentile 100th percentile  10th percentile

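【Comments】:

@Andrew: Thank you very much for your answer! I used "even and consistent" percentiles to make my question easier to answer, but what if the percentiles are unevenly spaced?
For example, if: a = quantile(d$var_1, c(0.05, 0.16, 0.23, 0.61, 0.82, 0.98)); b = quantile(d$var_2, c(0.09, 0.26, 0.33, 0.55, 0.91)); c = quantile(d$var_3, c(0.19, 0.51, 0.87))
Is there a "matching" function in R that can replace every element in a column of one data frame with an element from a predefined list in another data frame?
I realized that you can choose which columns to replace! Here, the replacement is applied only to the first and third columns: output <- ... %>% as.data.frame()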
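
On the "matching" question, base R's findInterval() can look up each value in a vector of precomputed cut-offs, such as the columns of the "new" data frame from the question. The sketch below is one way this might look and is not taken from either answer; the seed, the regenerated data and the helper name lookup_percentile are assumptions.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible
d <- data.frame(var_1 = rnorm(100, 10, 10),
                var_2 = rnorm(100, 10, 10),
                var_3 = rnorm(100, 10, 10))

# Cut-off table in the spirit of the question's "new": one column per variable
probs <- (1:20) / 20
new <- as.data.frame(lapply(d, quantile, probs = probs, names = FALSE))

# Hypothetical helper: find the first cut-off that is >= each value
lookup_percentile <- function(x, cutoffs) {
  # left.open = TRUE makes the intervals (lower, upper]
  idx <- findInterval(x, cutoffs, left.open = TRUE) + 1L
  idx <- pmin(idx, length(cutoffs))  # guard for values above the last cut-off
  paste0(round(probs[idx] * 100), "th percentile")
}

# Map() pairs each column of d with the matching column of cut-offs
output <- as.data.frame(Map(lookup_percentile, d, new))
head(output)
```

Unlike ntile(), this compares values against the stored cut-offs, so the same lookup table could in principle be reused on data that was not used to compute it.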

【Solution 3】:

Here is an almost one-liner using my santoku package:

library(santoku)
d[] <- apply(d, 2, chop_quantiles, probs = 0:100/100, 
               labels = lbl_endpoint(fmt = "%.2f"))
d[] <- apply(d, 2, as.numeric)

The d[] on the left-hand side is a trick to keep d as a data frame.
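
The effect of the d[] trick can be seen on a tiny made-up data frame (the data and the doubling function are placeholders for illustration and have nothing to do with santoku):

```r
d <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# apply() returns a matrix, so a plain `d <- apply(...)` would turn d into one
m <- apply(d, 2, function(x) x * 2)
is.matrix(m)      # TRUE

# Assigning into d[] fills the existing columns and keeps the data.frame class
d[] <- apply(d, 2, function(x) x * 2)
is.data.frame(d)  # TRUE
d$a               # 2 4 6
```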

【Comments】: