Generate a distribution given percentile ranks: answers

【Question title】: Generate distribution given percentile ranks
【Posted】: 2013-01-27 12:34:50
【Question】:

I want to generate a distribution in R, given the following scores and percentile ranks.

x <- 1:10
PercRank <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)

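PercRank = 1, for example, says that 1% of the data have a value/score <= 1 (the first value of x). Similarly, PercRank = 7 says that 7% of the data have a value/score <= 2, and so on.

I don't know how to find the underlying distribution. I would be glad of some guidance on how to obtain the pdf of the underlying distribution from just this much information.

【Comments】:

What have you tried?
@Arun: The answer you provided is clearly for a different question from this one. The values you supplied do not have support in the range 1:10.
@Arun: The question as posted looks more accurate.

Tags: r statistics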

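【Solution 1】:

From Wikipedia:

The percentile rank of a score is the percentage of scores in its frequency distribution that are equal to or lower than it.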
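
To make the definition concrete, here is a minimal sketch on made-up data. The helper name perc.rank and the scores vector are illustrative assumptions, not part of the original post.

```r
# made-up scores, only to illustrate the definition
scores <- c(1, 2, 2, 3, 5)

# hypothetical helper: percentage of values that are <= s
perc.rank <- function(v, s) 100 * mean(v <= s)

perc.rank(scores, 2)  # 3 of the 5 scores are <= 2
perc.rank(scores, 5)  # every score is <= 5
```

The question's PercRank vector is the result of this kind of calculation at each score in x, with the underlying data unknown.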

To illustrate this, let's create a distribution, say a normal distribution with mean=2 and sd=2, so that we can test our code against it later.

# 1000 samples from normal(2,2)
x1 <- rnorm(1000, mean=2, sd=2)

Now, let's take the percentile ranks you mentioned in your post and divide them by 100, so that they represent cumulative probabilities.

cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)/100

What are the values (scores) corresponding to these percentiles?

# generating values similar to your x.
x <- c(t(quantile(x1, cum.p)))
> x
 [1] -2.1870396 -1.4707273 -1.1535935 -0.8265444 -0.2888791  
         0.2781699  0.5893503  0.8396868  1.4222489  2.1519328

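This means that 1% of the data is less than -2.18, 7% of the data is less than -1.47, and so on. Now we have x and cum.p (the equivalent of your PercRank). Let's forget x1 and the fact that this is supposed to be a normal distribution. To find out what distribution it might be, let's get the actual probabilities from the cumulative probabilities using diff, which takes the difference between the nth and (n-1)th elements.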

prob <- c( cum.p[1], diff(cum.p), .01)
> prob
# [1] 0.01 0.06 0.05 0.11 0.18 0.21 0.11 0.07 0.12 0.07 0.01
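
As a quick sanity check on this step (a sketch, not part of the original answer): the trailing .01 is the mass left above the largest score, 1 - 0.99, so the eleven interval probabilities should sum to 1, and accumulating them should give back cum.p.

```r
cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99) / 100

# interval probabilities: below x[1], between consecutive scores, above x[10]
prob <- c(cum.p[1], diff(cum.p), .01)

length(prob)         # 11 intervals for 10 scores
sum(prob)            # total probability
cumsum(prob)[1:10]   # accumulating the pieces recovers cum.p
```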

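Now, all we have to do is generate a sequence of 100 points (any number would do) within each interval of x (x[1]:x[2], x[2]:x[3], ...), and then sample as many points as you need (say, 10000) from this large pool, using the probabilities computed above.

This can be done as follows: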

freq <- 10000 # final output size that we want

# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 5) 
fin  <- abs(max(x)) + 5

ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # sequence of each pair

s <- sapply(2:length(ival), function(i) {
    seq(ival[i-1], ival[i], length.out=len)
})
# sample from s, total of 10000 values with probabilities calculated above
out <- sample(s, freq, prob=rep(prob, each=len), replace = T)

Now we have 10000 samples from the distribution. Let's see what it looks like. It should resemble a normal distribution with mean = 2 and sd = 2.

> hist(out)

> c(mean(out), sd(out))
# [1] 1.954834 2.170683

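This is a normal distribution (judging from the histogram) with mean = 1.95 and sd = 2.17 (~ 2).

Note: Some of what I have explained may be roundabout, and/or the code may or may not work for some other distributions. The purpose of this post is just to explain the concept with a simple example.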
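
A less roundabout variant of the same idea, offered only as a sketch: instead of building a grid and resampling it, draw uniform random numbers and map them through a linearly interpolated inverse CDF with approx(). This swaps in inverse-transform sampling for the grid resampling above; it makes the same piecewise-uniform assumption between the known points. The names lower, upper, knots.p, knots.x and out2 are illustrative, and the tail limits are assumptions, since the percentile ranks say nothing about where the lowest and highest 1% lie.

```r
set.seed(1)  # fixed seed only so the sketch is reproducible

cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99) / 100
x <- 1:10

# assumed (hypothetical) limits for the two tails
lower <- min(x) - 1
upper <- max(x) + 1

knots.p <- c(0, cum.p, 1)       # cumulative probabilities at the knots
knots.x <- c(lower, x, upper)   # values at the knots

# inverse-transform sampling: uniform draws mapped through the
# piecewise-linear quantile function defined by the knots
u <- runif(10000)
out2 <- approx(x = knots.p, y = knots.x, xout = u)$y

round(quantile(out2, cum.p), 1)  # should land close to x = 1:10
```

The result is continuous rather than restricted to 100 grid points per interval, and no probability vector has to be repeated to match the grid.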

Edit: To address @Dwin's point, I ran the same code with x = 1:10, as in the OP's question, simply replacing the value of x.

cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)/100
prob <- c( cum.p[1], diff(cum.p), .01)
x <- 1:10

freq <- 10000 # final output size that we want

# Extreme values beyond x (to sample)
init <- -(abs(min(x)) + 1) 
fin  <- abs(max(x)) + 1

ival <- c(init, x, fin) # generate the sequence to take pairs from
len <- 100 # sequence of each pair

s <- sapply(2:length(ival), function(i) {
    seq(ival[i-1], ival[i], length.out=len)
})
# sample from s, total of 10000 values with probabilities calculated above
out <- sample(s, freq, prob=rep(prob, each=len), replace = T)

> quantile(out, cum.p) # ~ => x = 1:10
# 1%     7%    12%    23%    41%    62%    73%    80%    92%    99% 
# 0.878  1.989  2.989  4.020  5.010  6.030  7.030  8.020  9.050 10.010 

> hist(out)
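
Since the question asks for the pdf: under the piecewise-uniform assumption that the sampling code makes, the implied density on each interval is that interval's probability divided by its width. A minimal sketch follows; the tail limits min(x) - 1 and max(x) + 1 and the name pdf.step are assumptions for illustration, and a different choice of tail limits changes the density in the two outer intervals.

```r
cum.p <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99) / 100
x <- 1:10

# interval boundaries, with assumed (hypothetical) tail limits
ival <- c(min(x) - 1, x, max(x) + 1)

prob <- diff(c(0, cum.p, 1))   # probability of each interval
dens <- prob / diff(ival)      # density height = probability / width

# step-function pdf: 0 outside the boundaries, dens[i] on the i-th interval
pdf.step <- stepfun(ival, c(0, dens, 0))

pdf.step(c(-1, 4.5, 5.5, 12))
sum(dens * diff(ival))         # the heights integrate to 1
```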

【Comments】:

@Arun, the minus sign in init <- -(abs(min(x)) + 5) is probably wrong. It does not seem to work for samples whose values are all positive. Couldn't it just be init <- min(x) - 5?
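
The difference between the two formulas can be seen on a couple of made-up inputs. This is only a sketch; lower.answer, lower.proposed and both test vectors are illustrative names and values.

```r
lower.answer   <- function(x) -(abs(min(x)) + 5)  # formula used in the answer
lower.proposed <- function(x) min(x) - 5          # formula proposed in this comment

x.neg <- c(-2.19, 0.28, 2.15)  # made-up values, minimum below zero
x.pos <- 101:110               # made-up values, all positive

# with a negative minimum the two formulas agree
c(lower.answer(x.neg), lower.proposed(x.neg))

# with all-positive data the answer's formula lands far below the data,
# so the first interval becomes very wide
c(lower.answer(x.pos), lower.proposed(x.pos))
```

With x.pos the answer's formula gives -106 while the proposed one gives 96, so the lowest 1% of samples would be spread from -106 up to 101 instead of from 96 up to 101.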

【Solution 2】:

I think you want the ecdf function, which the ?quantile help page describes as the inverse of the quantile function.

# construct your vector containing the data
PercRank <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)

# construct an empirical cumulative distribution function
# which is really just the `inverse` of `quantile`
Fn <- ( ecdf( PercRank ) )
# note that the `ecdf` function returns a function itself.

# calculate what percent of `PercRank` is below these integers..
Fn( 0 )
Fn( 1 )
Fn( 2 )
Fn( 3 )
Fn( 6 )
Fn( 7 )
Fn( 8 )


# re-construct your `x` vector using PercRank
Fn( PercRank ) * 10
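
For comparison, a step CDF can also be built from the question's (x, PercRank) pairs rather than from PercRank alone. The sketch below assumes the percentile ranks are read as the CDF height reached at each score; cdf is an illustrative name, and stepfun() is used because it yields a right-continuous step function by default.

```r
x <- 1:10
PercRank <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)

# Fn from the answer: an ECDF of the PercRank values themselves
Fn <- ecdf(PercRank)
Fn(PercRank) * 10   # recovers 1:10 (up to floating point)

# illustrative alternative: a right-continuous step CDF over the scores,
# with PercRank / 100 as the height reached at each score
cdf <- stepfun(x, c(0, PercRank / 100))
cdf(c(0.5, 1, 2.5, 10))
```

Here cdf(1) is 0.01 and cdf(2.5) is 0.07, matching the reading "7% of the data have a score <= 2" from the question.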

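【Comments】:

@Arun It is a step function. x <- seq(0, 100, by=.1); plot(x, Fn(x))
Thanks. How would you recreate the underlying data and plot a histogram?
@maycobra hist(Fn(x)) I guess, but I don't fully understand what you are trying to do. If Arun's answer does not solve your problem, please elaborate on the output you want :)

【Solution 3】:

This generates a dataset with the characteristics you specified. If you want more "randomness", you could subtract a random number within the percentile's range from the rep result inside the anonymous function: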

> mapply(function(x, y) rep(x, each = y), x, diff(c(PercRank, 100)))
[[1]]
[1] 1 1 1 1 1 1

[[2]]
[1] 2 2 2 2 2

[[3]]
 [1] 3 3 3 3 3 3 3 3 3 3 3

[[4]]
 [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

[[5]]
 [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5

[[6]]
 [1] 6 6 6 6 6 6 6 6 6 6 6

[[7]]
[1] 7 7 7 7 7 7 7

[[8]]
 [1] 8 8 8 8 8 8 8 8 8 8 8 8

[[9]]
[1] 9 9 9 9 9 9 9

[[10]]
[1] 10
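
Note that diff(c(PercRank, 100)) gives score 1 six copies, whereas PercRank = 1 in the question says only 1% of the data is <= 1. A variation that reproduces the stated ranks exactly is sketched below. It assumes a dataset of 100 values and uses a placeholder score top for the 1% of data lying above 10; top, counts and dat are illustrative names, and the value 11 is arbitrary.

```r
x <- 1:10
PercRank <- c(1, 7, 12, 23, 41, 62, 73, 80, 92, 99)

top <- 11  # hypothetical placeholder for the 1% of data above the largest score

# copies per score: the rank at each score minus the rank at the previous one
counts <- diff(c(0, PercRank, 100))
dat <- rep(c(x, top), times = counts)

length(dat)                             # 100 values in total
sapply(x, function(s) sum(dat <= s))    # count out of 100, i.e. the percentile rank
```

Because dat has exactly 100 values, the count of values <= s is the percentile rank of s, and it matches PercRank at every score.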

【Comments】:

Could the downvoter explain what is incorrect?