Randomize data between two columns in R: answers

【Question title】: Randomize data between two columns in R
【Posted】: 2017-02-10 22:31:50
【Question】:

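I have searched for an answer or solution to this task without success so far, so I apologize if this is redundant.

I want to randomize data between two columns. This is to simulate species misidentification in vegetation field data, so I would also like to assign some probability of misidentification between the two columns. I imagine there is a way to do this with sample or the "permute" package.

I'll pick some ready-made data as an example.

library(vegan)
data(dune)

If you enter head(dune), you can see that it is a data frame with sites as rows and species as columns. For convenience, let's assume some field technicians might confuse Poa pratensis and Poa trivialis.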

poa = data.frame(Poaprat=dune$Poaprat,Poatriv=dune$Poatriv)
head(poa)
           Poaprat      Poatriv
1             4            2
2             4            7
3             5            6
4             4            5
5             2            6
6             3            4

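What is the best way to randomize the values between these two columns (transferring values from one to the other and/or adding them together when both are present)? The resulting data might look like this: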

           Poaprat      Poatriv
1             6            0
2             4            7
3             5            6
4             5            4
5             0            7
6             4            3

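P.S.

To the ecologists who are cringing: please note that I made this example to save time, and I know that relative cover values are not additive. I apologize for needing to do this.

*** Edit: To be clearer, the type of data being randomized will be cover estimates (so values between 0% and 100%). The data in this quick example are relative cover estimates, not counts.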

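【Comments】:

Randomize according to what distribution/weights? If an unweighted normal distribution is fine, then why not just take unique() of the combined columns plus the sums of the combined columns and sample() from that? Otherwise, mapply() or purrr::map2() across the columns and randomly add or change the values that way?
Presumably, if someone can't tell the two species apart well, the distribution would be uniformly random rather than normal. There is also no reason to believe it is symmetric. All records might go to one species, or the field crew member might choose randomly between the two (i.e. based on the wrong characters). Sorry, I should have been clearer.

Tags: r permutation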

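【Solution 1】:

You'll still need to replace the actual columns with the new ones, and there may be a more elegant way to do this (it's late in EDT land). You'll also have to decide what, other than the normal distribution, you want to use (i.e. how you'll replace sample()), but this gets you your swaps and additions: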

library(vegan)
library(purrr)

data(dune)

poa <- data.frame(
  Poaprat=dune$Poaprat,
  Poatriv=dune$Poatriv
)

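# map2_df() calls the function once per row, so x and y are single values here
# and the inner loop runs exactly once per call
map2_df(poa$Poaprat, poa$Poatriv, function(x, y) {
  for (i in seq_along(x)) {
    # pick one of the three actions with equal probability
    what <- sample(c("left", "right", "swap"), 1)
    switch(
      what,
      left={
        # pool both values into the first column
        x[i] <- x[i] + y[i]
        y[i] <- 0
      },
      right={
        # pool both values into the second column
        y[i] <- x[i] + y[i]
        x[i] <- 0
      },
      swap={
        # exchange the two values
        tmp <- y[i]
        y[i] <- x[i]
        x[i] <- tmp
      }
    )
  }
  # the one-row data frames are row-bound into the final result
  data.frame(Poaprat=x, Poatriv=y)
})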
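
The answer notes that the real columns still have to be replaced and that a tidier approach may exist. Here is a minimal base-R sketch of one way to do both. It assumes a hand-typed copy of the six rows shown in the question, and the helper name `shuffle_pair` and its `actions` argument are hypothetical, introduced only for this illustration:

```r
# First six rows of the question's example, typed in by hand
poa <- data.frame(
  Poaprat = c(4, 4, 5, 4, 2, 3),
  Poatriv = c(2, 7, 6, 5, 6, 4)
)

# Hypothetical helper: draw one action per row, then apply it column-wise
shuffle_pair <- function(df, a, b, actions = c("left", "right", "swap")) {
  what  <- sample(actions, nrow(df), replace = TRUE)  # equal weights by default
  x     <- df[[a]]
  y     <- df[[b]]
  total <- x + y

  # "left" pools everything into a, "right" into b, anything else swaps
  df[[a]] <- ifelse(what == "left", total, ifelse(what == "right", 0, y))
  df[[b]] <- ifelse(what == "right", total, ifelse(what == "left", 0, x))
  df
}

set.seed(1)  # only so the sketch is reproducible
poa_shuffled <- shuffle_pair(poa, "Poaprat", "Poatriv")
print(poa_shuffled)
```

Every action only moves values between the two columns, so each row's total should be unchanged, which makes for a quick sanity check.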

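【Comments】:

Thanks, this looks very helpful. Forgive me if I'm mistaken, but I was under the impression that sample only supports uniform random sampling. Am I reading too much into the whole function being called "Random Samples and Permutations" and the statement that "only uniform sampling is supported"?
I did mention that you would need to replace it if you wanted a different distribution.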
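
For what it's worth, base R's `sample()` also accepts a `prob` argument giving a weight for each element, so unequal probabilities may not require replacing it at all. A short sketch, with weights made up purely for illustration:

```r
actions <- c("left", "right", "swap")
weights <- c(0.6, 0.1, 0.3)  # illustrative weights, not taken from the thread

set.seed(42)  # only for reproducibility
draws <- sample(actions, 10000, replace = TRUE, prob = weights)

# Observed frequencies should sit close to the requested weights
freq <- table(factor(draws, levels = actions)) / length(draws)
print(round(freq, 2))
```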

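【Solution 2】:

Here is my approach:

Let's define a function that takes a number of samples (n, passed as x in the code) and the probability (p) that each one is mislabeled. The function draws a 1 with probability p and a 0 with probability 1-p, once per sample. The sum of these draws gives how many of the n samples were recorded incorrectly.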

mislabel = function(x, p){
    N_mis = sample(c(1,0), x, replace = T, prob = c(p, 1-p))
    sum(N_mis)
}

With the function defined, apply it to each column and store the results in two new columns:

p_miss = 0.3

poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)

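The final number of specimens recorded for each species is computed by subtracting that species' mislabeled specimens and adding the mislabeled specimens from the other species.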

poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled

Result:

> head(poa)
  Poaprat Poatriv Poaprat_mislabeled Poatriv_mislabeled Poaprat_final Poatriv_final
1       4       2                  0                  0             4             2
2       4       7                  1                  2             5             6
3       5       6                  0                  3             8             3
4       4       5                  1                  2             5             4
5       2       6                  0                  3             5             3
6       3       4                  1                  2             4             3

Complete program:

mislabel = function(x, p){
    N_mis = sample(c(1,0), x, replace = T, prob = c(p, 1-p))
    sum(N_mis)
}


p_miss = 0.3

poa$Poaprat_mislabeled = sapply(poa$Poaprat, mislabel, p_miss)
poa$Poatriv_mislabeled = sapply(poa$Poatriv, mislabel, p_miss)

poa$Poaprat_final = poa$Poaprat - poa$Poaprat_mislabeled + poa$Poatriv_mislabeled
poa$Poatriv_final = poa$Poatriv - poa$Poatriv_mislabeled + poa$Poaprat_mislabeled

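The p_miss variable is the probability of mislabeling either species. You could also use a different value for each species to model an asymmetric chance, i.e. one of them may be easier to mislabel than the other.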
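
The asymmetric case could be sketched as follows. This assumes the values are whole-number counts, uses `rbinom()` in place of summing 0/1 draws from `sample()` (the two have the same distribution), and uses made-up probabilities named `p_prat` and `p_triv`. The data are the six rows shown in the question, typed in by hand:

```r
poa <- data.frame(
  Poaprat = c(4, 4, 5, 4, 2, 3),
  Poatriv = c(2, 7, 6, 5, 6, 4)
)

p_prat <- 0.3  # assumed chance that a Poaprat record is logged as Poatriv
p_triv <- 0.1  # assumed chance that a Poatriv record is logged as Poaprat

set.seed(1)  # only for reproducibility
# One binomial draw per row: how many of that row's records are mislabeled
prat_mis <- rbinom(nrow(poa), size = poa$Poaprat, prob = p_prat)
triv_mis <- rbinom(nrow(poa), size = poa$Poatriv, prob = p_triv)

poa$Poaprat_final <- poa$Poaprat - prat_mis + triv_mis
poa$Poatriv_final <- poa$Poatriv - triv_mis + prat_mis
print(poa)
```

Mislabeled records only move between the two species, so the combined total in each row stays the same.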

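【Comments】:

This looks like an effective way to randomize counts. Unfortunately, these data (and the type I work with) use percent cover estimates, and the data in this example are relative cover scores (I don't know the exact scale; some history on the data: davidzeleny.net/anadat-r/doku.php/en:data:dune). I wasn't clear enough about that. Nice function, though.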

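【Solution 3】:

Having accepted hrbrmstr's answer, I just wanted to check in. I had a little time today, so I went ahead and wrote a function that does this task with some degree of flexibility. It allows multiple species pairs, different probabilities for different species pairs (asymmetric in either direction), and explicitly includes the probability that the values stay the same.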

misID = function(X, species, probs = c(0.1, 0.1, 0, 0.8)) {
  library(purrr)

  X2 = X

  if (!is.matrix(species)) {
    species = matrix(species, ncol = 2, byrow = TRUE)
  }

  if (!is.matrix(probs)) {
    probs = matrix(probs, ncol = 4, byrow = TRUE)
  }

  if (nrow(probs) == 1) {
    probs = matrix(rep(probs[1, ], nrow(species)), ncol = 4, byrow = TRUE)
  }

  for (i in 1:nrow(species)) {

    Spp = data.frame(X[species[i, 1]], X[species[i, 2]])

    mis = map2_df(Spp[1], Spp[2], function(x, y) {
      for (n in 1:length(x)) {
        what = sample(c('left', 'right', 'swap', 'same'), size = 1, prob = probs[i, ])
        switch(
          what,
          left = {
            x[n] = x[n] + y[n]
            y[n] = 0
          },
          right = {
            y[n] = x[n] + y[n]
            x[n] = 0
          },
          swap = {
            tmp = y[n]
            y[n] = x[n]
            x[n] = tmp
          },
          same = {
            x[n] = x[n]
            y[n] = y[n]
          }
        )
      }
      misSpp = data.frame(x, y)
      colnames(misSpp) = c(names(Spp[1]), names(Spp[2]))
      return(misSpp)
    })
    X2[names(mis[1])] = mis[1]
    X2[names(mis[2])] = mis[2]
  }
  return(X2)
}

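There may be a few minor issues here, but overall it does what I need. Sorry for the lack of comments, but I did figure out how to easily put the shuffled data back into the data frame.

Thanks for pointing me to the "purrr" package and the switch function.

Example: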

library(vegan)
library(labdsv)
data(dune)

#First convert relative abundances to my best guess at the % values in Van der Maarel (1979)
code = c(1,2,3,4,5,6,7,8,9)
value = c(0.1,1,2.5,4.25,5.5,20,40,60.5,90)
veg = vegtrans(dune,code,value)

specpairs = matrix(c("Poaprat","Poatriv","Trifprat","Trifrepe"),ncol=2,byrow=T) #create matrix of species pairs
probmat = matrix(c(0.3,0,0,0.7,0,0.5,0,0.5),ncol=4,byrow=T)                     #create matrix of misclassification probabilities

veg2 = misID(veg,specpairs,probs = probmat) 

print(veg2)

【Comments】: