【Question title】: How to change a few column names in a data table
【Posted】: 2016-05-23 01:12:27
【Question】:

I have a data table with 10 columns.

town    
tc  
one  
two  
three   
four    
five    
six  
seven   
total

I need to compute the mean of the columns "one" through "total", for which I am using:

DTmean <- DT[,(lapply(.SD,mean)),by = .(town,tc),.SDcols=3:10]

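This produces the means, but I would like the column names to carry the suffix "_mean". How can this be done? The first two columns should keep their names "town" and "tc". I tried the following, but it renames every column from "one" to "total" to just "_mean":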

for (i in 3:10) {
  setnames(DTmean,i,paste0(names(i),"_mean"))
}
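
A note on why the loop above behaves this way: `i` is a plain integer with no names attribute, so `names(i)` is `NULL` and `paste0(NULL, "_mean")` evaluates to just "_mean". A minimal sketch of a corrected loop follows, assuming a small made-up table in place of the asker's data (the values are purely illustrative); it reads the current column name from the table itself rather than from the index.

```r
library(data.table)

# Hypothetical stand-in for DTmean; the values are made up for illustration
DTmean <- data.table(town = c("A", "B"), tc = c("C", "D"),
                     one = c(1.5, 2.5), two = c(3.5, 4.5))

i <- 3L
is.null(names(i))          # TRUE: an integer index carries no names
paste0(names(i), "_mean")  # "_mean", which is why every column got the same name

# Take the old name from the table, not from the index
for (i in 3:ncol(DTmean)) {
  setnames(DTmean, i, paste0(names(DTmean)[i], "_mean"))
}
names(DTmean)
```

The loop is shown only to explain the bug; `setnames` accepts vectors of old and new names, so a single vectorised call does the same job.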

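【Comments】:

  • Could you add your data to this question? You can use dput on your data and post the output.
  • @zyurnaidi Using names<- on a data.table creates a copy. Check address(DTmean) before and after your proposed solution. setnames is not affected by this.
  • Tip: when you ask a question, it is best to include a reproducible example. That makes it easier for others to help you.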
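
The point about copies in the comments can be checked with data.table's `address()` function. The sketch below uses a made-up two-column table (not the asker's data) and only demonstrates the `setnames` side: the table's address is unchanged after renaming. Whether `names<-` changes the address may depend on the R and data.table versions in use, so that case is left out here.

```r
library(data.table)

# Made-up table, only for illustration
DTmean <- data.table(town = c("A", "B"), one = c(1.5, 2.5))

before <- address(DTmean)
setnames(DTmean, "one", "one_mean")  # renames by reference
after <- address(DTmean)

identical(before, after)  # TRUE: same object, no copy was made
```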

Tags: r data.table


【Solution 1】:

If you want to do this the data.table way, you should use setnames, as follows:

setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_mean'))

Or:

cols <- names(DT)[3:10]
setnames(DTmean, cols, paste0(cols, '_mean'))

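Also, you don't need the .SDcols argument here, because you are aggregating all of the remaining columns. Using DT[, lapply(.SD,mean), by = .(town,tc)] should therefore give the same result as DT[, (lapply(.SD,mean)), by = .(town,tc), .SDcols=3:10].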
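
The equivalence can be checked directly. The following is a small sketch on a made-up four-row table (not the example dataset used in this answer): because `.SD` leaves out the `by` columns by default, both calls aggregate the same columns.

```r
library(data.table)

# Small made-up table: two grouping columns and two numeric columns
DT <- data.table(town = c("A", "A", "B", "B"),
                 tc   = c("C", "C", "D", "D"),
                 one  = c(1, 3, 5, 7),
                 two  = c(2, 4, 6, 8))

with_sdcols    <- DT[, lapply(.SD, mean), by = .(town, tc), .SDcols = 3:4]
without_sdcols <- DT[, lapply(.SD, mean), by = .(town, tc)]

all.equal(with_sdcols, without_sdcols)  # TRUE
```

`.SDcols` still matters when only some of the non-grouping columns should be aggregated, for example when the table also holds non-numeric columns.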


On the following example dataset:

library(data.table)
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each=10),
                 tc = rep(c('C','D'), 10),
                 one = rnorm(20,1,1),
                 two = rnorm(20,2,1),
                 three = rnorm(20,3,1),
                 four = rnorm(20,4,1),
                 five = rnorm(20,5,2),
                 six = rnorm(20,6,2),
                 seven = rnorm(20,7,2),
                 total = rnorm(20,28,3))

Using:

DTmean <- DT[, lapply(.SD,mean), by = .(town,tc)]
setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_mean'))

gives:

> DTmean
   town tc  one_mean two_mean three_mean four_mean five_mean six_mean seven_mean total_mean
1:    A  C 1.7368898 1.883586   3.358440  4.849896  4.742609 5.089877   6.792513   29.20286
2:    A  D 0.8906842 1.826135   3.267684  3.760931  6.210145 7.320693   5.571687   26.56142
3:    B  C 1.4037955 2.474836   2.587920  3.719658  3.446612 6.510183   8.309784   27.80012
4:    B  D 0.8103511 1.153000   3.360940  3.945082  5.555999 6.198380   8.652779   28.95180

In response to your comment: if you want to compute both the mean and the sd, you can do this (adapted from my answer here):

DT[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x))))), by = .(town,tc)]

which gives:

   town tc  one.mean    one.sd two.mean    two.sd three.mean  three.sd four.mean  four.sd five.mean   five.sd six.mean    six.sd seven.mean seven.sd total.mean total.sd
1:    A  C 0.2981842 0.3556520 1.578174 0.7788545   2.232366 0.9047046  4.896201 1.238877  4.625866 0.7436584 7.607439 1.7262628   7.949366 1.772771   28.94287 3.902602
2:    A  D 1.2099018 1.0205252 1.686068 1.5497989   2.671027 0.8323733  4.811279 1.404794  7.235969 0.7883873 6.765797 2.7719942   6.657298 1.107843   27.42563 3.380785
3:    B  C 0.9238309 0.6679821 2.525485 0.8054734   3.138298 1.0111270  3.876207 0.573342  3.843140 2.1991052 4.942155 0.7784024   6.783383 2.595116   28.95243 1.078307
4:    B  D 0.8843948 0.9384975 1.988908 1.0543981   3.673393 1.3505701  3.957534 1.097837  2.788119 1.9089660 6.463784 0.7642144   6.416487 2.041441   27.88205 3.807119
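
The combined call names its columns with a dot (`one.mean`, `one.sd`), whereas the question asked for an underscore suffix. One possible way to convert them is sketched below on a made-up table; the name `DTstats` and the regular expression are assumptions for illustration and not part of the original answer.

```r
library(data.table)

# Small made-up table, only for illustration
DT <- data.table(town = c("A", "A", "B", "B"),
                 tc   = c("C", "C", "D", "D"),
                 one  = c(1, 3, 5, 7),
                 two  = c(2, 4, 6, 10))

DTstats <- DT[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x))))),
              by = .(town, tc)]

# Replace the trailing ".mean" / ".sd" with "_mean" / "_sd", by reference
old <- names(DTstats)[-(1:2)]
setnames(DTstats, old, sub("\\.(mean|sd)$", "_\\1", old))
names(DTstats)
```

Anchoring the pattern at the end of the name keeps any dots that occur earlier in a column name untouched.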

However, it will most likely be better to store this in long format. For that you can use data.table's melt function, as follows:

cols <- names(DT)[3:10]
DT2 <- melt(DT[, as.list(unlist(lapply(.SD, function(x) list(mn = mean(x), sdev = sd(x))))), by = .(town,tc)], 
            id.vars = c('town','tc'), 
            measure.vars = patterns('.mn','.sdev'),
            value.name = c('mn','sdev'))[, variable := cols[variable]]

Or, more simply:

DT2 <- melt(DT, id.vars = c('town','tc'))[, .(mn = mean(value), sdev = sd(value)), by = .(town,tc,variable)]

which results in:

> DT2
    town tc variable         mn      sdev
 1:    A  C      one  0.2981842 0.3556520
 2:    A  D      one  1.2099018 1.0205252
 3:    B  C      one  0.9238309 0.6679821
 4:    B  D      one  0.8843948 0.9384975
 5:    A  C      two  1.5781743 0.7788545
 6:    A  D      two  1.6860675 1.5497989
 7:    B  C      two  2.5254855 0.8054734
 8:    B  D      two  1.9889082 1.0543981
 9:    A  C    three  2.2323655 0.9047046
10:    A  D    three  2.6710267 0.8323733
11:    B  C    three  3.1382982 1.0111270
12:    B  D    three  3.6733929 1.3505701
.....

In response to your last comments, you can detect outliers as follows:

DT3 <- melt(DT, id.vars = c('town','tc'))
DT3[, `:=` (mn = mean(value), sdev = sd(value)), by = .(town,tc,variable)
    ][, outlier := +(value < mn - sdev | value > mn + sdev)]

which gives:

     town tc variable      value         mn     sdev outlier
  1:    A  C      one  0.5681578  0.2981842 0.355652       0
  2:    A  D      one  0.5528128  1.2099018 1.020525       0
  3:    A  C      one  0.5214274  0.2981842 0.355652       0
  4:    A  D      one  1.4171454  1.2099018 1.020525       0
  5:    A  C      one  0.5820994  0.2981842 0.355652       0
 ---                                                        
156:    B  D    total 23.4462542 27.8820524 3.807119       1
157:    B  C    total 30.5934956 28.9524305 1.078307       1
158:    B  D    total 30.5618759 27.8820524 3.807119       0
159:    B  C    total 27.5940307 28.9524305 1.078307       1
160:    B  D    total 24.8378437 27.8820524 3.807119       0
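
The code above flags values more than one standard deviation from the group mean, while the comments mention a 2 sigma rule. A sketch of how the same idea could be parameterised follows; the multiplier `k` and the tiny made-up table are assumptions for illustration only.

```r
library(data.table)

# Made-up data: five identical values and one extreme value in a single group
DT <- data.table(town = rep("A", 6), tc = rep("C", 6),
                 one = c(1, 1, 1, 1, 1, 10))

k <- 2  # assumed multiplier: flag values more than k standard deviations from the mean

DT3 <- melt(DT, id.vars = c("town", "tc"))
DT3[, `:=`(mn = mean(value), sdev = sd(value)), by = .(town, tc, variable)]
DT3[, outlier := +(abs(value - mn) > k * sdev)]
DT3
```

With `k <- 1`, `abs(value - mn) > k * sdev` is the same condition as `value < mn - sdev | value > mn + sdev` used above.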

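【Comments】:

  • Thanks!! Your remark about .SDcols makes sense too; I should avoid it. I also tried to generate the standard deviation (sd) for columns 3:10... but it does not seem to work when I want both the mean and the sd for the same set of columns alongside the original values. How can this be done?
  • I could probably create a separate data table for the sd. DTsd
  • Thanks Maximus! This is very helpful. I extended the code to generate a few natural logs and then wanted to check whether the values fall within the upper and lower limits (2 sigma approach). For now I have done it by writing multi-line if conditions, which I have pasted below,
  • Thanks Maximus! This is very helpful. I extended the code to generate a few natural logs and then wanted to check whether the values fall within the upper and lower limits (2 sigma approach). For now I have done it by writing a multi-line if condition for every column, which I have pasted below. Is there a way to write this in a few lines?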
  • 'DTAoutlier = one.log.lowlimit & one.log = two.log.lowlimit & two.log = three.log.lowlimit & three.log