[Question title]: R: Comparing Subgroups From Different Datasets
[Posted]: 2023-02-03 14:18:50
[Question]:

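I am working with the R programming language.

I have the following dataset containing the heights and weights of Canadians. Using the weight values (kg), I split the rows into 100 quantile bins (ntiles) and calculate the average height (cm) within each bin: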

library(dplyr)
library(gtools)
set.seed(123)
canada = data.frame(height =  rnorm(10000,150,10), weight = rnorm(10000,90, 10))

Part_1 = canada %>% 
  mutate(quants = quantcut(weight, 100),
         rank = as.numeric(quants)) %>%
  group_by(quants) %>% 
  mutate(min = min(weight), max = max(weight), count = n(), avg_height = mean(height))

Part_1 = Part_1 %>% distinct(rank, .keep_all = TRUE)

> Part_1
# A tibble: 100 x 8
# Groups:   quants [100]
   height weight quants         rank   min   max count avg_height
    <dbl>  <dbl> <fct>         <dbl> <dbl> <dbl> <int>      <dbl>
 1   144.  114.  (110.2,113.9]    99 110.  114.    100       150.
 2   148.   88.3 (88.12,88.38]    44  88.1  88.4   100       149.
 3   166.   99.3 (99.1,99.52]     83  99.1  99.5   100       152.
 4   151.   84.3 (84.14,84.44]    29  84.1  84.4   100       150.

For example, I see that 100 people weigh between 110.2 and 113.9 kg, and the average height of these people is 150 cm.

Now, suppose I have a similar dataset for Americans:

set.seed(124)
usa = data.frame(height =  rnorm(10000,150,10), weight = rnorm(10000,90, 10))

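My question: based on the weight ranges I calculated from the Canadian dataset, I want to know how many Americans fall within each of these Canadian ranges, and what the average height of the Americans in each range is.

For example:

  • In the Canadian dataset, I see that 100 people weigh between 110.2 and 113.9 kg, and the average height of these people is 150 cm
  • How many Americans weigh between 110.2 and 113.9 kg, and what is the average height of these Americans?

I know I can do this manually for each rank: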

americans_in_canadian_rank99 = usa %>% 
  filter(weight > 110.2 & weight < 113.9) %>% 
  group_by() %>% 
  summarize(count = n(), avg_height = mean(height))


americans_in_canadian_rank44 = usa %>% 
  filter(weight > 88.1 & weight < 88.4) %>% 
  group_by() %>% 
  summarize(count = n(), avg_height = mean(height))

In the end, I am looking for output like this:

# number of rows should be = number of unique ranks
  canadian_rank min_weight max_weight canadian_count canadian_avg_height american_count american_avg_height
1            99      110.2      113.9            100                 150            116                 150
2            44       88.1       88.4            100                 149            154                 150

Can someone help me come up with a better way to do this?

Thanks!

[Question discussion]:

    Tags: r


    [Solution 1]:

    Using data.table you can do this:

    library(data.table)
    library(stringr)
    
    dt1 <- as.data.table(usa)
    dt1 <- dt1[, c("min", "max") := weight]
    
    dt2 <- as.data.table(Part_1 %>% select("quants", "rank"))
    dt2 <- cbind(dt2[,.(rank)], 
                 setDT(tstrsplit(str_sub(dt2$quants, 2, -2), ",", fixed = TRUE, names = c("min", "max"))))
    dt2 <- dt2[, lapply(.SD, as.numeric)]
    setkey(dt2, min, max)
    
    dt1 <- dt1[, rank := dt2$rank[foverlaps(dt1, dt2, by.x = c("min", "max"), by.y = c("min", "max"), which = TRUE)$yid]] %>% 
      select(-c("min", "max"))
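
The `foverlaps` step above treats each weight as a zero-width interval `[weight, weight]` and looks up which `[min, max]` bin contains it. A minimal sketch of that idea on toy data follows. The `bins` and `people` tables are hypothetical, and `mult = "first"` is an addition that is not in the answer: with `which = TRUE` it makes `foverlaps` return one index per row of `people`, so the assignment cannot fail with a length mismatch even if the bin table contained duplicate rows.

```r
library(data.table)

# Hypothetical bins: one row per rank, keyed on the interval columns
bins <- data.table(rank = 1:3, min = c(0, 10, 20), max = c(10, 20, 30))
setkey(bins, min, max)

# Hypothetical observations; each weight becomes the interval [weight, weight]
people <- data.table(weight = c(5, 15, 25, 35))
people[, c("min", "max") := weight]

# One bin index per person; NA when the weight lies outside every bin
hits <- foverlaps(people, bins, by.x = c("min", "max"),
                  which = TRUE, mult = "first")

people[, rank := bins$rank[hits]]
people[, c("min", "max") := NULL]
print(people)
```

Note that `foverlaps` uses closed intervals, so a weight exactly equal to a shared boundary would match two adjacent bins; `mult = "first"` then keeps the lower one.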
    

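    Edit

    I completely missed the last part. If you want that as well, it should be relatively straightforward to continue from the previous step (you can use dplyr if you prefer):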

    dt3 <- rbind(canada %>% 
                   mutate(quants = quantcut(weight, 100),
                          rank = as.numeric(quants),
                          country = "Canada") %>%
                   as.data.table(),
                 copy(dt1)[, country := "USA"], fill = TRUE)
    dt3 <- dt3[,.(count = .N, avg_height = mean(height)), by = c("rank", "country")] %>% 
      dcast(rank ~ country, value.var = c("count", "avg_height")) %>% 
      merge(dt2 %>% rename("min_weight" = "min", "max_weight" = "max"), by = c("rank"), all.x = TRUE)
    

    Edit 2

    Alternatively, you can do something similar with the cut function, without having to learn anything from data.table:

    rank_breaks <- Part_1 %>% 
      mutate(breaks = sub(",.*", "", str_sub(quants, 2)) %>% as.numeric()) %>%
      arrange(rank) %>% 
      pull(breaks)
    
    # Here I change minimum and maximum of groups 1 and 100 to -Inf and Inf respectively. 
    # If you do not wish to do so, you can disregard it and run `rank_breaks <- c(rank_breaks, max(canada$weight))` instead  
    rank_breaks[1] <- -Inf
    rank_breaks <- c(rank_breaks, Inf)
    
    usa <- usa %>% 
      mutate(rank = cut(weight, breaks = rank_breaks, labels = c(1:100)))
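
To get from the `rank` column to the table requested in the question, one possible continuation is to summarise both data frames by rank and join the results. The sketch below is self-contained and rests on assumptions that are not in the answer: it derives the breaks with base `quantile()` instead of parsing the `quantcut()` labels, and `weight_range` and `summarise_by_rank` are hypothetical names introduced here.

```r
library(dplyr)

set.seed(123)
canada <- data.frame(height = rnorm(10000, 150, 10), weight = rnorm(10000, 90, 10))
set.seed(124)
usa <- data.frame(height = rnorm(10000, 150, 10), weight = rnorm(10000, 90, 10))

# Assumption: percentile breaks from quantile(), not from quantcut() labels
breaks <- quantile(canada$weight, probs = seq(0, 1, by = 0.01), names = FALSE)

# Keep the observed Canadian range of every rank for the final table
weight_range <- data.frame(canadian_rank = 1:100,
                           min_weight = breaks[-101],
                           max_weight = breaks[-1])

# Widen the outer breaks so that no weight falls outside the bins
breaks[c(1, 101)] <- c(-Inf, Inf)

# Hypothetical helper: count and mean height per Canadian rank
summarise_by_rank <- function(df, prefix) {
  df %>%
    mutate(canadian_rank = as.integer(cut(weight, breaks = breaks))) %>%
    group_by(canadian_rank) %>%
    summarise(count = n(), avg_height = mean(height), .groups = "drop") %>%
    rename_with(~ paste(prefix, .x, sep = "_"), c(count, avg_height))
}

result <- weight_range %>%
  left_join(summarise_by_rank(canada, "canadian"), by = "canadian_rank") %>%
  left_join(summarise_by_rank(usa, "american"), by = "canadian_rank")

head(result)
```

`result` has one row per Canadian rank. Because the outer breaks are widened to `-Inf` and `Inf`, every American is counted in some rank; dropping that step would leave out-of-range weights unassigned.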
    

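    [Discussion]:

    • @Darmist: Thank you for your answer! I ran this line of code:
    • dt1 <- dt1[, rank := dt2$rank[foverlaps(dt1, dt2, by.x = c("min", "max"), by.y = c("min", "max"), which = TRUE)$yid]] %>% select(-c("min", "max"))
    • I get the following error: [.data.table(dt1, , :=(rank, dt2$rank[foverlaps(dt1, dt2, : Supplied 999802 items to be assigned to 10000 items of column 'rank'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
    • Do you know what I am doing wrong? Thank you so much!
    • It is hard to say exactly what happened without seeing it. My guess is that you did not run the code in full, or ran some extra code, because it works fine on my machine. In case that is not the reason, I have added an alternative solution that should be simpler and easier to understand (and works with dplyr).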