【Question title】:How to transform multi-row column headers and melt to long form in R data.table
【Posted】:2020-04-23 04:04:32
【Question】:

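I have a wide table that needs to be converted to long format. It has 200+ columns, and the column headers span several rows, as shown below:

Original data: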

# dt
dt <- data.table("1" = c(NA,"Place","dan","uan","yan"),
                 "2" = c(NA,"Place_2","adan","duan","eyan"),
                 "3" = c("something","Male",1253,6643,4325),
                 "4" = c(1998,"Female",624,623,55),
                 "5" = c(NA,"Trans",13,51,51),
                 "6" = c("something2","Male",126,63643,725),
                 "7" = c(1999,"Female",284,243,557),
                 "8" = c(NA,"Trans",138,541,11))

From column 3 onward the columns come in groups of three, and each group has one year value in its first row:

# In the sample data the years sit in row 1 of columns 4, 7, ...
dt[1, seq(4, ncol(dt), by = 3), with = FALSE]

How can I efficiently collapse these multi-row headers into single column names so that the table can be melted?

Goal:

Place Place_2   Sex     Year    num
dan   adan      Male    1998    1253
dan   adan      Female  1998    624
dan   adan      Trans   1998    13
dan   adan      Male    1999    126
dan   adan      Female  1999    284
dan   adan      Trans   1999    138
uan   duan      Male    1998    6643
....

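【Comments】:

  • Yes, only the first row contains the year value, and it sits in the middle of each group of three types. It is like the double-row column headers we often see in Excel sheets.

Tags: r data.table multiple-columns melt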


【Solution 1】:

Here is what I tried. I think arranging the column names is the key here. The comments in the code below explain each step.

library(data.table)
library(magrittr)  # provides the %>% pipe used below

# Create new column names. Get the 1st row, search for years, repeat each year
# three times, and paste them with three levels of sex.

unlist(dt[1,]) %>% 
grep(pattern = "\\d{4}", value = TRUE) %>% 
rep(each = 3) %>% 
paste(., c("Male", "Female", "Trans"), sep = "_") -> foo

# Set new column names.
setnames(dt, c("Place_1", "Place_2", foo))

# Then drop the two header rows and reshape the data to long format. Split
# `variable` into two new columns (year and sex), and remove `variable`.

melt(dt[-(1:2)], id.vars = 1:2, measure = patterns("^\\d{4}"))[,
        c("year", "sex") := tstrsplit(variable, "_", fixed = TRUE)][, -"variable"] -> out

# Sort the result by Place_1 and Place_2 (only to make the output easier to read).
out[order(Place_1, Place_2)][]

#    Place_1 Place_2 value year    sex
# 1:     dan    adan  1253 1998   Male
# 2:     dan    adan   624 1998 Female
# 3:     dan    adan    13 1998  Trans
# 4:     dan    adan   126 1999   Male
# 5:     dan    adan   284 1999 Female
# 6:     dan    adan   138 1999  Trans
# 7:     uan    duan  6643 1998   Male
# 8:     uan    duan   623 1998 Female
# 9:     uan    duan    51 1998  Trans
#10:     uan    duan 63643 1999   Male
#11:     uan    duan   243 1999 Female
#12:     uan    duan   541 1999  Trans
#13:     yan    eyan  4325 1998   Male
#14:     yan    eyan    55 1998 Female
#15:     yan    eyan    51 1998  Trans
#16:     yan    eyan   725 1999   Male
#17:     yan    eyan   557 1999 Female
#18:     yan    eyan    11 1999  Trans
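The answer above hard-codes the three sex labels. With 200+ real columns it may be convenient to read them from the second header row instead. The following is only a sketch under stated assumptions: the first two columns are id columns, every later group of three columns carries exactly one four-digit year in row 1, and the labels in row 2 contain no underscore. The names `n_id`, `group_size`, `wide`, and `year_sex` are illustrative and do not come from the original post.

```r
library(data.table)

# Same toy data as in the question.
dt <- data.table("1" = c(NA, "Place", "dan", "uan", "yan"),
                 "2" = c(NA, "Place_2", "adan", "duan", "eyan"),
                 "3" = c("something", "Male", 1253, 6643, 4325),
                 "4" = c(1998, "Female", 624, 623, 55),
                 "5" = c(NA, "Trans", 13, 51, 51),
                 "6" = c("something2", "Male", 126, 63643, 725),
                 "7" = c(1999, "Female", 284, 243, 557),
                 "8" = c(NA, "Trans", 138, 541, 11))

n_id       <- 2L  # assumption: the first two columns identify a row
group_size <- 3L  # assumption: three measure columns per year

hdr_year <- unlist(dt[1], use.names = FALSE)  # header row holding the years
hdr_sex  <- unlist(dt[2], use.names = FALSE)  # header row holding ids and sex labels
measure_idx <- seq(n_id + 1L, ncol(dt))

# Keep only the cells that are exactly a four-digit year, in column order.
years <- grep("^\\d{4}$", hdr_year[measure_idx], value = TRUE)
stopifnot(length(years) * group_size == length(measure_idx))

# One name per measure column, e.g. "1998_Male".
new_names <- paste(rep(years, each = group_size), hdr_sex[measure_idx], sep = "_")

# Subsetting rows returns a new table, so renaming it leaves dt untouched.
wide <- dt[-(1:2)]
setnames(wide, c(hdr_sex[seq_len(n_id)], new_names))

out <- melt(wide, id.vars = seq_len(n_id), variable.name = "year_sex",
            value.name = "num", variable.factor = FALSE)
out[, c("Year", "Sex") := tstrsplit(year_sex, "_", fixed = TRUE)]
out[, year_sex := NULL]
# Every column was read as character, so convert the numeric ones explicitly.
out[, `:=`(Year = as.integer(Year), num = as.numeric(num))]
```

Because `dt` itself is never renamed here, the sketch can be rerun on the same table, whereas `setnames(dt, ...)` in the answer modifies `dt` by reference.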

【Comments】:

  • A clean approach, without any hassle.


【Solution 2】:

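The structure of this data.table is unusual. Here is one approach. Steps 1-3 prepare the data.table for melt.

If you need the final output to look exactly like your expected output, you may need to create an ID column before melt and then order dt5 by several columns. Let me know if you need help with that.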

library(data.table)

# Step 1: Get the year value
col_num <- which(c(3:ncol(dt) %% 3 == 2)) + 1
year_vec <- as.numeric(as.vector(t(dt[1, ..col_num])))

# Step 2: Create all combinations of year and Male, Female, and Trans
year_sex_dt <- CJ(as.vector(t(dt[2, 3:5])), year_vec)
year_sex_dt[, V1 := factor(V1, levels = c("Male", "Female", "Trans"))]
keycol <- c("year_vec", "V1")
setorderv(year_sex_dt, keycol)
new_name <- paste(year_sex_dt[, V1], year_sex_dt[, year_vec], sep = "_")

# Step 3: Assign column names
dt2 <- setnames(dt[c(-1, -2)], c(as.vector(t(dt[2, 1:2])), new_name)) 

# Step 4: melt the data.table 
dt3 <- melt(dt2, id.vars = 1:2, variable.name = "Sex_Year", value.name = "num")
dt4 <- dt3[, c("Sex", "Year") := tstrsplit(Sex_Year, "_", fixed = TRUE)]
dt4[, Sex_Year := NULL]
dt5 <- dt4[, c("Place", "Place_2", "Sex", "Year", "num")]

head(dt5)
#    Place Place_2    Sex Year  num
# 1:   dan    adan   Male 1998 1253
# 2:   uan    duan   Male 1998 6643
# 3:   yan    eyan   Male 1998 4325
# 4:   dan    adan Female 1998  624
# 5:   uan    duan Female 1998  623
# 6:   yan    eyan Female 1998   55
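The ordering mentioned at the top of this answer could look roughly like the sketch below. To stay self-contained it rebuilds only the six rows printed above as a stand-in for `dt5`, and it sorts by `Place` alphabetically. If the original row order matters, an ID column created before melt, as the answer suggests, would be the first sort key instead.

```r
library(data.table)

# Stand-in for the first six rows of dt5 shown above.
dt5 <- data.table(
  Place   = c("dan", "uan", "yan", "dan", "uan", "yan"),
  Place_2 = c("adan", "duan", "eyan", "adan", "duan", "eyan"),
  Sex     = c("Male", "Male", "Male", "Female", "Female", "Female"),
  Year    = "1998",
  num     = c("1253", "6643", "4325", "624", "623", "55")
)

# A factor sorts by its level order, so Male comes before Female and Trans
# instead of the alphabetical Female, Male, Trans.
dt5[, Sex := factor(Sex, levels = c("Male", "Female", "Trans"))]

# Sort in place by place, then year, then sex.
setorder(dt5, Place, Place_2, Year, Sex)
```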

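【Comments】:

  • Thanks! Perfect. Could you explain what t() and CJ() actually do? As far as I can tell, t() splits the one-row result into its elements (directly?), and CJ() crosses the two columns to produce every combination (does it also work for more than 2 columns?). Because the unique = T argument was not added, CJ() expanded my real data with 200+ columns into 19k+ rows.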
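As a rough illustration of the two functions asked about in the comment, the toy example below shows `t()` flattening a one-row table and `CJ()` building every combination of its inputs, with and without `unique = TRUE`. The variable names and values are made up for the example. Duplicated inputs may be what inflated the row count on the real data, but that is only a guess.

```r
library(data.table)

one_row <- data.table(a = "Male", b = "Female", c = "Trans")

# t() transposes the 1-row table into a 3 x 1 character matrix;
# as.vector() then drops the dimensions and leaves a plain vector.
sex_vec <- as.vector(t(one_row))

# CJ() is a cross join: one row per combination of its arguments.
# It accepts any number of vectors, not only two.
combos <- CJ(sex = sex_vec, year = c(1998, 1999))
n_combos <- nrow(combos)  # 3 labels x 2 years

# Duplicated input values are crossed as well unless unique = TRUE is set.
dup_years <- c(1998, 1998, 1999)
n_dup    <- nrow(CJ(sex = sex_vec, year = dup_years))
n_unique <- nrow(CJ(sex = sex_vec, year = dup_years, unique = TRUE))
```

Here `n_dup` counts 3 x 3 combinations while `n_unique` counts 3 x 2, so passing `unique = TRUE` is a cheap safeguard when the header values may repeat.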