【Question title】: R list to wide (sparse) data frame
【Posted】: 2015-11-23 19:20:53
【Question】:

This is my first time here, so I hope I'm not breaking anything... I have a list of lists:

Browse[2]> head(str(mylist))
List of 33
 $ : chr [1:33] "0001" "space" "28" "night_club" ...
 $ : chr [1:33] "0002" "concert" "28" "night_club" ...
 $ : chr [1:31] "0003" "night_club" "24" "martial_arts" ...
 $ : chr [1:31] "0004" "stage" "24" "basketball" ...
 $ : chr [1:43] "0005" "night_club" "16" "concert" ...
 $ : chr [1:43] "0006" "night_club" "16" "concert" ...
 $ : chr [1:39] "0007" "night_club" "22" "concert" ...
 $ : chr [1:39] "0008" "night_club" "22" "concert" ...
 $ : chr [1:31] "0009" "night_club" "46" "martial_arts" ...
 $ : chr [1:31] "0010" "night_club" "46" "martial_arts" ...
 $ : chr [1:41] "0011" "night_club" "17" "martial_arts" ...
 $ : chr [1:41] "0012" "night_club" "17" "martial_arts" ...
 $ : chr [1:29] "0013" "concert" "23" "night_club" ...
 $ : chr [1:29] "0014" "concert" "23" "night_club" ...
 $ : chr [1:25] "0015" "night_club" "26" "concert" ...
 $ : chr [1:31] "0016" "night_club" "42" "concert" ...
 $ : chr [1:31] "0017" "night_club" "42" "concert" ...
 $ : chr [1:31] "0018" "night_club" "25" "wrestling" ...
 $ : chr [1:31] "0019" "night_club" "25" "wrestling" ...
 $ : chr [1:33] "0020" "night_club" "46" "wrestling" ...
 $ : chr [1:33] "0021" "night_club" "46" "wrestling" ...
 $ : chr [1:41] "0022" "concert" "21" "stage" ...
 $ : chr [1:41] "0023" "concert" "21" "stage" ...
 $ : chr [1:55] "0024" "basketball" "8" "concert" ...
 $ : chr [1:55] "0025" "basketball" "8" "concert" ...
 $ : chr [1:37] "0026" "bald_person" "26" "martial_arts" ...
 $ : chr [1:37] "0027" "bald_person" "26" "martial_arts" ...
 $ : chr [1:37] "0028" "night_club" "32" "business_meeting" ...
 $ : chr [1:37] "0029" "night_club" "32" "business_meeting" ...
 $ : chr [1:15] "0030" "night_club" "59" "stage" ...
 $ : chr [1:37] "0031" "stage" "12" "night_club" ...
 $ : chr [1:37] "0032" "stage" "12" "night_club" ...
 $ : chr [1:33] "0033" "night_club" "23" "portrait" ...

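I want to convert this list into a wide-format data frame whose first column is the first element of each inner list (i.e. "0001", "0002", etc.) and whose remaining columns are all the categories that occur in the file: "space", "night_club", "concert", "martial_arts", "wrestling", and so on. This gives a very wide data frame: each row starts with an id (0001, 0002, 0003, ...), and for each row, if a category is present for that id, its column is filled with the value that follows the category in the list (for example, "space" -> 28 in the first row).

I tried building a normalized (long) data frame with a loop and then converting it to wide format, but that will be a bad idea as the data grows: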

long <- list()  # collects one small data frame per input line
for (file in files) {  # iterate over files in folder
  mylist <- strsplit(readLines(file), ":")
  for (elem in mylist) {
    # elem[1] is the frame id; the rest alternate category, value
    pairs <- matrix(elem[-1], ncol = 2, byrow = TRUE)
    long[[length(long) + 1]] <- data.frame(
      frameid  = elem[1],
      category = pairs[, 1],
      value    = as.numeric(pairs[, 2])
    )
  }
}
# stack the pieces into one normalized (long) data frame
dataframe <- do.call(rbind, long)

Reproducible input/output example. Input:

 list(c("0001", "space", "28", "night_club", "25"), c("0002", 
"concert", "28", "night_club", "26"), c("0003", "night_club", 
"24", "martial_arts", "27"), c("0004", "stage", "24", "basketball", 
"30"))

Output:

Dataframe
frameid, cat_space, cat_night_club, cat_concert, cat_martial_arts, cat_stage, cat_basketball
0001, 28, 25, 0, 0, 0, 0
0002, 0, 26, 28, 0, 0, 0
0003, 0, 24, 0, 27, 0, 0
0004, 0, 0, 0, 0, 24, 30

【Question comments】:

Tags: r list dataframe


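【Solution 1】:

Here is one possibility. I wrote the answer as a function and commented what happens at each stage. The basic idea is:

  1. Create a column containing only the first item of each list element.
  2. Create a two-column matrix containing the remaining items. This assumes the data are neatly paired.
  3. Create a data.frame that puts these two pieces together.
  4. Use xtabs to convert the output to wide format. Note that if a combination of "ID" and "var" is repeated, the values will be added together, because xtabs is used.

The function is as follows: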

myFun <- function(inList) {
  ## Extract the first value in each list element
  ID <- vapply(inList, `[`, character(1L), 1)
  ## Convert the remaining elements into a two column matrix, first
  ##   column as variable, second column as value. Bind all list
  ##   elements together to a single 2-column matrix.
  varval <- do.call(rbind, lapply(inList, function(x) {
    matrix(x[-1], ncol = 2, byrow = TRUE, dimnames = list(NULL, c("var", "val")))
  }))
  ## Create a data.frame where ID is repeated to the same number of rows
  ##   as the matrices found in varval.
  temp <- data.frame(ID = rep(ID, (lengths(inList)-1)/2), varval)
  ## Convert the val columns to numeric
  temp$val <- as.numeric(as.character(temp$val))
  ## Use xtabs to go from a "long" form to a "wide" form
  xtabs(val ~ ID + var, temp)
}
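
The caveat in point 4 (repeated "ID"/"var" combinations get summed) can be checked with a small sketch. The data frame `dup` below is made up for illustration and is not part of the question's data; it just mimics the long form that the function builds internally.

```r
## Hypothetical long-format data: ID "0001" lists "night_club" twice
dup <- data.frame(
  ID  = c("0001", "0001", "0001", "0002"),
  var = c("night_club", "night_club", "space", "concert"),
  val = c(25, 5, 28, 28)
)

## xtabs sums val within each ID/var cell
tab <- xtabs(val ~ ID + var, dup)
tab["0001", "night_club"]  # 25 + 5 = 30

## One way to detect such repeats before tabulating
has_dups <- any(duplicated(dup[c("ID", "var")]))
has_dups  # TRUE
```

If summing is not what you want, checking for duplicates like this before calling xtabs lets you decide how to resolve them first.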

Here is the function applied to your sample data (assuming your data is called "L"):

myFun(L)
#       var
# ID     basketball concert martial_arts night_club space stage
#   0001          0       0            0         25    28     0
#   0002          0      28            0         26     0     0
#   0003          0       0           27         24     0     0
#   0004         30       0            0          0     0    24
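
Since the question title asks for a sparse result and most cells above are zero, one option that may be worth trying is the `sparse = TRUE` argument of xtabs, which returns a sparse matrix from the Matrix package instead of a dense table. This is only a sketch on made-up long-format data (`long` is a hypothetical name), and it assumes the Matrix package is installed.

```r
library(Matrix)  # xtabs(sparse = TRUE) returns a Matrix-package object

## Hypothetical long-format data, same shape as the function's internal "temp"
long <- data.frame(
  ID  = c("0001", "0001", "0002", "0002"),
  var = c("space", "night_club", "concert", "night_club"),
  val = c(28, 25, 28, 26),
  stringsAsFactors = TRUE
)

## Only the non-zero cells are stored
sp <- xtabs(val ~ ID + var, long, sparse = TRUE)
dim(sp)               # 2 rows (IDs) by 3 columns (categories)
sp["0001", "space"]   # 28
```

Whether this helps depends on what you do next: many modelling functions accept sparse matrices, but a data.frame conversion makes the result dense again.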

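【Comments】:

  • Thanks. Looks good. I'll try it on the full data set and see how it performs.
  • One thing you didn't mention is how to convert the normalized data frame into a wide-format data frame, but I guess that's not hard.
  • @Neuril, ??? Isn't the output already wide? (See point 4.) Or do you mean that it is a matrix rather than a data.frame? If that's the concern, try as.data.frame.matrix(myFun(L))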
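
To make the last comment concrete, here is a sketch of going from the xtabs result to a data.frame laid out like the one requested in the question (a frameid column plus cat_-prefixed columns). The long form is rebuilt inline so the snippet runs on its own; the names `long`, `tab` and `wide` are just illustrative.

```r
L <- list(c("0001", "space", "28", "night_club", "25"),
          c("0002", "concert", "28", "night_club", "26"),
          c("0003", "night_club", "24", "martial_arts", "27"),
          c("0004", "stage", "24", "basketball", "30"))

## Long form, built the same way as inside the answer's function
ID <- vapply(L, `[`, character(1L), 1)
varval <- do.call(rbind, lapply(L, function(x) {
  matrix(x[-1], ncol = 2, byrow = TRUE)
}))
long <- data.frame(ID  = rep(ID, (lengths(L) - 1) / 2),
                   var = varval[, 1],
                   val = as.numeric(varval[, 2]))

## Wide table, then a plain data.frame with the requested column names
tab  <- xtabs(val ~ ID + var, long)
wide <- as.data.frame.matrix(tab)
names(wide) <- paste0("cat_", names(wide))
wide <- cbind(frameid = rownames(wide), wide)
rownames(wide) <- NULL
wide
```

The columns come out in alphabetical order of category, not in the order shown in the question; reorder them afterwards if that matters.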