
【Question title】: Mixed dataframe of list of character vectors into uniform dataframe
【Posted】: 2016-08-20 00:06:41
【Question description】:

I am trying to use the stringr package to split a string into columns.

> df <- dput(head(facs,3))
structure(list(geo_accession = structure(1:3, .Names = c("V2", 
"V3", "V4"), .Label = c("GSM1494875", "GSM1494877", "GSM1494879", 
"GSM1494881", "GSM1494883", "GSM1494885", "GSM1494887", "GSM1494889", 
"GSM1494891", "GSM1494893", "GSM1494895", "GSM1494897", "GSM1494899", 
"GSM1494901", "GSM1494903", "GSM1494906", "GSM1494908", "GSM1494910", 
"GSM1494912", "GSM1494914", "GSM1494917", "GSM1494919", "GSM1494921", 
"GSM1494923"), class = "factor"), title = structure(1:3, .Names = c("V2", 
"V3", "V4"), .Label = c("2818 - Akt treated TIL - repeat 1 - mAdbID:125971", 
"2818 - Akt treated TIL - repeat 2 - mAdbID:125972", "2818 - Akt treated TIL - repeat 3 - mAdbID:125973", 
"2818 - Akt treated TIL - repeat 4 - mAdbID:125974", "2818 - Untreated TIL - repeat 1 - mAdbID:125975", 
"2818 - Untreated TIL - repeat 2 - mAdbID:125976", "2818 - Untreated TIL - repeat 3 - mAdbID:125977", 
"2818 - Untreated TIL - repeat 4 - mAdbID:125978", "3289 - Akt treated TIL - repeat 1 - mAdbID:125979", 
"3289 - Akt treated TIL - repeat 2 - mAdbID:125980", "3289 - Akt treated TIL - repeat 3 - mAdbID:125981", 
"3289 - Akt treated TIL - repeat 4 - mAdbID:125982", "3289 - Untreated TIL - repeat 1 - mAdbID:125983", 
"3289 - Untreated TIL - repeat 2 - mAdbID:125984", "3289 - Untreated TIL - repeat 3 - mAdbID:125985", 
"3289 - Untreated TIL - repeat 4 - mAdbID:125986", "3784 - Akt treated TIL - repeat 1 - mAdbID:125987", 
"3784 - Akt treated TIL - repeat 2 - mAdbID:125988", "3784 - Akt treated TIL - repeat 3 - mAdbID:125989", 
"3784 - Akt treated TIL - repeat 4 - mAdbID:125990", "3784 - Untreated TIL - repeat 1 - mAdbID:125991", 
"3784 - Untreated TIL - repeat 2 - mAdbID:125992", "3784 - Untreated TIL - repeat 3 - mAdbID:125993", 
"3784 - Untreated TIL - repeat 4 - mAdbID:125994"), class = "factor")), .Names = c("geo_accession", 
"title"), row.names = c("GSM1494875", "GSM1494877", "GSM1494879"
), class = "data.frame")

At this point it is a 2-column data frame.

> sapply(df, class)
geo_accession         title 
 "factor"      "factor"

If I use str_split_fixed,

> df$title = str_split_fixed(df[,"title"], " - ", 4)
> df
           geo_accession title.1         title.2  title.3       title.4
GSM1494875    GSM1494875    2818 Akt treated TIL repeat 1 mAdbID:125971
GSM1494877    GSM1494877    2818 Akt treated TIL repeat 2 mAdbID:125972

> sapply(df, class)
geo_accession         title 
 "factor"      "matrix"

What I would like to get is a 5-column data frame, not a 2-column mixed data frame (factor, matrix).

> df
       geo_accession title1          title2   title3        title4
GSM1494875    GSM1494875   2818 Akt treated TIL repeat 1 mAdbID:125971
GSM1494877    GSM1494877   2818 Akt treated TIL repeat 2 mAdbID:125972

> sapply(df, class)
geo_accession       title.1       title.2       title.3       title.4 
 "factor"      "factor"      "factor"      "factor"      "factor"

I don't know how to clean up the mixed data frame. How can I convert it into a 5-column data frame?

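【Comments】:

See ?tidyr::separate
Please provide a reproducible example.
Please use dput to share your data in a reproducible way, as described in the R tag info.
Thanks to the other SO users for the answers! Is this an acceptable format now?

Tags: r dataframe stringr mixed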

【Solution 1】:

You can use separate from the tidyr package:

library(tidyr)
df %>% separate(title, into = paste0("title", 1:4), sep = " - ")

           geo_accession title1          title2   title3        title4
1 GSM1494875  GSM1494875   2818 Akt treated TIL repeat 1 mAdbID:125971
2  GSM1494877 GSM1494877   2818 Akt treated TIL repeat 2 mAdbID:125972
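
As a self-contained variant of the call above, here is a minimal sketch on a two-row stand-in for the question's data. The stand-in uses character columns rather than factors, and `convert = TRUE` is an optional addition that is not part of the original answer; it asks tidyr to convert the new columns where it can, so the leading number becomes an integer.

```r
library(tidyr)

# Two-row stand-in for the question's data (character columns, an assumption)
df <- data.frame(
  geo_accession = c("GSM1494875", "GSM1494877"),
  title = c("2818 - Akt treated TIL - repeat 1 - mAdbID:125971",
            "2818 - Akt treated TIL - repeat 2 - mAdbID:125972"),
  stringsAsFactors = FALSE
)

# Split `title` on " - " into four new columns; the original column is dropped
out <- separate(df, title,
                into = paste0("title", 1:4),
                sep = " - ",
                convert = TRUE)  # optional: type-convert the new columns

str(out)
```

If some titles might contain more or fewer than four pieces, the `extra` and `fill` arguments of separate control how those rows are handled.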

【Discussion】:

Works great! Being able to chain operations together makes the code more readable to me :)

【Solution 2】:

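str_split_fixed returns a character matrix (this is why sapply reports "matrix" for title). Assigning that matrix to a single column is what causes the problem. One fix is to assign it to multiple columns; alternatively, we can flatten the matrix column with do.call(data.frame, ...), i.e.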

facs1 <- do.call(data.frame, facs)
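
The other route mentioned, assigning the split result to several columns at once, could look roughly like the sketch below. The two-row `df` is a stand-in for the question's data, and the names `parts` and `new_cols` are made up for illustration.

```r
library(stringr)

# Two-row stand-in for the question's data (factors, as in the original)
df <- data.frame(
  geo_accession = c("GSM1494875", "GSM1494877"),
  title = c("2818 - Akt treated TIL - repeat 1 - mAdbID:125971",
            "2818 - Akt treated TIL - repeat 2 - mAdbID:125972"),
  stringsAsFactors = TRUE
)

# str_split_fixed gives a 2 x 4 character matrix
parts <- str_split_fixed(as.character(df$title), " - ", 4)

# Assign each matrix column to its own new data frame column
new_cols <- paste0("title.", 1:4)
df[new_cols] <- parts

# Drop the original combined column
df$title <- NULL

sapply(df, class)
```

Note that the new columns come out as character rather than factor; wrap them in factor() afterwards if factors are really needed.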

Or we can use a base R approach: strip the spaces around each "-", read the result with read.csv/read.table, and then cbind it with the first column

facs2 <- cbind(facs[1], read.csv(text=gsub("\\s+-\\s+", "-", facs$title), 
          sep="-", header=FALSE, col.names = paste0("title.", 1:4), stringsAsFactors=FALSE))
facs2
#           geo_accession title.1         title.2  title.3       title.4
#GSM1494875    GSM1494875    2818 Akt treated TIL repeat 1 mAdbID:125971
#GSM1494877    GSM1494877    2818 Akt treated TIL repeat 2 mAdbID:125972

str(facs2)
#'data.frame':   2 obs. of  5 variables:
#$ geo_accession: chr  "GSM1494875" "GSM1494877"
#$ title.1      : int  2818 2818
#$ title.2      : chr  "Akt treated TIL" "Akt treated TIL"
#$ title.3      : chr  "repeat 1" "repeat 2"
#$ title.4      : chr  "mAdbID:125971" "mAdbID:125972"
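
The read.csv route above splits on a bare "-", so it assumes the fields themselves never contain a hyphen. A base R sketch that splits on the full " - " delimiter instead, using strsplit (a swapped-in technique, not part of the original answer), might look like this; `titles`, `ids`, and `parts` are illustrative names.

```r
# Two-row stand-in for the question's columns
ids <- c("GSM1494875", "GSM1494877")
titles <- c("2818 - Akt treated TIL - repeat 1 - mAdbID:125971",
            "2818 - Akt treated TIL - repeat 2 - mAdbID:125972")

# Split each title on the literal " - " and stack the pieces into a matrix
parts <- do.call(rbind, strsplit(titles, " - ", fixed = TRUE))
colnames(parts) <- paste0("title.", 1:4)

# data.frame() expands the matrix into one column per matrix column
out <- data.frame(geo_accession = ids, parts, stringsAsFactors = FALSE)

str(out)
```

Unlike read.csv, this keeps every piece as character, so title.1 is "2818" rather than the integer 2818. It also assumes every title splits into exactly four pieces.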

Or, as a compact option, use cSplit from splitstackshape

library(splitstackshape)
cSplit(facs, "title", " - ")

【Discussion】:

Thanks for the great explanation and the additional approaches!