How to convert text separated with '-' to a data frame

[Question title]: How to convert text separated with '-' to dataframe
[Posted]: 2017-02-12 08:37:01
[Question description]:

Data

I have a text file containing data like this:

PMID- 26460049
AB  - Psoriasis is an inflammatory skin disease affecting 2–3% of the population.
AID - 201509971 [pii]
AID - 10.1073/pnas.1509971112 [doi]

PMID- 26999594
AB  - Psoriasis is an autoimmune skin disease that is associated with aberrant activity of immune cells and keratinocytes.
AID - PONE-D-15-54115 [pii]
AID - 10.1371/journal.pone.0151913 [doi]

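There are more fields and records, but the pattern is always the same: a variable name, a '-' separator, and then the data value, which is some text that may run over several lines.

Desired output

I would like to create a data frame like this: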

PMID      AB                AID                    AID
26460049  Psoriasis is...   201509971 [pii]        10.1073/pnas.1509971112 [doi]
26999594  Psoriasis is...   PONE-D-15-54115 [pii]  10.1371/journal.pone.0151913 [doi]

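Basically, transpose the data and treat the '-' separator as the boundary between the variable name and the data value.

Question

How can I transpose my file and separate the variable names from the data values?

[Comments]:

Please provide your data using dput(your_data).
It looks like - appears both as the column separator and as part of the character fields. If there is always a space after the - that follows the column identifier (i.e. "- "), then @tobiasegli_te's answer will work. Using read.csv("/dir/data.csv", sep="- ") would be shorter, provided your columns are separated by "- " rather than "-".
@MikeRSpencer, I think read.csv only accepts a single-character sep argument. I'm also not sure how it would handle the blank lines in the data.
Very true about the blank lines! I hadn't paid much attention to them. My comment about - appearing as a data character still stands.
@MikeRSpencer Absolutely valid, and as you say, sep = "- " would be great if only read.csv accepted it. I went with readLines followed by strsplit, which is probably a lot slower and definitely "clunkier".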
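
Since read.csv will not take a two-character sep, one way to split on the tag separator only is a regular expression anchored at the start of the line, which leaves any '-' inside the value alone. A minimal sketch, assuming tags consist of capital letters or digits, optionally padded with spaces and followed by "- " (inferred from the sample, not from a format specification); pattern, tag and value are illustrative names:

```r
# Two lines copied from the question; the second value contains '-' itself.
lines <- c("PMID- 26460049", "AID - PONE-D-15-54115 [pii]")

# Assumed tag format: capitals/digits, optional space padding, then "- ".
pattern <- "^([A-Z0-9]+) *- (.*)$"

tag   <- sub(pattern, "\\1", lines)  # text before the separator
value <- sub(pattern, "\\2", lines)  # everything after the first "- "

data.frame(tag, value, stringsAsFactors = FALSE)
```

Because the pattern is anchored with ^ and the tag class excludes '-', only the first separator on each line is consumed.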

Tags: r text dataframe data-conversion

[Solution 1]:

Assuming the rest of your text file has the same structure, you could do:

txt <- readLines('input.txt')
dat <- data.frame(pmid = sub('PMID- ', '', txt[grep('PMID-',txt)]),
                  ab = sub('AB  - ', '', txt[grep('AB  -',txt)]),
                  pii = sub('AID - ', '', txt[grep('[pii]',txt,fixed=TRUE)]),
                  doi = sub('AID - ', '', txt[grep('[doi]',txt,fixed=TRUE)]))

which gives:

> dat
      pmid                                                                                                                   ab                   pii                                doi
1 26460049                                          Psoriasis is an inflammatory skin disease affecting 2–3% of the population.       201509971 [pii]      10.1073/pnas.1509971112 [doi]
2 26999594 Psoriasis is an autoimmune skin disease that is associated with aberrant activity of immune cells and keratinocytes. PONE-D-15-54115 [pii] 10.1371/journal.pone.0151913 [doi]

Another possible solution, using the data.table and splitstackshape packages:

txt <- readLines('input.txt')

library(splitstackshape) # which also loads 'data.table'
DT <- data.table(txt = txt)[txt!='']
DT <- cSplit(DT, 'txt', sep = '- ', 'wide')

dcast(DT[, `:=` (pmid = txt_2[txt_1=='PMID'][cumsum(txt_1=='PMID')], 
                 txt_1 = sprintf('%s%s',txt_1,c('','','1','2')))][txt_1!='PMID'],
      pmid ~ txt_1, value.var = 'txt_2')
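
The question mentions that a value may span several lines, which neither approach above handles as written. A minimal sketch of folding such lines into the field they belong to before parsing, assuming continuation lines start with whitespace (the sample does not show one, so this is a guess at the layout); the toy input and the names is_cont, field_id and folded are made up for illustration:

```r
# Toy input (made up): the AB value continues on an indented second line.
txt <- c("PMID- 1",
         "AB  - first line of the abstract",
         "      second line of the abstract",
         "AID - x [pii]")

is_cont  <- grepl("^[[:space:]]", txt)  # TRUE for indented continuation lines
field_id <- cumsum(!is_cont)            # a continuation keeps the id of its field

# Paste each field's lines together, trimming the indentation first.
folded <- unname(vapply(split(trimws(txt), field_id),
                        paste, character(1), collapse = " "))
folded
```

The folded vector has one element per field and can then be passed to either solution in place of the raw readLines() result.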

[Discussion]:

[Solution 2]:

Assuming the number, order, and names of your columns are consistent across all entries, I would do this:

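data <- readLines("~/Desktop/data.txt")

# Each record starts at a line beginning with "PMID"
PMID_ind <- grep("^PMID", data)
data_new <- t(sapply(PMID_ind, function(i_PMID) {
    # A record is the PMID line plus the three lines that follow it
    entry_tmp <- data[i_PMID:(i_PMID + 3)]
    # Keep the text after the "- " separator
    sapply(entry_tmp, function(split_tmp) {
        strsplit(split_tmp, "- ")[[1]][2]
    }, USE.NAMES = FALSE)
}))
colnames(data_new) <- c("PMID", "AB", "AID1", "AID2")
data_new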
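
If records do not all have the same number of lines, the fixed offset of 3 no longer holds. A minimal sketch of grouping lines into records by counting PMID lines instead, assuming every record starts with a PMID line; the toy input and the names record_id and records are made up for illustration:

```r
# Toy input (made up): the first record has one AID line, the second has two.
data <- c("PMID- 1", "AB  - first abstract", "AID - a [pii]", "",
          "PMID- 2", "AB  - second abstract", "AID - b [pii]", "AID - c [doi]")

data <- data[nzchar(data)]                  # drop the blank separator lines
record_id <- cumsum(grepl("^PMID-", data))  # increases by 1 at each PMID line
records <- split(data, record_id)           # one character vector per record

lengths(records)
```

Each element of records can then be parsed on its own, however many fields it contains.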

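[Discussion]:

[Solution 3]:

I see another answer has popped up, but I'll post this anyway since I've already written it. I'm almost certain this isn't the best approach, but it does work: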

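new <- readLines( file )  # `file` stands for the path to your text file
# Split each line into the variable name and its value
new <- strsplit( new, split = "- " )

new <- data.frame( type = sapply( new, "[", 1 ),
                   data = sapply( new, "[", 2 ),
                   stringsAsFactors = FALSE )

# Drop the blank separator lines, which come through as NA
new <- new[ !is.na( new$type ), ]

# Position of each line within its record; recycled across all records
new$column.number <- c( 1,2,3,4 )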

output <- data.frame(
    PMID = new$data[ new$column.number == 1 ],
    AB = new$data[ new$column.number == 2 ],
    AID.1 = new$data[ new$column.number == 3 ],
    AID.2 = new$data[ new$column.number == 4 ],
    stringsAsFactors = FALSE
)

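As in @tobiasegli_te's answer, I'm assuming that the number of "columns" is constant, and so is the order in which those columns appear in the data. That assumption is needed because two of your "columns" have the same name, which means the only (?) way to associate them correctly is to treat their position (among the other data) as significant. I don't like this, because it a) makes it hard to write a general solution and b) is prone to serious errors.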
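
One way to rely less on position is to number repeated tags within each record, so that a second AID becomes its own column whatever else surrounds it. A minimal base R sketch, assuming every record starts with a PMID line and every line matches the tag pattern below (both inferred from the sample); the abstracts are shortened here, and the names long, wide and column are illustrative:

```r
# Sample records from the question, with the abstracts shortened.
txt <- c("PMID- 26460049",
         "AB  - Psoriasis is an inflammatory skin disease.",
         "AID - 201509971 [pii]",
         "AID - 10.1073/pnas.1509971112 [doi]",
         "",
         "PMID- 26999594",
         "AB  - Psoriasis is an autoimmune skin disease.",
         "AID - PONE-D-15-54115 [pii]",
         "AID - 10.1371/journal.pone.0151913 [doi]")
txt <- txt[nzchar(txt)]                     # drop blank separator lines

pattern <- "^([A-Z0-9]+) *- (.*)$"          # assumed tag format
long <- data.frame(record = cumsum(grepl("^PMID-", txt)),
                   tag    = sub(pattern, "\\1", txt),
                   value  = sub(pattern, "\\2", txt),
                   stringsAsFactors = FALSE)

# Number repeated tags within a record: AID, AID.2, ...
n <- ave(seq_along(long$tag), long$record, long$tag, FUN = seq_along)
long$column <- ifelse(n == 1, long$tag, paste(long$tag, n, sep = "."))

# Fill a record-by-column matrix; fields missing from a record stay NA.
cols <- unique(long$column)
wide <- matrix(NA_character_, nrow = max(long$record), ncol = length(cols),
               dimnames = list(NULL, cols))
wide[cbind(long$record, match(long$column, cols))] <- long$value
wide <- as.data.frame(wide, stringsAsFactors = FALSE)
wide
```

The order of repeated tags within a record still decides which value lands in AID and which in AID.2, so this reduces the dependence on position rather than removing it.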

[Discussion]: