Unstructured textual data to data frame: answers

【Question title】: Unstructured textual Data to Data Frame
【Posted】: 2014-10-01 13:24:27
【Question description】:

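I am trying to convert a few lines of this text data into a data frame using R. I have not been able to use read.delim effectively. I would like all of these rows populated into 10 fixed columns, separated by ":". Thank you.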

*** 
Type:status
Origin: abc
Text: abc
URL: 
ID: 123
Time: Fri Jul 22 15:07:37 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 
Hashtags: 
***
***
Type:status
Origin: cde
Text: rty
URL: http://ocs
ID: 456
Time: Thu Jul 21 14:09:47 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 
Hashtags:  rty
***
***
.
..
...

【Question comments】:

Tags: r text mining

【Solution 1】:

Something like this might work:

a <- readLines(textConnection("
*** 
Type:status
Origin: abc
Text: abc
URL: 
ID: 123
Time: Fri Jul 22 15:07:37 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 
Hashtags: 
***
***
Type:status
Origin: cde
Text: rty
URL: http://ocs
ID: 456
Time: Thu Jul 21 14:09:47 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 
Hashtags:  rty
***
***"))


ids <- c("Type", "Origin", "Text", "URL", "ID", "Time", "RetCount", "Favorite", "MentionedEntities", "Hashtags")

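# Anchor each key at the start of the line and strip only the key, so values
# that themselves contain ":" (the URL and the time) are kept whole.
sapply(ids, function(id) trimws(sub("^[^:]*:", "", a[grepl(paste0("^", id, ":"), a)])))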

【Comments】:

【Solution 2】:

Read the file in as a single string using this answer: https://stackoverflow.com/a/9069670/1412059

Now process the string like this:

text <- "*** 
Type:status
Origin: abc
Text: abc
URL: 
ID: 123
Time: Fri Jul 22 15:07:37 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 
Hashtags: 
***
***
Type:status
Origin: cde
Text: rty
URL: http://ocs
ID: 456
Time: Thu Jul 21 14:09:47 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 
Hashtags:  rty
***
***"

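# Replace ":" with ";" to get a clean separator. The lookbehind skips colons
# that follow a digit (inside the time) or a "p" (the end of "http").
text <- gsub("(?<![[:digit:]p]):", ";", text, perl=TRUE)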

dat <- read.table(text=text, sep=";", comment.char = "*", fill=TRUE)
dat$id <- rep(seq_len(nrow(dat)/10), each=10)
library(reshape2)
dcast(dat, id~V1, value.var="V2")
#  id Favorite Hashtags   ID MentionedEntities Origin RetCount Text                          Time   Type         URL
#1  1    false           123                      abc        0  abc  Fri Jul 22 15:07:37 CDT 2011 status            
#2  2    false      rty  456                      cde        0  rty  Thu Jul 21 14:09:47 CDT 2011 status  http://ocs
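
For the first step, reading the whole file into one string, a minimal sketch is shown here. It assumes the linked answer's approach is the usual readChar idiom; the temporary file and its contents are made up for illustration.

```r
# Write a small sample file so the sketch is self-contained (hypothetical data).
path <- tempfile(fileext = ".txt")
writeLines(c("***", "Type:status", "Origin: abc", "***"), path)

# Read the entire file into a single string.
# file.info(path)$size gives the number of bytes to read.
text <- readChar(path, file.info(path)$size)
```

The resulting text is a length-one character vector, so it can be passed straight to the gsub and read.table(text = ...) calls above.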

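【Comments】:

How can I make sure that colons popping up inside Text or Origin are read correctly? For example, Origin: cde: andgg and Text: rty: asdnf

【Solution 3】:

Here is a function that seems to do the job. Instead of a delimiter, it uses readLines and a couple of regular expressions.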

readData <- function(file, stringsAsFactors = TRUE) 
{
    rl <- readLines(file)                        ## read the file
    rl2 <- rl[!grepl("[*]+", rl)]                ## remove the '***' elements
    sub <- sub("^[A-Za-z]+[:]( ?)+", "", rl2)    ## make the row data
    mat <- matrix(sub, ncol = 10, byrow = TRUE,  ## create a matrix
        dimnames = list(NULL, gsub("[:](.*)", "", rl2[1:10])))  
    as.data.frame(mat, stringsAsFactors = stringsAsFactors)
}

Here it is run on your data, where the file "new.txt" was created from your sample data.

readData("new.txt")
#     Type Origin Text        URL  ID                         Time RetCount Favorite MentionedEntities Hashtags
# 1 status    abc  abc            123 Fri Jul 22 15:07:37 CDT 2011        0    false                           
# 2 status    cde  rty http://ocs 456 Thu Jul 21 14:09:47 CDT 2011        0    false                        rty
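
Every column comes back as text (or as a factor with the default stringsAsFactors = TRUE). If typed columns are wanted, one possible follow-up is to let type.convert guess each column. This is only a sketch: the small data frame below is a hand-made stand-in for the readData output, not the result of running the function.

```r
# A stand-in for the readData() output: every column is a string.
df <- data.frame(
  Type     = c("status", "status"),
  ID       = c("123", "456"),
  RetCount = c("0", "0"),
  Favorite = c("false", "false"),
  stringsAsFactors = FALSE
)

# Convert each column to the most specific type R can infer.
# as.is = TRUE keeps non-numeric, non-logical text as character, not factor.
df[] <- lapply(df, type.convert, as.is = TRUE)

str(df)
```

If the IDs should stay as text (for example because they are too long to be integers), that column can simply be left out of the conversion.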

【Comments】:

【Solution 4】:

You could also do this (using a from @rengis):

If your URLs use both http and https:

text1 <- gsub("(?<=[0-9]|http|https):(*SKIP)(*F)|:", ";", a, perl=TRUE)
text2 <- text1[!grepl("\\*|^$", text1)]
res <- do.call(data.frame,c(split(gsub(".*; ?", "", text2),
              gsub(";.*", "", text2)), stringsAsFactors=FALSE))


res
#   Favorite Hashtags  ID MentionedEntities Origin RetCount Text
#1    false          123                      abc        0  abc
#2    false      rty 456                      cde        0  rty
#                         Time   Type        URL
#1 Fri Jul 22 15:07:37 CDT 2011 status           
#2 Thu Jul 21 14:09:47 CDT 2011 status http://ocs

Or use cSplit:

library(data.table)
library(devtools)
source_gist(11380733)

 DT <- cSplit(as.data.frame(text2), "text2",";", "wide")[,
                                n:= seq_len(.N), by=text2_1]

 dcast.data.table(DT, n~text2_1, value.var="text2_2")
 #    n Favorite Hashtags   ID MentionedEntities Origin RetCount Text
 # 1: 1    false           123                      abc        0  abc
 # 2: 2    false      rty  456                      cde        0  rty
 #                            Time   Type         URL
 #1:  Fri Jul 22 15:07:37 CDT 2011 status            
 #2:  Thu Jul 21 14:09:47 CDT 2011 status  http://ocs

Update

Based on the new information, i.e. colons that pop up inside the values:

 a <- readLines(textConnection("
*** 
Type:status
Origin: abc
Text: abc
URL: 
ID: 123
Time: Fri Jul 22 15:07:37 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 
Hashtags: 
***
***
Type:status
Origin: cde: andgg
Text: rty: asndf
URL: http://ocs
ID: 456
Time: Thu Jul 21 14:09:47 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 
Hashtags:  rty
***
***"))

text1 <- gsub("(?<=[0-9]|http|https):(*SKIP)(*F)|^([^:]+):(.*)",
                                             "\\1;\\2", a, perl=TRUE)
text2 <- text1[!grepl("\\*|^$", text1)]
splitGroup <- sub(";.*", "", text2)

res <- do.call(data.frame,c(split(gsub(".*; ?", "", text2),
             factor(splitGroup, levels=unique(splitGroup))), stringsAsFactors=FALSE))
res
#    Type     Origin       Text        URL  ID                         Time
#1 status        abc        abc            123 Fri Jul 22 15:07:37 CDT 2011
#2 status cde: andgg rty: asndf http://ocs 456 Thu Jul 21 14:09:47 CDT 2011
#   RetCount Favorite MentionedEntities Hashtags
#1        0    false                           
#2        0    false                        rty
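
As an alternative sketch in base R, the intermediate ";" separator can be skipped by splitting each line at its first colon only. This assumes every record contains every key exactly once and in the same order; the shortened sample lines and the variable names (lines, kv, key, value) are made up for illustration, and trimws needs R 3.2.0 or later.

```r
# Sample lines in the question's layout; some values contain extra colons.
lines <- c(
  "***", "Type:status", "Origin: abc", "Text: abc", "URL: ",
  "Time: Fri Jul 22 15:07:37 CDT 2011", "***",
  "***", "Type:status", "Origin: cde: andgg", "Text: rty: asndf",
  "URL: http://ocs", "Time: Thu Jul 21 14:09:47 CDT 2011", "***"
)

# Keep only the "key: value" lines (this drops the "***" markers).
kv <- lines[grepl("^[A-Za-z]+:", lines)]

# Split at the FIRST colon only: regex matching is leftmost, so ":.*$"
# starts at the first colon, and "^[^:]*:" ends at it.
key   <- sub(":.*$", "", kv)
value <- trimws(sub("^[^:]*:", "", kv))

# One column per key, in order of first appearance; one row per record.
keys <- unique(key)
res <- as.data.frame(
  lapply(setNames(keys, keys), function(k) value[key == k]),
  stringsAsFactors = FALSE
)
```

Because only the first colon is treated as the separator, later colons in Origin, Text, URL and Time stay part of the value without any special-casing for digits or http.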

【Comments】:
