R: reading a comma-delimited txt file with commas inside one column

【Question title】: R read comma delimited txt file with comma inside one column
【Posted】: 2014-09-30 08:46:24
【Question description】:

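I have some logs of users' browsing behavior. They come from a data collector who evidently used commas to separate the variables. However, some URLs do contain commas, so I cannot read the txt file into R.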

20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise
20096,2009-06-30 07:51:38,7,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1,search1.taobao.com,shopping,e-commerce,C2C
2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown
20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank

The URLs in the lines above should be:

http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1

https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp

How can I tell R that each line has exactly 10 variables and that the extra commas belong inside the URL? Thanks!

df <- read.table('2009.txt', sep= ',', quote= '', comment.char= '', stringsAsFactors= F)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : line 130 did not have 10 elements
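
Before repairing anything, it can help to see which lines actually have extra fields. The following is a minimal sketch using `count.fields()` from base R with the same `quote` and `comment.char` settings as the `read.table()` call above. The two sample lines and the temporary file are made up for illustration and only stand in for '2009.txt'.

```r
# Hypothetical two-line sample standing in for '2009.txt'
sample_lines <- c(
  "1,2009-06-02 22:06:14,84,a.com,www.a.com,http://www.a.com/x,y.htm,www.a.com,shopping,e-commerce,C2C",
  "2,2009-06-25 14:40:39,6,b.com,www.b.com,,,unknown,unknown,unknown"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# Number of comma-separated fields on each line
n_fields <- count.fields(tmp, sep = ",", quote = "", comment.char = "")

table(n_fields)                 # distribution of field counts
bad <- which(n_fields != 10)    # line numbers that do not have 10 fields
readLines(tmp)[bad]             # inspect the offending lines
```

Lines that report more than 10 fields are the ones whose URL contains commas. Lines with fewer than 10 would point to a different problem that the answers below do not address.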

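【Question comments】:

I assume you cannot fix this at the source (have it supply valid CSV data)?
Also, can a line contain more than one field that starts with http?
Thanks @TimPietzcker! No... I cannot change the source. Otherwise I would have asked him to use tabs instead of asking on SO. I got the file second-hand from someone who also got it second-hand... it is complicated. There is only one URL column, and I do not think there will be two https entries in a line. Unless someone visited ftp://nas.myserv.ip... I have not come across that yet.
The last row has 14 columns. I'm almost there, but what should the last row look like?
@RichardScriven No, the URL is http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT, the commas are inside the URL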

Tags: r url csv

【Solution 1】:

You can try:

  dat <- read.table(text=gsub("http:.*(?=(,www)|,,)(*SKIP)(*F)|,", "*",
           Lines, perl=TRUE), sep="*", header=FALSE, stringsAsFactors=FALSE)


  dat
  #    V1                  V2 V3           V4                            V5
  #1 20091 2009-06-02 22:06:14 84   taobao.com            search1.taobao.com
  #2 20092 2009-06-16 12:25:35  8     sohu.com              www.wap.sohu.com
  #3 20092 2009-06-07 16:02:03 14 eetchina.com www.powersystems.eetchina.com
  #                     V6
  #1               http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
  #2       http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
  #3 http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
  #            V7       V8            V9        V10
  #1 www.taobao.com shopping    e-commerce        C2C
  #2   www.sohu.com   portal entertainment     mobile
  #3                  others     marketing enterprise

Data

 Lines <-  readLines(textConnection(txt)) #(`txt` from @Richard Scriven)

Update

Using your new dataset

 indx <- grep("http", Lines)
 Lines1 <- Lines[indx]
 pat1 <- paste(unique(gsub(".*http[s]?.{3}(\\w+)\\..*", "\\1", Lines1)), collapse="|")
 pat1N <-  paste0("http:.*(?=,(", pat1, "|,))(*SKIP)(*F)|,") 

 dat1 <-  read.table(text=gsub(pat1N, "*", Lines, perl=TRUE),
                   sep="*", header=FALSE, stringsAsFactors=FALSE)

 dat1
 #           V1                  V2 V3           V4                            V5
 #1   20091 2009-06-02 22:06:14 84   taobao.com            search1.taobao.com
 #2   20092 2009-06-16 12:25:35  8     sohu.com              www.wap.sohu.com
 #3   20092 2009-06-07 16:02:03 14 eetchina.com www.powersystems.eetchina.com
 #4   20096 2009-06-30 07:51:38  7   taobao.com            search1.taobao.com
 #5 2009184 2009-06-25 14:40:39  6  mktginc.com              surv.mktginc.com
 #6   20092 2009-06-07 15:13:06 32   ccb.com.cn          ibsbjstar.ccb.com.cn
 #                                     V6
 # 1                                            http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
 # 2                                    http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
 # 3                              http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
 # 4 http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1
 #5                                                                                                                                                                         
 #6                                                                                                                       https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp
#                 V7       V8            V9        V10
#1     www.taobao.com shopping    e-commerce        C2C
#2       www.sohu.com   portal entertainment     mobile
#3                      others     marketing enterprise
#4 search1.taobao.com shopping    e-commerce        C2C
#5                     unknown       unknown    unknown
#6                      e-bank       finance     e-bank

Data

 txt <- '20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise
20096,2009-06-30 07:51:38,7,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1,search1.taobao.com,shopping,e-commerce,C2C
2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown
20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank'

  Lines <- readLines(textConnection(txt))
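
The patterns above rely on knowing what follows the URL (a `www` host, an empty field, or one of the domain names collected in `pat1`). If, as the question states, every line has exactly 10 fields and only the sixth can contain commas, a pattern anchored on field positions avoids that dependency. This is a minimal sketch under that assumption, not part of the original answer. `parse_by_position` and `sample_lines` are hypothetical names, and the sample lines are copied from the question.

```r
# Hypothetical helper: split each line into exactly 10 fields, letting
# only field 6 (the URL) absorb any extra commas.
parse_by_position <- function(lines) {
  field <- "([^,]*)"                             # a field without commas
  pattern <- paste0(
    "^", paste(rep(field, 5), collapse = ","),   # fields 1-5
    ",(.*),",                                    # field 6, commas allowed
    paste(rep(field, 4), collapse = ","), "$"    # fields 7-10
  )
  m <- regmatches(lines, regexec(pattern, lines))
  ok <- lengths(m) == 11                         # full match + 10 groups
  if (!all(ok)) {
    warning("lines with fewer than 10 fields: ",
            paste(which(!ok), collapse = ", "))
  }
  # Drop the full match and stack the 10 captured fields row by row
  fields <- do.call(rbind, lapply(m[ok], function(x) x[-1]))
  as.data.frame(fields, stringsAsFactors = FALSE)
}

sample_lines <- c(
  "20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise",
  "2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown",
  "20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank"
)

dat2 <- parse_by_position(sample_lines)
dat2$V6
```

Because the first five and last four fields may not contain a comma, the middle group is forced to cover everything between the fifth comma and the fourth comma from the end. It therefore makes no difference whether the URL is http, https, or empty.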

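【Comments】:

I must say, this is impressive
@RichardScriven @akrun Thanks to both of you! I understand both of your scripts. The production data is messy. It also has empty URL cells, e.g. 2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown. So both of your scripts throw errors. akrun's says "line 309 did not have 10 elements", while Richard's fails because g is empty: 'Error in g:(length(x) - 4) : argument of length 0'. I may try Josh's if approach.
@leoce My script was based on the example you provided. It is better to show an example that mimics the original dataset. If you have rows without a URL cell, I would suggest that you grep those rows out and use the script on the remaining rows. Later, join them back together using the index.
@akrun Thanks! I have updated my question. It seems your regex pattern does not fit line 4 of my question (line 2138 in the file). Is it because there are too many dashes? One more thing: I suggest putting comment.char = '' in the function call, because some URLs contain #.
@akrun A million thanks for your effort! I just tried your script on a test set. Then I found that it cannot handle https... (also updated in the question). Maybe using a regex is not a good idea, since we cannot exhaust all the possibilities... unless I read through all 115,280 lines.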

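【Solution 2】:

If you read the data so that each line is a single string (e.g. with sep="\n"), then you can process each line directly before putting it into a proper data frame.

If only the 6th entry can contain commas (it looks like the other URL is just the main domain), then something like the following might work: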

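# d is a character vector holding one log line per element,
# e.g. d <- readLines('2009.txt')
d <- strsplit(d, ",")

for (i in seq_along(d)) {
  x <- d[[i]]
  n <- length(x)
  if (n > 10) {
    # fields 6..(n-4) are pieces of the URL; glue them back together
    d[[i]] <- c(x[1:5], paste(x[6:(n-4)], collapse=","), x[(n-3):n])
  }
}

# stack the 10-element vectors into a 10-column character matrix
d <- do.call(rbind, lapply(d, matrix, ncol=10, byrow=TRUE))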

If the other URL can also be problematic, this approach may still work, but it could get quite complicated.
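
For an end-to-end run, the same split-and-rejoin idea can be wrapped in a function that starts from a file or connection and returns a data frame. This is a rough sketch rather than the answerer's code. `read_log_lines`, its parameters, and `log_text` are hypothetical names, and it assumes that only the URL column ever contains commas.

```r
# Hypothetical wrapper around the split-and-rejoin approach
read_log_lines <- function(con, n_fields = 10, url_col = 6) {
  lines <- readLines(con)
  n_after <- n_fields - url_col          # fields that follow the URL
  # strsplit() drops a trailing empty field, so append one separator first
  parts <- strsplit(paste0(lines, ","), ",", fixed = TRUE)
  rows <- lapply(parts, function(x) {
    n <- length(x)
    if (n < n_fields) stop("a line has fewer than ", n_fields, " fields")
    c(x[seq_len(url_col - 1)],                          # fields before the URL
      paste(x[url_col:(n - n_after)], collapse = ","),  # the URL, rejoined
      x[(n - n_after + 1):n])                           # fields after the URL
  })
  as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
}

# Two lines copied from the question: one URL with a comma, one empty URL
log_text <- c(
  "20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C",
  "2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown"
)

con <- textConnection(log_text)
dat3 <- read_log_lines(con)
close(con)
dim(dat3)
```

All columns come back as character. Converting the id and count columns (for example with `as.integer()`) is left to the caller, and for a real file `con` can simply be the file path.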

【Comments】:

If you think it complements your script, please add my update. Thanks, @Josh! http://stackoverflow.com/review/suggested-edits/5892285

【Solution 3】:

This looks like it might work for you.

txt <- '20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
  20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
  20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise'

readLog <-  function(file, stringsAsFactors = TRUE)
{
    s <- strsplit(readLines(file), ",")
    loop <- t(sapply(s, function(x) {
            g <- grep("http", x)
            x[g] <- paste(x[g:(length(x)-4)], collapse = ",")
            x[-c((g+1):(length(x)-4))]
        }))
    data.frame(loop, stringsAsFactors = stringsAsFactors)
}
## readLog(textConnection(txt))
readLog(yourFile)

This gives the following in column 6, with 10 columns in each row

                                                                        V6
1               http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
2       http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
3 http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT

Columns 7 to 10 are

              V7       V8            V9        V10
1 www.taobao.com shopping    e-commerce        C2C
2   www.sohu.com   portal entertainment     mobile
3                  others     marketing enterprise

【Comments】: