【Question title】: reading badly formed csv in R - mismatched quotes
【Posted】: 2013-03-27 15:56:45
【Question】:

I have hundreds of large CSV files (each between 10k and 100k lines), and in some of them the description field is badly formed, with quotes inside quotes, so they might look like

ID,Description,x
3434,"abc"def",988
2344,"fred",3484
2345,"fr""ed",3485
2346,"joe,fred",3486

I need to be able to parse all of these lines cleanly as CSV in R. dput()'ing it and reading...

txt <- c("ID,Description,x",
    "3434,\"abc\"def\",988",
    "2344,\"fred\",3484", 
    "2345,\"fr\"\"ed\",3485",
    "2346,\"joe,fred\",3486")

read.csv(text=txt[1:4], colClasses='character')
    Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
      incomplete final line found by readTableHeader on 'text'

If we change the quote setting and leave out the last line with the embedded comma, it works fine

read.csv(text=txt[1:4], colClasses='character', quote='')

However, if we change the quote setting and include the last line with the embedded comma...

read.csv(text=txt[1:5], colClasses='character', quote='')
    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
      line 1 did not have 4 elements

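EDIT x2: I should have said that, unfortunately, some of the descriptions contain commas; the code above has been edited accordingly.

【Comments】:

What if there are 32 columns and only some of them are quoted? Should I ask another question?

Tags: r parsing csv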

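【Solution 1】:

Since there is only one quoted column in this nasty set of files, I can run read.csv() on each side of it to handle the unquoted columns to the left and right of the quoted column. My current solution is based on information from @agstudy and @roland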

csv.parser <- function(txt) {
    df <- do.call('rbind', regmatches(txt,gregexpr(',"|",',txt),invert=TRUE))
    # remove the header
    df <- df[-1,]
    # parse the left csv
    df1 <- read.csv(text=df[,1], colClasses='character', comment='', header=FALSE)
    # parse the right csv
    df3 <- read.csv(text=df[,3], colClasses='character', comment='', header=FALSE)
    # put them back together
    dfa <- cbind(df1, df[,2], df3)
    # put the header back in
    names(dfa) <- names(read.csv(text=txt[1], header=TRUE))
    dfa
}

# debug(csv.parser)
csv.parser(txt)

Thankfully, it also runs on a wider data set.

txt <- c("ID,Description,x,y",
         "3434,\"abc\"def\",988,344",
         "2344,\"fred\",3484,3434", 
         "2345,\"fr\"\"ed\",3485,7347",
         "2346,\"joe,fred\",3486,484")
csv.parser(txt)
    ID Description    x    y
1 3434     abc"def  988  344
2 2344        fred 3484 3434
3 2345      fr""ed 3485 7347
4 2346    joe,fred 3486  484

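【Comments】:

+1! I even think you could do this with a single regular expression, something like ..,"|",|[0-9],[0-9] (I have not tested it).

【Solution 2】:

You can use readLines and then regmatches to extract the elements between ," and ",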

ll <- readLines(textConnection(object='ID,Description,x
  3434,"abc"def",988
2344,"fred",3484
2345,"fr""ed",3485
2346,"joe,fred",3486'))
ll<- ll[-1]     ## remove the header
ll <- regmatches(ll,gregexpr(',"|",',ll),invert=TRUE)
do.call(rbind,ll)
       [,1]     [,2]       [,3]  
[1,] "  3434" "abc\"def" "988" 
[2,] "2344"   "fred"     "3484"
[3,] "2345"   "fr\"\"ed" "3485"
[4,] "2346"   "joe,fred" "3486"

【Comments】:

Thanks, but what if there are 32 columns and only some of them are quoted?
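
One possible way to approach the many-columns case raised in the comment above is a per-line tokenizer. This is only a sketch under an assumption that does not come from the thread: a quoted field opens with a quote directly after a delimiter and closes at the first quote directly followed by a comma or the end of the line. The names split_loose_csv and parse_loose_csv are hypothetical helpers. A description that itself contains the sequence ", would still be split in the wrong place, and doubled quotes are left as they are.

```r
# Hypothetical helper: split one CSV line into fields, tolerating stray quotes.
split_loose_csv <- function(line) {
  # Prepend a comma so every field, including the first, follows a delimiter.
  padded <- paste0(",", line)
  # A field is either "..." closed by a quote that sits right before a comma
  # or the end of the line, or a run of characters containing no comma.
  pattern <- ',(?:".*?"(?=,|$)|[^,]*)'
  fields <- regmatches(padded, gregexpr(pattern, padded, perl = TRUE))[[1]]
  fields <- sub("^,", "", fields)   # drop the leading delimiter
  sub('^"(.*)"$', "\\1", fields)    # drop the enclosing quotes, if any
}

# Hypothetical helper: apply the splitter to every line and build a data frame.
parse_loose_csv <- function(txt) {
  rows <- lapply(txt, split_loose_csv)
  # Every line must yield the same number of fields, otherwise stop.
  stopifnot(length(unique(lengths(rows))) == 1)
  df <- as.data.frame(do.call(rbind, rows[-1]), stringsAsFactors = FALSE)
  names(df) <- rows[[1]]   # the first line is the header
  df
}

txt <- c("ID,Description,x,y",
         "3434,\"abc\"def\",988,344",
         "2344,\"fred\",3484,3434",
         "2345,\"fr\"\"ed\",3485,7347",
         "2346,\"joe,fred\",3486,484")
df <- parse_loose_csv(txt)
```

Because the pattern does not care which column a field sits in, any number of columns may be quoted or unquoted. The stopifnot() check is there so that a line the heuristic cannot split consistently fails loudly instead of silently shifting columns.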

【Solution 3】:

Change the quote setting (this works as long as no description contains a comma):

read.csv(text=txt, colClasses='character',quote = "")

    ID Description    x
1 3434   "abc"def"  988
2 2344      "fred" 3484
3 2345    "fr""ed" 3485
4 2346       "joe" 3486

Edited to handle the stray commas:

txt <- c("ID,Description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484", 
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")

txt2 <- readLines(textConnection(txt)) 

txt2 <- strsplit(txt2,",")

txt2 <- lapply(txt2,function(x) c(x[1],paste(x[2:(length(x)-1)],collapse=","),x[length(x)]) )
m <- do.call("rbind",txt2)
df <- as.data.frame(m,stringsAsFactors = FALSE)
names(df) <- df[1,]
df <- df[-1,]

#     ID Description    x
# 2 3434   "abc"def"  988
# 3 2344      "fred" 3484
# 4 2345    "fr""ed" 3485
# 5 2346  "joe,fred" 3486

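I don't know whether this is efficient enough for your use case.

【Comments】:

Does the job nicely; accuracy matters more than speed!
You should only use this when read.csv raises an error, and the lapply loop can be parallelized (e.g. with mclapply).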
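
The parallelization mentioned in the last comment could look roughly like the sketch below. It assumes, as Solution 3 does, that only the middle column can contain commas. The helper name rejoin_middle and the choice of two cores are illustrative and not from the thread.

```r
library(parallel)

txt <- c("ID,Description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")

# Hypothetical helper: keep the first and last pieces and glue the middle
# pieces back together with the commas that strsplit() removed.
rejoin_middle <- function(x) {
  n <- length(x)
  c(x[1], paste(x[2:(n - 1)], collapse = ","), x[n])
}

# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
pieces <- mclapply(strsplit(txt, ","), rejoin_middle, mc.cores = n_cores)

m <- do.call(rbind, pieces)
df <- as.data.frame(m[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- m[1, ]   # the first row holds the header
```

For files of 10k to 100k lines the per-line work is small, so it may pay off more to parallelize over files than over lines. That is a guess rather than something measured here.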