Reading England and Wales Charity Commission bcp files in R: answers

【Question title】: Reading England and Wales Charity Commission bcp files in R
【Posted】: 2021-06-09 16:15:18
【Question】:

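I am trying to read into R the .bcp files available at https://register-of-charities.charitycommission.gov.uk/register/full-register-download. I have been trying the approach from a question previously answered here, but readChar does not seem to read everything in all of the files, i.e. it breaks on extract_charity.bcp.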

So I thought of readBin and tried reading extract_charity.bcp like this:

library(stringr)

b <- readBin("extract_charity.bcp", "character", n = 300000, size = NA_integer_,
             endian = .Platform$endian)

c<- paste0(b, collapse = "" ) #put it back as one large character string

d<- str_locate_all(c, "\\*\\@\\@\\*\\d") #find row breaks followed by a digit

e <- d[[1]]

flags <- e[,1]

f <- c()

f[1] <- substr(c, 1, flags[1]-1)

for (i in 2:length(flags)) f[i]<- substr(c, flags[i-1]+4, flags[i]-1) #removes row breaks

export <- matrix(nrow = 372432, ncol = 18)
exportF <- matrix(nrow = 0, ncol = 18)

for (j in 1:length(flags)) {
  new_row <- str_split( f[j], "\\@\\*\\*\\@" )[[1]] #removes column breaks
  if (length(new_row)==18) { export[j, ] <- new_row #if correct number of columns
  } else {  print(j)
            exportF <- rbind(exportF, new_row) }}

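However, there are 49 errors, all of the same type. A strange string is inserted at various places in the table. At the moment it is "P`j[Ÿ", but when I run the script again it is "°Tj[Ÿ". The string is different every time I run the script, so I cannot hard-code it, and removing it with str_replace_all fails: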

str_replace_all(c, problem, "") 

Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement),  : 
  Missing closing bracket on a bracket expression. (U_REGEX_MISSING_CLOSE_BRACKET)

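【Comments】:

These files are meant to be loaded into a SQL Server database (created with the scripts in TableBuildScripts.zip) using its bcp tool. Do yourself a favor and use that instead of trying to parse them yourself.
Does this answer your question? Opening .bcp files in R
@MarkRotteveel No, it does not, because readChar does not return the whole contents of the file
@MarkRotteveel I can parse them almost completely; it is this one error that puzzles me. I know it has something to do with encoding, but I don't know how to make it work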

Tags: r csv stringr bcp

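【Solution 1】:

Just to let the world know, it can be done. In the first pass the file is parsed and the rows that fail are stored in exportF; the stray string is identified from those rows and removed from the raw input. In the second pass the cleaned input parses correctly.

It is a mess, but it works, and it is fast too.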

library(stringr)
library(stringi)


b <- readBin("extract_charity.bcp", "character", n = 300000, size = NA_integer_)

c<- paste0(b, collapse = "" )

tt<- str_locate_all(c, "\\*\\@\\@\\*\\d")

e <- tt[[1]]

flags <- e[,1]

f <- c()

f[1] <- substr(c, 1, flags[1]-1)

for (i in 2:length(flags)) {
  f[i]<- substr(c, flags[i-1]+4, flags[i]-1)
}



export <- matrix(nrow = length(flags), ncol = 18)
exportF <- matrix(nrow = 0, ncol = 18)


for (j in 1:length(flags)) {
  new_row <- str_split( f[j], "\\@\\*\\*\\@" )[[1]]
  if (length(new_row)==18) { export[j, ] <- new_row
  } else {print(flags[j])
      exportF <- rbind(exportF, new_row) }}


# Go through the first failed row and extract the stray string from it
# (the column and character positions are specific to this file)
problem <- str_sub(as.character(exportF[1, 8]), 5, 10)

# Check that the stray string occurs in the raw input; fixed() matches it
# literally, since it can contain regex metacharacters such as "["
str_detect(c, fixed(problem))

str_detect(b[324], fixed(problem))

# Remove the stray string from every element of b, again matching it literally
r <- gsub(problem, "", b, fixed = TRUE)

# Should now be FALSE
any(str_detect(r, fixed(problem)))

#now go again but with clean data

r<- paste0(r, collapse = "" )
tt<- str_locate_all(r, "\\*\\@\\@\\*\\d")

e <- tt[[1]]

flags <- e[,1]

f <- c()


f[1] <- substr(r, 1, flags[1]-1)

for (i in 2:length(flags)) {
  f[i]<- substr(r, flags[i-1]+4, flags[i]-1)
}

#g<- str_split(f[372432], "\\@\\*\\*\\@")[[1]]

export <- matrix(nrow = length(flags), ncol = 18)
exportF <- matrix(nrow = 0, ncol = 18)


for (j in 1:length(flags)) {
  new_row <- str_split( f[j], "\\@\\*\\*\\@" )[[1]]
  if (length(new_row)==18) { export[j, ] <- new_row
  } else {print(flags[j])
    exportF <- rbind(exportF, new_row) }}

write.csv(export, "extract_charity2021.csv", row.names = FALSE)
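
The stray string can contain regex metacharacters such as "[", which is what triggers the U_REGEX_MISSING_CLOSE_BRACKET error shown in the question. Below is a minimal sketch of removing such a string by literal matching instead. The stray string and the field values are made up for illustration and are not taken from the real file.

```r
# Hypothetical stray string containing "[", which a regex engine
# would read as the start of a bracket expression
problem <- "P`j[x"

# Made-up field values, one of them polluted by the stray string
txt <- c("Some P`j[xcharity name", "clean field")

# fixed = TRUE makes gsub() treat the pattern as plain text, not as a regex
cleaned <- gsub(problem, "", txt, fixed = TRUE)

print(cleaned)
```

With stringr, wrapping the pattern in fixed(), as in str_replace_all(txt, fixed(problem), ""), has the same effect.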

Leaving this here for my future self or anyone else who needs to do this.
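
For comparison, here is a sketch of a different way to do the same split: read the whole file as raw bytes, convert it to a single string, and split on the two delimiters literally. It runs on a tiny made-up file with three columns rather than the real 18-column extract, and it assumes the data contains no embedded NUL bytes. It has not been checked against the real download, so treat it as an illustration of the idea rather than a tested replacement for the script above.

```r
# Build a tiny stand-in file that uses the same delimiters as the extract:
# "@**@" between columns and "*@@*" between rows (all values are made up)
path <- tempfile(fileext = ".bcp")
writeBin(charToRaw("1@**@Charity A@**@Open*@@*2@**@Charity B@**@Closed*@@*"), path)

# Read every byte of the file, then convert the bytes to one string
# (rawToChar() fails if the data contains embedded NUL bytes)
bytes <- readBin(path, what = "raw", n = file.size(path))
txt <- rawToChar(bytes)

# Split into rows, then into columns; fixed = TRUE treats the delimiters literally
rows <- strsplit(txt, "*@@*", fixed = TRUE)[[1]]
rows <- rows[nzchar(rows)]
fields <- strsplit(rows, "@**@", fixed = TRUE)

# Keep only the rows with the expected number of columns (3 in this toy file)
ok <- lengths(fields) == 3
export <- do.call(rbind, fields[ok])

print(export)
```

One caveat: strsplit() drops a trailing empty field, so a row whose last column is empty comes out one field short. str_split() from stringr keeps the trailing empty string, which matters when rows are filtered by column count.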

【Comments】: