庞大的数据文件和运行多个参数和内存问题，Fisher 测试答案

【问题标题】：Huge data file and running multiple parameters and memory issue, Fisher's test庞大的数据文件和运行多个参数和内存问题，Fisher 测试
【发布时间】：2014-08-09 11:07:51
【问题描述】：

我有一个 R 代码，我试图在服务器中运行。但它可能由于内存限制而停在中间/冻结。数据文件非常庞大（一个有 2000 万行），如果您查看代码中的双 for 循环，length(ratSplit) = 281 和 length(humanSplit) = 36。该数据有人类和大鼠基因的特定数据，人类有36个重复，而大鼠有281个。所以，循环基本上是281*36步。我想要做的是使用函数getGeneType 处理数据，看看不同复制组合的表达有多么不同/独立。使用费舍尔检验。数据 rat_processed_7_25_FDR_05.out 如下所示：

2       Sptbn1  114201107       114200202       chr14|Sptbn1:114201107|Sptbn1:114200202|reg|-   2       Thymus_M_GSM1328751     reg
2       Ndufb7  35680273        35683909        chr19|Ndufb7:35680273|Ndufb7:35683909|reg|+     2       Thymus_M_GSM1328751     rev
2       Ndufb10 13906408        13906289        chr10|Ndufb10:13906408|Ndufb10:13906289|reg|-   2       Thymus_M_GSM1328751     reg
3       Cdc14b  1719665 1719190 chr17|Cdc14b:1719665|Cdc14b:1719190|reg|-       3       Thymus_M_GSM1328751     reg

而数据fetal_output_7_2.out 的格式为

SPTLC2  78018438        77987924        chr14|SPTLC2:78018438|SPTLC2:77987924|reg|-     11      Fetal_Brain_408_AGTCAA_L006_R1_report.txt       reg
EXOSC1  99202993        99201016        chr10|EXOSC1:99202993|EXOSC1:99201016|rev|-     5       Fetal_Brain_408_AGTCAA_L006_R1_report.txt       reg
SHMT2   57627893        57628016        chr12|SHMT2:57627893|SHMT2:57628016|reg|+       8       Fetal_Brain_408_AGTCAA_L006_R1_report.txt       reg
ZNF510  99538281        99537128        chr9|ZNF510:99538281|ZNF510:99537128|reg|-      8       Fetal_Brain_408_AGTCAA_L006_R1_report.txt       reg
PPFIBP1 27820253        27824363        chr12|PPFIBP1:27820253|PPFIBP1:27824363|reg|+   10      Fetal_Brain_408_AGTCAA_L006_R1_report.txt       reg

现在我有几个关于如何提高效率的问题。我认为当我运行这段代码时，R 会占用大量内存，最终导致问题。我想知道是否有任何方法可以更有效地做到这一点

另一种可能性是使用双 for-loop'。 sapply 会有帮助吗？那么，我应该如何申请 sapply？

最后，我想将result 转换为 csv 文件。我知道放置这样的代码有点让人不知所措。但是任何优化/高效的编码/编程都会很多！我真的需要至少运行整个事情才能尽快获取数据。

#this one compares reg vs rev
date()
ratRawData <- read.table("rat_processed_7_25_FDR_05.out",col.names = c("alignment", "ratGene", "start", "end", "chrom", "align", "ratReplicate", "RNAtype"), fill = TRUE)
humanRawData <- read.table("fetal_output_7_2.out", col.names = c("humanGene", "start", "end", "chrom", "alignment", "humanReplicate", "RNAtype"), fill = TRUE)
geneList <- read.table("geneList.txt", col.names = c("human", "rat"), sep = ',')
#keeping only information about gene, alignment number, replicate and RNAtype, discard other columns
ratRawData <- ratRawData[,c("ratGene", "ratReplicate", "alignment", "RNAtype")]
humanRawData <- humanRawData[, c( "humanGene", "humanReplicate", "alignment", "RNAtype")]

#function to capitalize
capitalize <- function(x){
  capital <- toupper(x) ## capitalize 
  paste0(capital)
}

#capitalizing the rna type naming for rat. So, reg ->REG, dup ->DUP, rev ->REV
#doing this to make data manipulation for making contingency table easier.
levels(ratRawData$RNAtype) <- capitalize(levels(ratRawData$RNAtype))

#spliting data in replicates
ratSplit <- split(ratRawData, ratRawData$ratReplicate)
humanSplit <- split(humanRawData, humanRawData$humanReplicate)

print("done splitting")
#HyRy :when some gene has only reg, rev , REG, REV
#HnRy : when some gene has only reg,REG,REV
#HyRn : add 1 when some gene has only reg,rev,REG
#HnRn : add 1 when some gene has only reg,REG

#function to be used to aggregate 
getGeneType <- function(types) {
  types <- as.character(types)
  if ('rev' %in% types) {
    return(ifelse(('REV' %in% types), 'HyRy', 'HyRn'))
  } 
  else {
    return(ifelse(('REV' %in% types), 'HnRy', 'HnRn'))
  }
}

#logical function to see whether x is integer(0) ..It's used the for loop bellow in case any one HmYn is equal to zero
is.integer0 <- function(x) {
  is.integer(x) && length(x) == 0L
}

result <- data.frame(humanReplicate = "human_replicate", ratReplicate = "rat_replicate", pvalue = "p-value", alternative = "alternative_hypothesis", 
                     Conf.int1 = "conf.int1", Conf.int2 ="conf.int2", oddratio = "Odd_Ratio")
for(i in 1:length(ratSplit)) {
  for(j in 1:length(humanSplit)) {
    ratReplicateName <- names(ratSplit[i])
    humanReplicateName <- names(humanSplit[j])

    #merging above two based on the one-to-one gene mapping as in geneList defined above.
    mergedHumanData <-merge(geneList,humanSplit[[j]], by.x = "human", by.y = "humanGene")
    mergedRatData <- merge(geneList, ratSplit[[i]], by.x = "rat", by.y = "ratGene")

    mergedHumanData <- mergedHumanData[,c(1,2,4,5)] #rearrange column
    mergedRatData <- mergedRatData[,c(2,1,4,5)]  #rearrange column
    mergedHumanRatData <- rbind(mergedHumanData,mergedRatData) #now the columns are "human", "rat", "alignment", "RNAtype"

    agg <- aggregate(RNAtype ~ human+rat, data= mergedHumanRatData, FUN=getGeneType) #agg to make HmYn form
    HmRnTable <- table(agg$RNAtype) #table of HmRn ie RNAtype in human and rat.

    #now assign these numbers to variables HmYn. Consider cases when some form of HmRy is not present in the table. That's why
    #is.integer0 function is used
    HyRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRy"]), 0, HmRnTable[names(HmRnTable) == "HyRy"][[1]])
    HnRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRn"]), 0, HmRnTable[names(HmRnTable) == "HnRn"][[1]])
    HyRn <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HyRn"]), 0, HmRnTable[names(HmRnTable) == "HyRn"][[1]])
    HnRy <- ifelse(is.integer0(HmRnTable[names(HmRnTable) == "HnRy"]), 0, HmRnTable[names(HmRnTable) == "HnRy"][[1]])

    contingencyTable <- matrix(c(HnRn,HnRy,HyRn,HyRy), nrow = 2)
    # contingencyTable: 
    # HnRn --|--HyRn
    # |------|-----|
    # HnRy --|-- HyRy
    #

    fisherTest <- fisher.test(contingencyTable)

    #make new line out of the result of fisherTest
    newLine <- data.frame(t(c(humanReplicate = humanReplicateName, ratReplicate = ratReplicateName, pvalue = fisherTest$p,
                              alternative = fisherTest$alternative, Conf.int1 = fisherTest$conf.int[1], Conf.int2 =fisherTest$conf.int[2], 
                              oddratio = fisherTest$estimate[[1]])))

    result <-rbind(result,newLine) #append newline to result
    if(j%%10 = 0) print(c(i,j))
  }
}

write.table(result, file = "compareRegAndRev.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ",")

【问题讨论】：

如果您正在处理庞大的数据集并遇到内存问题，我强烈建议您探索 R 中的 data.table 包。

标签： r performance export-to-csv write.table

【解决方案1】：

参考Monitor memory usage in R接受的答案，R使用的内存量可以用gc()跟踪。

如果脚本确实内存不足（这不会让我感到惊讶），解决问题的最简单方法是将write.table() 从循环的外部移动到循环的内部，以替换rbind()。只需为从每个输出写入的 CSV 文件创建一个新文件名，例如作者：

csvFileName <- sprintf("compareRegAndRev%03d_%03d.csv",i,j)

如果写入的 CSV 文件没有标头，则它们可以在 R 之外单独连接（例如，在 Unix 中使用 cat）并稍后添加标头。

虽然这种方法可能会成功创建所需的 CSV 文件，但文件可能太大而无法随后处理。如果是这样，最好单独处理 CSV 文件，而不是将它们连接起来。

【讨论】：