【Question title】: Parallel processing for 2 datasets
【Posted】: 2015-03-21 19:09:43
【Question】:

This is basically a follow-up to a question I asked earlier. The link is:

http://stackoverflow.com/questions/28115272/how-can-i-accomplish-parallel-processing-in-r

Now, the code, which is:

library( doParallel )
cl <- makeCluster( 2 ) # for 2 processors, i.e. 2 parallel chains
registerDoParallel( cl )

datalist <- list(mydataset1 , mydataset2)

# now start the chains
nchains <- 2 # for two processors

results_list <- foreach(i=1:nchains , 
                .packages = c( 'packages_you_need') ) %dopar% {
     result <- find.string( datalist[[i]] )
     return(result) }

This seems to work fine when datalist contains 2 simple strings, for example,

datalist <- list("abcabcabc","adcadcadc")

However, if I combine two real datasets, each containing multiple rows of strings, for example,

Dataset1:

abcabcabc
adcadcadc
aecaecaec
afcafcafc
.........

Dataset2:

xyzxyzxyz
xzcxzcxzc
xtcxtcxtc
xdcxdcxdc
.........

With datasets like these, it produces an error:

Error in { : task 1 failed - "'to' must be of length 1"

Any suggestions as to why this happens or how to get rid of it?

Thanks!

Edit:

str(datalist) - List of 2
 $ : chr [1:3631] "000000000fbff000ff0000f00000" "000000000000fffffffffff0f000" "bb0bb00000f000000000bfff0000" "00b0b000bfbffffbffbf0ff00000" ...
 $ : chr [1:3631] "000000000srst000tt0000t00000" "000000000000ttttttttttt0r000" "ss0tt00000q000000000sstt0000" "00s0q000ssqtsstrstss0ss00000" ...


dput(head(datalist))

"00000000r0t0st0000p000000000", "00000ssssttstssttts000000000", 
"000000000r00sq000tp000000000", "0000000000tsq0sq0qt000000000", 
"000q0000r00000000rss00000000", "00000000ttttttttttt000000000", 
"0000000000s0qs000s0000000000", "000000ppqppqsrrrsr0000000000", 
"00000r00s0t00ss00st000000000", "0000000000s000s0tt0000000000", 
"00000s0000ttstq000t000000000", "0000000000qrs0t0s00t00000000", 
"000000000s000stt0t0000000000", "0000000000qtr0000t0000000000", 
"0000000000rrsrsqrr0000000000", "0000000000tsp0s000s000000000", 
 ..............................................................

Edit2: An example with 4 elements in each dataset.

str(datalist)

List of 2
 $ : chr [1:4] "000000000fbff000ff0000f00000" "000000000000fffffffffff0f000" "bb0bb00000f000000000bfff0000" "00b0b000bfbffffbffbf0ff00000"
 $ : chr [1:4] "000000000srst000tt0000t00000" "000000000000ttttttttttt0r000" "ss0tt00000q000000000sstt0000" "00s0q000ssqtsstrstss0ss00000"


 dput(head(datalist))
list(c("000000000fbff000ff0000f00000", "000000000000fffffffffff0f000", 
"bb0bb00000f000000000bfff0000", "00b0b000bfbffffbffbf0ff00000"
), c("000000000srst000tt0000t00000", "000000000000ttttttttttt0r000", 
"ss0tt00000q000000000sstt0000", "00s0q000ssqtsstrstss0ss00000"
))

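【Comments】:

  • Could you edit your question and paste in the results of str(datalist) and dput(head(datalist))? That would make troubleshooting much easier.
  • I have done that. :)
  • Don't truncate the dput output. I am going to copy it and use it for testing, so I need the whole thing.
  • I would not have truncated it, but it contains about 3631 rows, which is hard to attach here.
  • dput(head(datalist)) only gives a few rows of each part; that is what head does. But if you mean there are 3631 list elements, subset it to 3-4 elements and then run dput(head(...)). Thanks. Update: you don't need to subset, I see that the elements themselves are 3631 long.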

Tags: r multithreading parallel-processing


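【Answer 1】:

Here is a working example using your data structure. For simplicity and testing, I used grep instead of your find.string. I added a .combine argument and assigned the value of the foreach operation (I made that mistake last week, when @SteveWesson helped me). By the way, when you have longer data you will need to change %do% to %dopar%. In my experience, the parallel operations don't kick in unless you have longer vectors. This isn't exactly what you are trying to do, but hopefully you can work from here.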

library( doParallel )
cl <- makeCluster( 2 ) # for 2 processors, i.e. 2 parallel chains
registerDoParallel( cl )

datalist <- list(c("000000000fbff000ff0000f00000", "000000000000fffffffffff0f000", 
"bb0bb00000f000000000bfff0000", "00b0b000bfbffffbffbf0ff00000"
), c("000000000srst000tt0000t00000", "000000000000ttttttttttt0r000", 
"ss0tt00000q000000000sstt0000", "00s0q000ssqtsstrstss0ss00000"
))

nchains <- 2 # for two processors

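# %do% runs the loop sequentially; switch to %dopar% to use the registered cluster
out <- results_list <- foreach(i=1:nchains, .combine = c) %do% {
     result <- grep("ff|tt", unlist(datalist[[i]]))
     print(result)
     return(result) }
out # shows result too

stopCluster( cl ) # release the worker processes when done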
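For reference, the same loop with %dopar% might look like the sketch below. It assumes the doParallel package is installed and reuses the four-element sample data from the question; grep still stands in for find.string.

```r
library(doParallel)

cl <- makeCluster(2)          # two worker processes
registerDoParallel(cl)

datalist <- list(
  c("000000000fbff000ff0000f00000", "000000000000fffffffffff0f000",
    "bb0bb00000f000000000bfff0000", "00b0b000bfbffffbffbf0ff00000"),
  c("000000000srst000tt0000t00000", "000000000000ttttttttttt0r000",
    "ss0tt00000q000000000sstt0000", "00s0q000ssqtsstrstss0ss00000")
)

# One task per dataset; .combine = c concatenates the per-task index vectors
out <- foreach(i = seq_along(datalist), .combine = c) %dopar% {
  grep("ff|tt", datalist[[i]])
}

stopCluster(cl)               # shut the workers down
out
```

Note that print() calls inside %dopar% run on the workers, so their output normally does not appear in the main console; inspect out instead.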

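【Comments】:

  • This doesn't seem to work. In my case I tried adapting the grep to find.string(), which accepts a character vector, but it doesn't seem to work. I end up with the same error message: Error in { : task 1 failed - "'to' must be of length 1" In addition: Warning messages: 1: In len:1 : numerical expression has 4 elements: only the first used 2: In len:1 : numerical expression has 4 elements: only the first used 3.
  • OK, I would edit your original question to include find.string, and we'll take a look. Please try to pare find.string down to the shortest version that still causes the error.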
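
The error text and the `len:1` warnings suggest that find.string was written for a single string and is being handed a whole character vector. That is only an assumption, since find.string is not shown. The sketch below uses a hypothetical scalar-only stand-in, find_one, to reproduce the same kind of failure and shows one possible workaround: applying the function to each element with lapply.

```r
# Hypothetical stand-in for find.string: written with a single string in mind
find_one <- function(s) {
  len <- nchar(s)                  # becomes a vector if s has several elements
  starts <- seq(1, to = len - 2)   # seq() requires 'to' to have length 1
  unique(substring(s, starts, starts + 2))
}

dataset <- c("abcabcabc", "adcadcadc", "aecaecaec")

# Passing the whole vector at once fails inside seq()
msg <- tryCatch(find_one(dataset), error = function(e) conditionMessage(e))
msg

# Applying the function to one string at a time avoids the problem
per_string <- lapply(dataset, find_one)
per_string[[1]]
```

If this is indeed the cause, the body of the foreach loop could call lapply(datalist[[i]], find.string) instead of find.string(datalist[[i]]), so that each worker walks through its dataset one string at a time.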