Answers: Improvement and parallelization of code that produces combinations of specified string characters

【Question title】: Improvement and parallelization of a code that produces combinations of specified string characters
【Posted】: 2017-08-31 05:44:04
【Question】:

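First, let me try to explain what the code below does.

From the list below, it takes the string "1MAKK" and tries to find all possible combinations of the positions of the characters specified in chars.

Here is an example of the initial setup: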

# Initial list
lst1 = list("P1"=list("1MAKK") )
chars = c("M","K")
classes = c("class.1","class.35")
# Get the P name
p_name = names(lst1[1])
# Get the string sequence
p_seq = unlist(lst1[[1]][1])

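The classes vector just holds labels that correspond to the entries of chars; it is only used for naming.

The main code then takes the variables p_name and p_seq and builds a data frame containing every possible combination of the positions of the specified characters.

Here is the code: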

library(stringr)          # str_locate_all
library(purrr)            # map2, pmap

# Functions
move_one <- function(seq){
  if(grepl("1" , seq))
    seq = paste0(substring(seq,2),1)
  else
    seq
}

# Move the number one from the first to last position
seq = move_one(p_seq)

# Get the positions of each character in the string
pos = unlist( map2( 
  .f=function(a ,p) str_locate_all(p, a) , 
  .x=chars , 
  .y=seq), 
  recursive = FALSE
  )
# Check if there is a letter that didn't exist in the sequence and add zeros at the respective list item
for( x in seq_along(pos)){
  if (nrow(pos[[x]]) == 0) pos[[x]] <- rbind(pos[[x]] , c(0,0))
}

# Calculate all possible combinations and transpose the arrays inside the list
ind1 = pmap( 
  .f = function(x) lapply(1:nrow(pos[[x]]), combn, x=as.list(pos[[x]][,1])), 
  .l = list( 1:length(pos) )  
  )

ind1 = pmap( 
  .f = function(x) lapply(ind1[[x]], t) , 
  .l = list( 1:length(ind1) )
  )

# Add Zero at each first element
z = pmap( 
  .f = function(x) lapply(ind1[[x]][1] , rbind , 0 ) , 
  .l = list( 1:length(ind1) )
  )
# Merge the list with the zeros and the complete one
ind1 = map2(
  .f = function(a,b) {a[1]<-b[1]; a},
  .x = ind1,
  .y = z)
# Create a vector for each letter combination
ind1 = pmap( 
  .f = function (x) unlist( lapply(ind1[[x]], function(i) do.call(paste, c(as.data.frame(i), sep = ':'))) ), 
  .l = list ( 1:length(ind1) )
  )

# Get the position of the class.1
isClass1 = grep("class.1", classes)
# Check if the seq is the first one
isFirst = grepl("1",seq)

# Set only 1 and 0 in the class.1 vector if this is the first peptide
if (isFirst) ind1[[isClass1]] <- c("1","0") else ind1[[isClass1]] <- c("0")
# expand.grid for all these vectors inside ind1
ind2 = expand.grid(ind1)

# Apply column names in ind2
colnames(ind2) = classes
# Add a column with the p_name and seq
ind3 = cbind( "p_name"=rep(p_name, nrow(ind2) ) , "seq"=rep( gsub('.{1}$','',seq) , nrow(ind2) )  , ind2 )

The result for this particular input is:

> ind3
  p_name  seq  class.1  class.35
1     P1 MAKK        1         3
2     P1 MAKK        0         3
3     P1 MAKK        1         4
4     P1 MAKK        0         4
5     P1 MAKK        1         0
6     P1 MAKK        0         0
7     P1 MAKK        1       3:4
8     P1 MAKK        0       3:4

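As you can see, I tried to use lapply, map2 and pmap instead of for loops, both to make the code faster and to give it a chance of running on multiple CPU cores in the final version.

This is where I need your help and your opinion.

My actual initial list does not contain just a single string. It looks like the one below, except that it has thousands of inner lists (Px, where x = {1, 2, 3, 4, ..., 2000}), and each Px may hold around a hundred sequences.

p_list = list( "P1" = list( c("1MAK","ERTD","FTRWDSE" )) , "P2" = list( c("1MERTDF","DFRGRSDFG","DFFF")) )

The first question, and probably the easiest to answer, is how to run (apply) the code above over a list like this.

Second, how can I parallelize the computation so that it uses several CPU cores of a server with 24 cores, in order to save some time?

P.S.: The final result should be all the individual results (each like the one shown earlier) combined into a single data frame, probably with rbind.

Any improvement, idea or suggestion is welcome.

Thank you.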

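【Comments】:

What are move_one and a.a? Can you explain?
Oops, sorry. a.a was a typo and I have changed it to chars. move_one() is just a function that, if a "1" is present at the first position of the sequence, moves it to the last position. (I have posted its code as well.)
Will chars only ever be letters? Not words?
Yes. chars can only be letters.
Sorry, but it is hard to follow your process, because you use a lot of map and apply functions that are unnecessary in my opinion. You should first write a simple function that handles one word (and make it as fast as possible); applying it to a list is then easy. Your current approach is too complicated, and it will certainly be slow.

Tags: r parallel-processing combinations

【Solution 1】: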

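Part 1

This is basically the code I would use for a single string. The result contains list-columns (it helps if you already know what those are).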

library(purrr)
x <- "MAKK"
p_name <- "P1"  # taken from the question's setup; used in the last call below
chars <- set_names(c("M", "K"), c("class.1", "class.35"))

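get_0_and_all_combn <- function(x) {
  # Combine indices rather than values: combn(3, 1) would expand a single
  # position 3 to 1:3, so index into x instead.
  map(seq_along(x), function(i) {
    combn(length(x), i, FUN = function(idx) x[idx], simplify = FALSE)
  }) %>%
    unlist(recursive = FALSE) %>%
    c(0L, .)
}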
get_0_and_all_combn(3:4)
[[1]]
[1] 0

[[2]]
[1] 3

[[3]]
[1] 4

[[4]]
[1] 3 4

get_pos_combn <- function(x, chars) {
  x.spl <- strsplit(x, "")[[1]] 
  map(chars, function(chr) {
    which(x.spl == chr) %>%
      get_0_and_all_combn()
  }) %>%
    expand.grid()
}
get_pos_combn(x, chars)
  class.1 class.35
1       0        0
2       1        0
3       0        3
4       1        3
5       0        4
6       1        4
7       0     3, 4
8       1     3, 4
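
The number of rows follows directly from how the grid is built: a character that occurs n times contributes 2^n entries (every subset of its positions, with 0 standing for the empty subset), and expand.grid multiplies those counts. Below is a small base-R sketch for estimating the size before generating anything. The name count_rows is made up for this illustration, and it mirrors the function above, not the special class.1 rule from the question.

```r
# Hypothetical helper: predicts nrow(get_pos_combn(x, chars)) without
# building the grid, by counting how often each character occurs.
count_rows <- function(x, chars) {
  x.spl <- strsplit(x, "")[[1]]
  # occurrences of each character of interest in the string
  n_occ <- vapply(chars, function(chr) sum(x.spl == chr), integer(1))
  # each character contributes 2^n subsets of its n positions
  prod(2^n_occ)
}

count_rows("MAKK", c("M", "K"))  # 2^1 * 2^2 = 8
```

With thousands of Px entries and about a hundred sequences each, a check like this may help spot sequences whose grid would be too large to keep in memory.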

get_pos_combn_with_infos <- function(seq, chars, p_name) {
  cbind.data.frame(p_name, seq, get_pos_combn(seq, chars))
}
get_pos_combn_with_infos(x, chars, p_name)
  p_name  seq class.1 class.35
1     P1 MAKK       0        0
2     P1 MAKK       1        0
3     P1 MAKK       0        3
4     P1 MAKK       1        3
5     P1 MAKK       0        4
6     P1 MAKK       1        4
7     P1 MAKK       0     3, 4
8     P1 MAKK       1     3, 4
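
The class.35 column above is a list-column: each cell holds an integer vector, not a string. If the "3:4" format from the question is wanted, the cells can be collapsed afterwards. This is a minimal base-R sketch; the grid is rebuilt by hand so the snippet runs on its own, and collapse_positions is a name invented here.

```r
# Rebuild a grid shaped like the output above (two list-columns).
res <- expand.grid(
  class.1  = list(0L, 1L),
  class.35 = list(0L, 3L, 4L, 3:4)
)

# Turn each cell (an integer vector) into one string such as "3:4".
collapse_positions <- function(col) {
  vapply(col, function(p) paste(p, collapse = ":"), character(1))
}

flat <- as.data.frame(lapply(res, collapse_positions),
                      stringsAsFactors = FALSE)
flat$class.35
# "0" "0" "3" "3" "4" "4" "3:4" "3:4"
```

Plain character columns are also easier to rbind and to write to disk than list-columns.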

So, if you want me to complete my answer, I need to know which chars and classes go with your full example:

p_list = list( "P1" = list( c("1MAK","ERTD","FTRWDSE" )) , 
               "P2" = list( c("1MERTDF","DFRGRSDFG","DFFF")) )

Also, are you sure you want "P1" and "P2" to be lists of length 1?
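
In the meantime, here is a rough sketch of how a per-string function could be run over the whole list and spread across cores. It rests on several assumptions that the question does not confirm: the chars/classes mapping of the small example (M for class.1, K for class.35) is reused, the leading "1" is simply stripped before positions are computed, and the question's special rule for class.1 is left out. The function names are invented for this illustration, and only base R plus the bundled parallel package is used.

```r
library(parallel)  # ships with R

# Every subset of the positions in `pos`, with 0L standing for "none".
subsets_with_zero <- function(pos) {
  subs <- lapply(seq_along(pos), function(k) {
    # combine indices, not values, so a single position is not expanded
    combn(length(pos), k, FUN = function(idx) pos[idx], simplify = FALSE)
  })
  c(list(0L), unlist(subs, recursive = FALSE))
}

# One data frame for one sequence (assumption: class.1 rule not applied).
one_sequence <- function(pep, p_name, chars) {
  pep <- sub("^1", "", pep)                 # drop the first-peptide marker
  pep.spl <- strsplit(pep, "")[[1]]
  per_class <- lapply(chars, function(chr) {
    subs <- subsets_with_zero(which(pep.spl == chr))
    vapply(subs, paste, character(1), collapse = ":")
  })
  grid <- expand.grid(per_class, stringsAsFactors = FALSE)
  data.frame(p_name = p_name, seq = pep, grid,
             stringsAsFactors = FALSE, check.names = FALSE)
}

all_sequences <- function(p_list, chars, cores = 1L) {
  # flatten the nested list to one (p_name, seq) pair per job
  jobs <- do.call(rbind, lapply(names(p_list), function(nm) {
    data.frame(p_name = nm, seq = unlist(p_list[[nm]]),
               stringsAsFactors = FALSE)
  }))
  if (.Platform$OS.type == "windows") cores <- 1L  # mclapply cannot fork there
  parts <- mclapply(seq_len(nrow(jobs)), function(i) {
    one_sequence(jobs$seq[i], jobs$p_name[i], chars)
  }, mc.cores = cores)
  do.call(rbind, parts)
}

chars <- c(class.1 = "M", class.35 = "K")   # assumed mapping
p_list <- list(P1 = list(c("1MAK", "ERTD", "FTRWDSE")),
               P2 = list(c("1MERTDF", "DFRGRSDFG", "DFFF")))
result <- all_sequences(p_list, chars, cores = 2L)
```

On the 24-core server, cores could be raised accordingly. Whether that pays off depends on how long each sequence takes, since very short jobs can be dominated by the overhead of forking and of the final rbind.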

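【Comments】:

Since my comment is rather long, I posted it on pastebin (pastebin.com/YvSciXZs). Thanks for your attention.
Your question asks for too much. That is not possible.
I see. Thanks anyway for your help and your time :)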