【Question title】: Merging/collapsing identical consecutive elements in a vector
【Posted】: 2017-09-26 03:40:34
【Question】:

I am trying to merge identical consecutive observations into a single collapsed string. A simple example:

a <- c("H", "H", "H", "N", "T", "N", "T", "H", "N", "T", "T")
[1] "H" "H" "H" "N" "T" "N" "T" "H" "N" "T" "T"

b <- c("HHH", "N", "T", "N", "T", "H", "N", "TT")
[1] "HHH" "N"   "T"   "N"   "T"   "H"   "N"   "TT"

c <- c("HHH", "HHH", "N", "T", "N", "T", "H", "N", "TT", "TT")
[1] "HHH" "HHH" "N"   "T"   "N"   "T"   "H"   "N"   "TT"  "TT" 

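Here I would like to create a function that converts vector a into vector b or c. For example, since the first three observations are all H, together they become HHH. The same goes for the two Ts, which become TT. Note that I want to preserve the overall order, and that the number of consecutive occurrences of a given element is not limited to three. For example, there could be ten As in a row, and they should be converted into a single AAAAAAAAAA.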

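I tried building this up step by step with a for loop, but could not get any further because the number of consecutive repetitions is unbounded. I also tried the base rle function, but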

rle(a)

gives something like

Run Length Encoding
  lengths: int [1:8] 3 1 1 1 1 1 1 2
  values : chr [1:8] "H" "N" "T" "N" "T" "H" "N" "T"

where the eleven elements are reduced to eight, and the positions of the consecutive occurrences are not recorded.

【Question comments】:

    Tags: r


    【Solution 1】:

    You can use gregexpr together with regmatches:

    a <- c("H", "H", "H", "N", "T", "N", "T", "H", "N", "T", "T")
    
    # collapse string
    b <- paste(a, collapse = "")
    
    # extract instances of repeated characters
    regmatches(b, gregexpr("(.)\\1*", b))[[1]]
    # [1] "HHH" "N"   "T"   "N"   "T"   "H"   "N"   "TT"
    

    The stringi equivalent would be:

    library(stringi)
    stri_extract_all_regex(b, "(.)\\1*")[[1]]
    # [1] "HHH" "N"   "T"   "N"   "T"   "H"   "N"   "TT"
    

    And the ore package, for good measure:

    library(ore)
    matches(ore.search("(.)\\1*", b, all = TRUE))
    #[1] "HHH" "N"   "T"   "N"   "T"   "H"   "N"   "TT"
    

    【Comments】:

      【Solution 2】:
      with(rle(a), sapply(1:length(values), function(i)
          paste(rep(values[i], lengths[i]), collapse = "")))
      #[1] "HHH" "N"   "T"   "N"   "T"   "H"   "N"   "TT" 
      

      sapply(split(a, cumsum(c(TRUE, a[-1] != head(a, -1)))), paste, collapse = "")
      #    1     2     3     4     5     6     7     8 
      #"HHH"   "N"   "T"   "N"   "T"   "H"   "N"  "TT" 
      

      【Comments】:

      • Wow, that was fast! Thank you very much!
      【Solution 3】:

      We can use rleid from data.table:

      library(data.table)
      unname(tapply(a, rleid(a), FUN = paste, collapse=""))
      #[1] "HHH" "N"   "T"   "N"   "T"   "H"   "N"   "TT" 
      

      Or in base R, with rle and tapply:

      with(rle(a), unname(tapply(a, rep(seq_along(values), lengths), FUN = paste, collapse="")))
      #[1] "HHH" "N"   "T"   "N"   "T"   "H"   "N"   "TT" 
      

      Or another base R option: paste the strings together, then split with a regex lookaround at every position where the character changes:

      strsplit(paste(a, collapse=""), "(?<=(.))(?!\\1)", perl = TRUE)[[1]]
      #[1] "HHH" "N"   "T"   "N"   "T"   "H"   "N"   "TT" 
      

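      【Comments】:

        【Solution 4】:

        In addition to the solutions already given, I was interested in a general algorithm that does not depend on any language-specific features.

        You said you tried a loop, but I don't think the unlimited number of repetitions is a real problem. What I wrote basically iterates over the original array and clones it. If a value in the original array is the same as the previous one, it is not added to the new array as a new item; instead it is concatenated onto the last value of the "cloned" array.

        Algorithm: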

        Create empty array(w)
        Iterate by index(i) of the original vector(v)
           If this is the first entry
              w[1] = v[1]
           Else
              If v[i] is the same as v[i-1]
                 Last entry in w is concatenated with v[i]
              Else
                 Add v[i] to the end of w
        

        In Python:

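        def collapseVector(v):
            w = []  # collapsed result
            for i in range(len(v)):
                if i == 0:
                    w.append(v[i])  # the first entry always starts a new run
                elif v[i] == v[i-1]:
                    w[-1] += v[i]  # same as the previous value: extend the last run
                else:
                    w.append(v[i])  # value changed: start a new run
            return w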
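
The loop above can also be sketched with Python's standard library. `itertools.groupby` yields runs of equal consecutive items, so each run can be joined into one string. This is an alternative formulation, not the method described in the answer, and the name `collapse_runs` is hypothetical.

```python
from itertools import groupby


def collapse_runs(v):
    """Join each run of equal consecutive strings into a single string."""
    # groupby starts a new group every time the value changes, so
    # repeats that are not adjacent stay in separate groups.
    return ["".join(run) for _, run in groupby(v)]


a = ["H", "H", "H", "N", "T", "N", "T", "H", "N", "T", "T"]
print(collapse_runs(a))
# ['HHH', 'N', 'T', 'N', 'T', 'H', 'N', 'TT']
```

As with the explicit loop, the overall order is preserved and a run may be of any length.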
        

        【Comments】:
