Align strings of a data frame in columns in R: answers

【Question title】: Align strings of a dataframe in columns in r
【Posted】: 2016-10-22 09:16:58
【Question description】:

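I have a large data frame and I want its strings aligned in columns according to their suffix (a substring). The source data frame looks like this:

notst stands for other variable prefixes, which should be ignored.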

#            col1       col2       col3 
#        notst-s1   notst-s2   notst-x3
#        notst-s1   notst-x3   notst-a5   
#        notst-s2   notst-a5
#        notst-x3   notst-a5

The result should be:

#            col1       col2       col3       col4 
#        notst-s1   notst-s2   notst-x3
#        notst-s1              notst-x3   notst-a5   
#                   notst-s2              notst-a5
#                              notst-x3   notst-a5

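Edit:

Consider the whole suffix (everything after "-"). It contains no digits. In some cases the whole string ("xxxx-spst") should be matched (*), because the xxxx part of the string comes in several versions.

For: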

df <- read.table(text="
           col1         col2        col3 
         st1-ab     stb-spst    sta-spst
       stc-spst     sta-spst      st4-ab   
       stb-spst       st7-ab
         st9-ba     stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

A possible result could be (column names and order are irrelevant):

#           col1         col2        col3       col4    
#         st1-ab     stb-spst    sta-spst      
#         st4-ab     stc-spst    sta-spst         
#         st7-ab     stb-spst       
#                    stb-spst                 st9-ba

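(*) Note that in row 2, col2, "stc-spst" seems misplaced, but this is not a problem: the value stb-spst does not occur in that row, so for this particular case only the suffix ("spst") matters. In other words, when a whole string (prefix-suffix) matches a string in another row, the two should be in the same column; if not, when its suffix matches the suffix of a string in another row, they should be in the same column. The resulting data frame should have the same number of rows as the original and as few columns as possible.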
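The rule in the footnote can be sketched as a first pass that gives every cell a column key: the whole string when that string occurs more than once, otherwise only the suffix. This is an illustration on the cells of df, not a full solution; the names cells, repeated and key are made up here, and the last step (folding a suffix-only column such as "spst" into a compatible whole-string column to minimise the column count) is not covered.

```r
# Non-empty cells of df, listed row by row
cells <- c("st1-ab", "stb-spst", "sta-spst",
           "stc-spst", "sta-spst", "st4-ab",
           "stb-spst", "st7-ab",
           "st9-ba", "stb-spst")

# Suffix = everything after the last "-"
suffix <- sub("^.*-", "", cells)

# TRUE for strings that occur more than once in the data
repeated <- cells %in% cells[duplicated(cells)]

# Whole-string key for repeated values, suffix key for the rest
key <- ifelse(repeated, cells, suffix)
key
# "ab" "stb-spst" "sta-spst" "spst" "sta-spst" "ab" "stb-spst" "ab" "ba" "stb-spst"
```

This first pass yields five distinct keys, one more than the four columns in the possible result above, because "stc-spst" still sits alone under the "spst" key.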

Edit: the answer should be general and should also work for:

df2 <- read.table(text="
col1         col2        col3       col4 
st1-ab       stb-spst    sta-spst   std-spst
stc-spst     sta-spst    st4-ab     st2-ab
stb-spst     st7-ab      sa-ac
st9-ba       stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

A possible result:

#           col1         col2        col3       col4    col5      col6     col7
#         st1-ab     stb-spst    sta-spst    std-spst  
#         st4-ab     stc-spst    sta-spst               st2-ab  
#         st7-ab     stb-spst                                     sa-ac  
#                    stb-spst                                           st9-ba

Example 3

df3 <- read.table(text="
col1         col2        col3       col4 
st1-ab       stb-spst    sta-spst   std-spst
stb-spst     sta-ab    
sta-spst     st7-ab      sa-ac
sta-spst     stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

Desired output

  col1   col2     col3     col4     col5  
1       st1-ab    sta-spst stb-spst std-spst
2       sta-ab             stb-spst         
3 sa-ac st7-ab    sta-spst                  
4                 sta-spst stb-spst

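Edit: example 4. To make the task easier, you can explicitly define in the function which suffixes may have more than one possible prefix per row; in this example that is "spst". Any string whose suffix is not "spst" should then have only one possible prefix per row, and can and must be collapsed into a single column of the resulting df, like col2 in the desired output. This is not what I originally wanted, because I would get more columns than expected. Ideally, strings that contain spst with different prefixes should appear in as few columns as possible (see above).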

df4 <- read.table(text="
col1         col2        col3       col4 
st1-ab       stb-spst    sta-spst   std-spst
stb-spst     st1-ab    
sta-spst     st7-ab      sa-ac
sta-spst     stb-spst    st7-ab",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

Desired output

row_id  col1  col2          col3     col4     col5
1             st1-ab        sta-spst stb-spst std-spst
2             st1-ab                 stb-spst         
3       sa-ac st7-ab        sta-spst                  
4             st7-ab        sta-spst stb-spst
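The workaround described for example 4 can be sketched as a key function: strings whose suffix is on an explicit list keep the whole string as their column key, and all others are keyed by suffix alone. The function name column_key and its multi argument are hypothetical, chosen only for this illustration.

```r
# Hypothetical helper: column key for each string.
# `multi` lists the suffixes that may occur with several prefixes per row.
column_key <- function(x, multi = "spst") {
  suf <- sub("^.*-", "", x)        # text after the last "-"
  ifelse(suf %in% multi, x, suf)   # whole string for listed suffixes
}

# Non-empty cells of df4, listed row by row
cells <- c("st1-ab", "stb-spst", "sta-spst", "std-spst",
           "stb-spst", "st1-ab",
           "sta-spst", "st7-ab", "sa-ac",
           "sta-spst", "stb-spst", "st7-ab")

unique(column_key(cells))
# "ab" "stb-spst" "sta-spst" "std-spst" "ac"
```

The five distinct keys correspond to the five data columns of the desired output above (column order aside).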

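【Question discussion】:

Could you give us some of the logic for how the data should be moved? Why do you want to do this?
@Ferroao The new example data after the edit, and the expected output for that example, are confusing.
It has a prefix and a suffix (separated by -) as before, but there are no digits in the suffix. When there are multiple matches (columns 2 and 3), the output is based on the suffix and, in some cases, on the whole string.
In example 3, I think the output in row 2, col 2 should be sta-ab (or the input in row 2, col2 should be st2-ab)

Tags: r string match reshape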

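【Solution 1】:

We can do this by first melting the dataset, extracting the numeric index from each element, building a row/column index from it, and assigning the elements to a matrix whose size is based on the maximum index.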

library(reshape2)
d1 <- na.omit(transform(melt(as.matrix(df1)), v1 = as.numeric(sub("\\D+", "", value))))
m1 <- matrix("", nrow = max(d1$Var1), ncol = max(d1$v1))
m1[as.matrix(d1[c("Var1", "v1")])]  <- as.character(d1$value) 
d2 <- as.data.frame(m1[,!!colSums(m1!="")])
colnames(d2) <- paste0("col", seq_along(d2))
d2
#     col1     col2     col3     col4
#1 notst-s1 notst-s2 notst-x3         
#2 notst-s1          notst-x3 notst-a5
#3          notst-s2          notst-a5
#4                   notst-x3 notst-a5

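【Discussion】:

This works for the numbers in the pre-edit example. However, it does not work if strings share a number; for example, -s1 and -x1 should also go in different columns. More generally, text without numbers (as in the edit) should be handled too.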
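The collision mentioned in this comment is easy to reproduce. The sketch below uses three made-up strings (not data from the question) and compares the digit-only index with an index built from the whole suffix; the variable names are illustrative.

```r
vals <- c("notst-s1", "notst-x1", "notst-a5")

# Digit-only index: strip the leading run of non-digits
idx_digits <- as.numeric(sub("\\D+", "", vals))
idx_digits
# 1 1 5   (s1 and x1 collide in column 1)

# Index on the whole suffix instead: one column per distinct suffix
suf <- sub("^.*-", "", vals)
idx_suffix <- match(suf, unique(suf))
idx_suffix
# 1 2 3
```

Keying on the full suffix also works when the suffix contains no digits at all, which is the case in the edited examples.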

【Solution 2】:

Matrix indexing can do this:

sel <- dat!=""
unq <- unique(dat[sel])
mat <- matrix(NA, nrow=nrow(dat), ncol=length(unq))

mat[cbind(row(dat)[sel],  match(dat[sel], unq) )] <- dat[sel]

#     [,1]       [,2]       [,3]       [,4]      
#[1,] "notst-s1" "notst-s2" "notst-x3" NA        
#[2,] "notst-s1" NA         "notst-x3" "notst-a5"
#[3,] NA         "notst-s2" NA         "notst-a5"
#[4,] NA         NA         "notst-x3" "notst-a5"

where dat was imported as:

dat <- read.table(text="
    col1       col2       col3 
notst-s1   notst-s2   notst-x3
notst-s1   notst-x3   notst-a5   
notst-s2   notst-a5
notst-x3   notst-a5",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

【Discussion】:

Thanks thelatemail, but your answer does not take into account that the "notst" string may vary; akrun's answer does.
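One way the matrix-indexing idea could be adapted to varying prefixes is to match on the suffix rather than on the whole string. This is a sketch under assumptions, not the answerer's code: the input matrix with prefixes a to j is invented for illustration, and the guard stops if two cells in one row share a suffix, because plain matrix assignment would silently overwrite one of them.

```r
# Made-up input: same suffix layout as the original example, varying prefixes
dat <- matrix(c("a-s1", "b-s2", "c-x3",
                "d-s1", "e-x3", "f-a5",
                "g-s2", "h-a5", "",
                "i-x3", "j-a5", ""),
              nrow = 4, byrow = TRUE)

sel <- dat != ""
suf <- sub("^.*-", "", dat[sel])   # suffix of every non-empty cell
unq <- unique(suf)                 # one output column per suffix

# (row, column) target of every non-empty cell
pos <- cbind(row(dat)[sel], match(suf, unq))
stopifnot(anyDuplicated(pos) == 0) # two cells must not land in the same slot

mat <- matrix(NA_character_, nrow = nrow(dat), ncol = length(unq))
mat[pos] <- dat[sel]
mat
#      [,1]   [,2]   [,3]   [,4]
# [1,] "a-s1" "b-s2" "c-x3" NA
# [2,] "d-s1" NA     "e-x3" "f-a5"
# [3,] NA     "g-s2" NA     "h-a5"
# [4,] NA     NA     "i-x3" "j-a5"
```

For data like the later examples, where one row can hold several strings with the same suffix, the guard fails and a finer key (such as the whole string) is needed for those cells.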

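【Solution 3】:

Tested with the four examples, but this version does not take into account the information you added as a workaround in example 4.

The main addition is shuffle logic (possibly slow) that compacts the resulting data frame from right to left. assigned_by_suffix and assigned_by_single_suffix may no longer be needed, but I have not verified that.

The output is at the end of the code.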

# examples
df1 <- read.table(text="
col1         col2        col3 
st1-ab     stb-spst    sta-spst
stc-spst     sta-spst      st4-ab   
stb-spst       st7-ab
st9-ba     stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

df2 <- read.table(text="
col1         col2        col3       col4 
st1-ab       stb-spst    sta-spst   std-spst
stc-spst     sta-spst    st4-ab     st2-ab
stb-spst     st7-ab      sa-ac
st9-ba       stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

df3 <- read.table(text="
col1         col2        col3       col4 
st1-ab       stb-spst    sta-spst   std-spst
stb-spst     sta-ab    
sta-spst     st7-ab      sa-ac
sta-spst     stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

df4 <- read.table(text="
col1         col2        col3       col4 
st1-ab       stb-spst    sta-spst   std-spst
stb-spst     st1-ab    
sta-spst     st7-ab      sa-ac
sta-spst     stb-spst    st7-ab",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)

library(reshape2)
library(tidyr)
library(dplyr)
library(stringr)
library(assertthat)

suffix <- function(s) {str_extract(s, "[^\\-]+$")}

# make a tall dataframe with melt, and get the suffix
dfm <- df4 %>% 
  mutate(row_id = seq_along(col1)) %>%
  melt(id.vars="row_id") %>%
  select(-2) %>%
  filter(value != "") %>%
  mutate(suffix = suffix(value)) %>%
  arrange(value)
assert_that(!any(duplicated(dfm[c("row_id", "value")])))

# initialize 
combined <- data.frame()
remaining <- dfm

# get the groups with more than 1 value
matched_values  <- dfm %>%
  group_by(value, suffix) %>%
  summarize(n=n()) %>%
  filter(n>1) %>%
  rename(group_id = value) %>%
  ungroup()

# .. and assign the group ids that match
assigned_by_value <- remaining %>% 
  inner_join(matched_values %>% select(group_id), by = c("value" = "group_id")) %>%
  mutate(group_id = value) %>%
  select(row_id, value, suffix, group_id)
combined <- combined %>% bind_rows(assigned_by_value)
remaining <- dfm %>% anti_join(combined, by=c("row_id", "value"))
# find the remaining suffixes 
matched_suffixes <- remaining  %>%
  group_by(suffix) %>%
  summarize(n=n()) %>%
  filter(n>1) %>%
  select(-n) %>%
  ungroup()

# ... and assign those that match
assigned_by_suffix <- remaining %>%
  inner_join(matched_suffixes, by="suffix") %>%
  mutate(group_id = suffix)
combined <- bind_rows(combined, assigned_by_suffix)
remaining <- remaining %>% anti_join(combined, by=c("row_id", "value"))


# All that remain are singles: assign matches by suffix, choosing a single match by count (top_n on n)
assigned_by_single_suffix <- remaining %>%
  inner_join(matched_values, by = "suffix") %>%
  top_n(1, n) %>%
  head(1) %>%
  select(-n)
combined <- bind_rows(combined, assigned_by_single_suffix)
remaining <- remaining %>% anti_join(combined, by=c("row_id", "value"))

# get the remaining unmatched
unmatched <- remaining%>%
  mutate(group_id = value)
combined <- bind_rows(combined, unmatched)
remaining <- remaining %>% anti_join(combined, by=c("row_id", "value"))
assert_that(nrow(remaining) == 0)

# any overloads (duplicates) need to bump to their own column
dups <- duplicated(combined[,c("row_id", "group_id")])
combined$group_id[dups] <- combined$value[dups]

assert_that(nrow(combined) == nrow(dfm))

# spread the result

result <- spread(combined %>% select(-suffix), group_id, value, fill ="")

# Shuffle any matching suffix from right to left, so long as there
# is corresponding space and the whole column can move.
# i is the source (starting from the right), j is the target (scanned leftwards from i-1)
#
drop_cols = c()
suffixes <- suffix(names(result))
for (i in (ncol(result)):3) {
  for(j in (i-1):2) {
    if (suffixes[i] == suffixes[j]) {
      non_empty <- which(result[,i] != "") # list of source to move
      can_fill  <- which(result[,j] == "") # list of targets can be filled
      can_move   <- all(non_empty %in% can_fill) # is to move a subset of can_fill? 

      # if there's space, shuffle the column down
      if (can_move ) {
        # shuffle down
        result[,j] <- if_else(result[,j] != "", result[,j], result[,i]) 
        drop_cols <- c(drop_cols, i)
        result[,i] <- NA
        break
      }
    }                   
  }
}

if (!is.null(drop_cols)) {
  result <- result[,-drop_cols]
}
result

# Example 1
#   row_id     ab st9-ba sta-spst stb-spst
# 1      1 st1-ab        sta-spst stb-spst
# 2      2 st4-ab        sta-spst stc-spst
# 3      3 st7-ab                 stb-spst
# 4      4        st9-ba          stb-spst

# Example 2
#   row_id     ab sa-ac     spst st2-ab st9-ba sta-spst stb-spst
# 1      1 st1-ab       std-spst               sta-spst stb-spst
# 2      2 st4-ab       stc-spst st2-ab        sta-spst         
# 3      3 st7-ab sa-ac                                 stb-spst
# 4      4                              st9-ba          stb-spst 

# Example 3
#   row_id     ab sa-ac sta-spst stb-spst std-spst
# 1      1 st1-ab       sta-spst stb-spst std-spst
# 2      2 sta-ab                stb-spst         
# 3      3 st7-ab sa-ac sta-spst                  
# 4      4              sta-spst stb-spst   

# Example 4
#   row_id sa-ac st1-ab sta-spst stb-spst std-spst
# 1      1       st1-ab sta-spst stb-spst std-spst
# 2      2       st1-ab          stb-spst         
# 3      3 sa-ac st7-ab sta-spst                  
# 4      4       st7-ab sta-spst stb-spst         
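The shuffle step boils down to one test per pair of same-suffix columns: the source column may be folded into the target only if every row where the source has a value is empty in the target. A minimal sketch of that test follows, using the three spst columns of example 1 as plain vectors; the helper names can_merge and merge_cols are hypothetical and do not appear in the code above.

```r
# TRUE if every non-empty row of `src` is empty in `dst`
can_merge <- function(src, dst) all(dst[src != ""] == "")

# Keep dst where it has a value, otherwise take src
merge_cols <- function(src, dst) ifelse(dst != "", dst, src)

# Columns of example 1 before compaction
spst     <- c("",         "stc-spst", "",         "")
sta_spst <- c("sta-spst", "sta-spst", "",         "")
stb_spst <- c("stb-spst", "",         "stb-spst", "stb-spst")

can_merge(spst, sta_spst)  # FALSE: row 2 is already taken
can_merge(spst, stb_spst)  # TRUE: row 2 is free

merged <- merge_cols(spst, stb_spst)
merged
# "stb-spst" "stc-spst" "stb-spst" "stb-spst"
```

The merged vector matches the stb-spst column shown in the Example 1 output.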

【Discussion】:

I see the problem; I had misunderstood some of the goals. I'll make one last attempt.