How to make R count the number of characters in an element in a dataframe? (Answers)

【Question title】: How to make R count the number of characters in an element in a dataframe?
【Posted】: 2018-06-20 19:04:20
【Question description】:

structure(list(Switch = c("4", "3"), `1` = c("1, 2, 3, 4", 
NA), `2` = c("1, 2, 3, 4", NA), `3` = c("1, 2, 3, 4, 6, 7", 
NA), `4` = c("1, 2, 3, 4, 5, 6", NA), `5` = c("1, 2, 3, 4", 
"1"), `6` = c("1, 2, 3, 4", NA
)), .Names = c("Switch", "1", "2", "3", "4", "5", 
"6"), row.names = 1:2, class = "data.frame")

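Given the data frame above, I would like R to count how many numbers (separated by commas) each element contains. For example, an element containing the list 1, 2, 3, 4 contains 4 numbers.

For each row, I would like R to compute the total count before the switch year (given in column 1) and the total count after the switch year.

Take the first row as an example: the switch year is 4, and there are 4 distinct numbers in year 1, 4 in year 2, and 6 in year 3. So R should put the total 4+4+6=14 in a new column. It should then do the same for the years after the switch year (years 5 and 6) and put that sum in a second new column.

One of my searches suggested the function stri_extract_all_regex from the stringi package, but I could only get it to work for one column/year at a time, and it also seems to count NA values, which it should not do.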
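As an aside on the NA issue: one plausible reason is that splitting or extracting from an NA yields a one-element result, so taking its length reports 1 rather than 0. This is easy to reproduce in base R with `strsplit()`. The sketch below is a minimal illustration, not the approach from the answers; `count_numbers` is a hypothetical helper name introduced here.

```r
# Hypothetical helper: count comma-separated items per element, treating NA as 0.
count_numbers <- function(x) {
  # strsplit() returns list(NA) for an NA input, so its length would be 1
  n <- lengths(strsplit(x, ",", fixed = TRUE))
  n[is.na(x)] <- 0L
  n
}

count_numbers(c("1, 2, 3, 4", NA, "1"))
```

Resetting the NA positions explicitly keeps missing cells from contributing to any later row sums.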

The code below gives the expected output:

structure(list(Switch = c("4", "3"), `1` = c("1, 2, 3, 4", 
NA), `2` = c("1, 2, 3, 4", NA), `3` = c("1, 2, 3, 4, 6, 7", 
NA), `4` = c("1, 2, 3, 4, 5, 6", NA), `5` = c("1, 2, 3, 4", 
"1"), `6` = c("1, 2, 3, 4", NA
), `Before` = c("15", 0), `After` = c("8", 1)
), .Names = c("Switch", "1", "2", "3", "4", "5", 
"6", "Before", "After"), row.names = 1:2, class = "data.frame")

【Comments】:

Can you post the expected output?
lapply(df, stringi::stri_count_words)?
@Moody_Mudskipper. I have updated the question to include the expected output.

Tags: r

【Solution 1】:

Another stringi solution:

library(stringi)

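# For each row, count the items per year, then sum them on either side of the switch year
df[c("before","after")] <-
  t(apply(df, 1, function(row) {
    counts <- stri_count_words(row[-1])  # NA cells stay NA
    switch_year <- as.numeric(row[1])
    c(sum(head(counts, switch_year - 1), na.rm = TRUE),  # years before the switch
      sum(tail(counts, -switch_year), na.rm = TRUE))     # years after the switch
  }))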

#   Switch          1          2                3                4          5          6 before after
# 1      4 1, 2, 3, 4 1, 2, 3, 4 1, 2, 3, 4, 6, 7 1, 2, 3, 4, 5, 6 1, 2, 3, 4 1, 2, 3, 4     14     8
# 2      3       <NA>       <NA>             <NA>             <NA>          1       <NA>      0     1
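The `head()`/`tail()` calls carry the before/after split, so it may help to see them in isolation. The sketch below uses the per-year counts from row 1 as a plain vector (the variable names are illustrative only). `head(v, n)` keeps the first `n` items, while `tail(v, -n)` drops the first `n`, which here also skips the switch year itself.

```r
counts <- c(4, 4, 6, 6, 4, 4)  # per-year counts for row 1 (years 1 to 6)
switch_year <- 4

before <- head(counts, switch_year - 1)  # years 1 to 3
after  <- tail(counts, -switch_year)     # years 5 and 6; year 4 is dropped

sum(before)  # 14
sum(after)   # 8

# Edge case: a switch year of 1 leaves nothing before it
sum(head(counts, 0))  # 0
```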

【Comments】:

【Solution 2】:

library(stringi)

df2 <- df
# Count words and coerce to numeric
df2[-1] <- lapply(df2[-1], stri_count_words)
df2[1]  <- lapply(df2[1], as.numeric)
# For each row, sum the number of words before (part1) and after (part2) the switch year
newcols <- 
apply(t(df2), 2, function(x){ 
  part1 <- x[-1][seq_len(x[1] - 1)]  # seq_len() selects nothing when Switch is 1
  part2 <- x[-1][-seq_len(x[1])]
  list(before = sum(part1, na.rm = TRUE),
       after  = sum(part2, na.rm = TRUE))})


cbind(df, do.call(rbind, newcols))


#   Switch          1          2                3                4          5          6
# 1      4 1, 2, 3, 4 1, 2, 3, 4 1, 2, 3, 4, 6, 7 1, 2, 3, 4, 5, 6 1, 2, 3, 4 1, 2, 3, 4
# 2      3       <NA>       <NA>             <NA>             <NA>          1       <NA>
#   before after
# 1     14     8
# 2      0     1

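【Comments】:

Hi Ryan, there are a couple of minor issues. First, I think a plus one is being added to the result in the "Before" column; this may be my fault, as I had a mistake in the example output (though in my case the mix-up was between 14 and 15). Second, the code gives strange results for Switch values at the boundaries of the column names. More specifically, when we try it on a larger data frame with 11 Switch columns and 3 more columns before the Switch column, it miscalculates "After". Can you help fix these issues?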
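For layouts like the one described in this comment, where extra columns sit before `Switch`, one option is to select the year columns by name rather than by position. The following is only a sketch under stated assumptions, not a fix confirmed by the answer's author: `count_before_after`, `switch_col`, `year_cols`, and the `id` column are hypothetical names, the year columns are assumed to be named with their year numbers, and the switch year itself is excluded from both sums, as in the answers above.

```r
# Hypothetical helper: sum item counts before/after each row's switch year,
# locating year columns by name so that other columns can sit anywhere.
count_before_after <- function(df, switch_col = "Switch", year_cols) {
  years <- as.numeric(year_cols)
  switch_year <- as.numeric(df[[switch_col]])
  # Items per cell; NA cells count as 0
  counts <- vapply(df[year_cols], function(col) {
    n <- lengths(strsplit(as.character(col), ",", fixed = TRUE))
    n[is.na(col)] <- 0L
    n
  }, integer(nrow(df)))
  counts <- matrix(counts, nrow = nrow(df))  # stay a matrix even for one row
  # year_grid[i, j] holds the year represented by column j
  year_grid <- matrix(years, nrow = nrow(df), ncol = length(years), byrow = TRUE)
  # switch_year recycles down the rows, so row i is compared with its own switch year
  df$before <- rowSums(counts * (year_grid < switch_year))
  df$after  <- rowSums(counts * (year_grid > switch_year))
  df
}

# Example data from the question, plus an illustrative leading "id" column
df <- data.frame(
  id = c("a", "b"), Switch = c("4", "3"),
  `1` = c("1, 2, 3, 4", NA), `2` = c("1, 2, 3, 4", NA),
  `3` = c("1, 2, 3, 4, 6, 7", NA), `4` = c("1, 2, 3, 4, 5, 6", NA),
  `5` = c("1, 2, 3, 4", "1"), `6` = c("1, 2, 3, 4", NA),
  check.names = FALSE, stringsAsFactors = FALSE
)
res <- count_before_after(df, year_cols = as.character(1:6))
res[c("before", "after")]
```

Comparing year labels with the switch value avoids index arithmetic altogether, so a switch year at either end of the range simply produces an empty side with a sum of 0.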