【Question title】: R: Extract duplicate words in string
【Posted】: 2017-02-04 14:25:59
【Question】:

I have two string vectors, a and b, that make up my data. My goal is to create a new variable containing the words that are repeated.

    a = c("the red house av", "the blue sky", "the green grass")
    b = c("the house built", " the sky of the city", "the grass in the garden")

    data = data.frame(a, b)

Based on this answer, I can get the logical values that mark the repeats with duplicated():

    data = data %>% mutate(c = paste(a, b, sep = " "),
                           d = vapply(lapply(strsplit(c, " "), duplicated), paste, character(1L), collapse = " "))

But I cannot get the words themselves. The data I want should look like this:

> data.1
                 a                       b         d
1 the red house av         the house built the house
2     the blue sky     the sky of the city   the sky
3  the green grass the grass in the garden the grass

Any help with the above would be greatly appreciated.

【Comments】:

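  • You're almost there. Your current d is built from a logical vector that is TRUE for duplicated words and FALSE for non-duplicated ones; just use it to subset c. E.g. change duplicated to function (x) x[duplicated(x)]
  • Thank you very much @mathematical.coffee. Like this: data = data %>% mutate(c = paste(a, b, sep = " "), d = vapply(lapply(strsplit(c, " "), function (x) x[duplicated(x)]), paste, character(1L), collapse = " "))
  • @Edu Be careful. Your function tokenizes the words after pasting them together, which means it cannot tell whether a word came from a or from b. See what happens if you first update the data and then run the function: data$a[1] <- "the red house av av". "av" shows up as a duplicate even though it does not appear in b.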
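
The caveat in the last comment can be reproduced on a single row. The following is a minimal sketch, not code from the thread: `a1`, `b1`, `dup_pasted` and `common` are illustrative names. It compares the paste-then-duplicated() approach with tokenizing each string separately and intersecting the two sets of words.

```r
# Hypothetical single-row example; a1 and b1 are illustrative names.
a1 <- "the red house av av"
b1 <- "the house built"

# Approach from the comments: paste first, then keep tokens flagged by duplicated().
tokens <- strsplit(paste(a1, b1), " ")[[1]]
dup_pasted <- tokens[duplicated(tokens)]

# Alternative: tokenize each string on its own and keep only words found in both.
common <- intersect(strsplit(a1, " ")[[1]], strsplit(b1, " ")[[1]])

print(dup_pasted)  # includes "av", which is repeated only within a1
print(common)      # only words that occur in both a1 and b1
```

Because intersect() works on the two token sets independently, a word repeated inside a1 alone is not reported.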

Tags: r string duplicates


【Solution 1】:
a = c("the red house av", "the blue sky", "the green grass")
b = c("the house built", " the sky of the city", "the grass in the garden")

data <-  data.frame(a, b, stringsAsFactors = FALSE)

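# Return the row with a new column c holding the words common to a and b.
func <- function(dta) {
    # intersect() keeps each shared word once, in the order it appears in a
    words <- intersect( unlist(strsplit(dta$a, " ")), unlist(strsplit(dta$b, " ")) )
    dta$c <- paste(words, collapse = " ")
    return( as.data.frame(dta, stringsAsFactors = FALSE) )
}

library(dplyr)
# Apply func to each row; do() binds the one-row results back together.
data %>% rowwise() %>% do( func(.) )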

Result:

#Source: local data frame [3 x 3]
#Groups: <by row>
#
## A tibble: 3 x 3
#                 a                       b         c
#*            <chr>                   <chr>     <chr>
#1 the red house av         the house built the house
#2     the blue sky     the sky of the city   the sky
#3  the green grass the grass in the garden the grass

【Comments】:

    【Solution 2】:

    Here is another attempt using base R (no packages required):

    df$c <- apply(df,1,function(x) 
                   paste(Reduce(intersect, strsplit(x, " ")), collapse = " "))
    
                     # a                       b         c
    # 1 the red house av         the house built the house
    # 2     the blue sky     the sky of the city   the sky
    # 3  the green grass the grass in the garden the grass
    

    Data

    df <- structure(list(a = c("the red house av", "the blue sky", "the green grass"
    ), b = c("the house built", " the sky of the city", "the grass in the garden"
    )), .Names = c("a", "b"), row.names = c(NA, -3L), class = "data.frame")
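
    Splitting on a single space turns leading or doubled spaces into empty tokens (note the leading space in the second element of b). That does no harm with the data above, but an empty token shared by both columns would end up in the result. A hedged sketch of a more tolerant variant follows; `shared_words` and `split_words` are hypothetical helper names, and the doubled space in b is added here only for illustration.

```r
# Hypothetical helper: words shared by two strings, ignoring extra whitespace.
shared_words <- function(x, y) {
  split_words <- function(s) {
    # Trim the ends, then split on runs of whitespace so no empty tokens appear.
    strsplit(trimws(s), "\\s+")[[1]]
  }
  paste(intersect(split_words(x), split_words(y)), collapse = " ")
}

a <- c("the red house av", "the blue sky", "the green grass")
# Same data as above, with a doubled space added to the second element.
b <- c("the house built", " the sky  of the city", "the grass in the garden")

# Apply the helper pairwise over the two vectors.
res <- mapply(shared_words, a, b, USE.NAMES = FALSE)
print(res)
```

    The result could be assigned to a new column, for example df$c <- res, in the same way as in the apply() version.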
    

    【Comments】:
