Extracting duplicate words from a string in R: answers

【Question title】: R Extract duplicate words in string
【Posted】: 2017-02-04 14:25:59
【Question】:

I have the strings a and b, which make up my data. My goal is to get a new variable containing the duplicated words.

a = c("the red house av", "the blue sky", "the green grass")
b = c("the house built", " the sky of the city", "the grass in the garden")

data = data.frame(a, b)

Based on this answer, I can get the logical values that mark duplicates with duplicated():

library(dplyr)

data = data %>% mutate(c = paste(a, b, sep = " "),
                       d = vapply(lapply(strsplit(c, " "), duplicated), paste, character(1L), collapse = " "))

But I cannot get the words themselves. The data I want should look like this:

> data.1
                 a                       b         d
1 the red house av         the house built the house
2     the blue sky     the sky of the city   the sky
3  the green grass the grass in the garden the grass

Any help with the above would be greatly appreciated.

【Comments】:

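You're almost there. Your current d is built from a logical vector that is TRUE for duplicated words and FALSE for the rest; just use that vector to subset the words of c. E.g. change duplicated to function (x) x[duplicated(x)].
Thanks a lot @mathematical.coffee. Like this: data = data %>% mutate(c = paste(a, b, sep = " "), d = vapply(lapply(strsplit(c, " "), function (x) x[duplicated(x)]), paste, character(1L), collapse = " "))
@Edu Be careful. Your function tokenizes the words after pasting them together, which means it cannot tell whether a word came from a or from b. See what happens if you first update the data and then run the function: data$a[1] <- "the red house av av". "av" shows up as a duplicate even though it does not appear in b.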
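
The pitfall described in the last comment can be reproduced on a single pair of strings. The following is a minimal base R sketch, not code from the thread: the helper names dup_after_paste and shared_words are hypothetical, chosen only to contrast the paste-then-duplicated() approach with a per-column intersect().

```r
# Hypothetical helper: paste first, then keep the tokens flagged by duplicated().
# This cannot tell which string a token came from.
dup_after_paste <- function(a, b) {
  words <- strsplit(paste(a, b), " ")[[1]]
  paste(words[duplicated(words)], collapse = " ")
}

# Hypothetical helper: split each string separately and keep only words
# that occur in both.
shared_words <- function(a, b) {
  paste(intersect(strsplit(a, " ")[[1]], strsplit(b, " ")[[1]]), collapse = " ")
}

a1 <- "the red house av av"
b1 <- "the house built"

dup_after_paste(a1, b1)  # "av the house": "av" is repeated only within a1
shared_words(a1, b1)     # "the house"
```

Because intersect() compares the two token sets, a word repeated inside a alone is never reported.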

Tags: r string duplicates

【Solution 1】:

a = c("the red house av", "the blue sky", "the green grass")
b = c("the house built", " the sky of the city", "the grass in the garden")

data <- data.frame(a, b, stringsAsFactors = FALSE)

# For a single row, keep the words of a that also occur in b
func <- function(dta) {
    words <- intersect( unlist(strsplit(dta$a, " ")), unlist(strsplit(dta$b, " ")) )
    dta$c <- paste(words, collapse = " ")
    return( as.data.frame(dta, stringsAsFactors = FALSE) )
}

library(dplyr)
# Apply func to each row and bind the results back together
data %>% rowwise() %>% do( func(.) )

Result:

#Source: local data frame [3 x 3]
#Groups: <by row>
#
## A tibble: 3 x 3
#                 a                       b         c
#*            <chr>                   <chr>     <chr>
#1 the red house av         the house built the house
#2     the blue sky     the sky of the city   the sky
#3  the green grass the grass in the garden the grass


【Solution 2】:

Here is another attempt using base R (no packages needed):

df$c <- apply(df,1,function(x) 
               paste(Reduce(intersect, strsplit(x, " ")), collapse = " "))

#                  a                       b         c
# 1 the red house av         the house built the house
# 2     the blue sky     the sky of the city   the sky
# 3  the green grass the grass in the garden the grass

Data

df <- structure(list(a = c("the red house av", "the blue sky", "the green grass"
), b = c("the house built", " the sky of the city", "the grass in the garden"
)), .Names = c("a", "b"), row.names = c(NA, -3L), class = "data.frame")
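
One detail worth noting about the data: the second element of b starts with a space, so strsplit(x, " ") returns an empty string as its first token. That is harmless here, but it could produce a spurious match if both columns contained stray spaces. The following is a sketch of a whitespace-tolerant variant under that assumption; common_words and df_demo are hypothetical names, not part of either answer.

```r
# Hypothetical helper: element-wise shared words of two character vectors.
# Leading/trailing whitespace is trimmed and runs of whitespace are treated
# as a single separator, so no empty tokens are produced.
common_words <- function(a, b) {
  split_words <- function(x) strsplit(trimws(as.character(x)), "[[:space:]]+")
  mapply(function(wa, wb) paste(intersect(wa, wb), collapse = " "),
         split_words(a), split_words(b), USE.NAMES = FALSE)
}

# Same data as in the question, rebuilt here so the sketch is self-contained
df_demo <- data.frame(
  a = c("the red house av", "the blue sky", "the green grass"),
  b = c("the house built", " the sky of the city", "the grass in the garden"),
  stringsAsFactors = FALSE
)

df_demo$c <- common_words(df_demo$a, df_demo$b)
df_demo$c
# "the house" "the sky" "the grass"
```

Rows with no shared word yield an empty string, because paste() on a zero-length vector with collapse returns "".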
