[Posted]: 2016-08-08 04:20:46
[Question]:
I have two data frames:
word_table <-
word_9 word_1 word_3 ...word_random
word_2 na na ...word_random
word_5 word_3 na ...word_random
dictionary_words <-
word_2
word_3
word_4
word_6
word_7
word_8
word_9
.
.
.
word_n
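What I am looking for is to match word_table against dictionary_words and replace each word with its position in the dictionary (na where the word is not in the dictionary), like this: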
result <-
7 na 2 ...
1 na na ...
na 2 na ...
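I tried the pmatch, charmatch, and match functions. When dictionary_words is short, they return result correctly. When it is relatively long, for example more than 20000 words, matches appear only in the first column and the remaining columns simply become na, like this: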
result <-
7 na na ...
1 na na ...
na na na ...
Is there another way to do this character matching, for example with one of the apply functions?
Sample:
## Use `=` (not `<-`) inside data.frame() so the arguments become column names
word_table <- data.frame(word_1 = c("conflict","", "resolved", "", "", ""), word_2 = c("", "one", "tricky", "one", "", "one"),
word_3 = c("thanks","", "", "comments", "par",""), word_4 = c("thanks","", "", "comments", "par",""), word_5 = c("", "one", "tricky", "one", "", "one"), stringsAsFactors = FALSE)
colnames(word_table) <- c("word_1", "word_2", "word_3", "word_4", "word_5")
## Targeted Words
dictionary_words <- data.frame(cbind(c("abovementioned","abundant","conflict", "thanks", "tricky", "one", "two", "three","four", "resolved")))
## convert into matrix (if needed)
word_table <- as.matrix(word_table)
dictionary_words <- as.matrix(dictionary_words)
## pmatch for each element of the data frame (dt)
# matched_table <- pmatch(dt, TargetWord)
# dim(matched_table) <- dim(dt)
# print(matched_table)
result <- `dim<-`(pmatch(word_table, dictionary_words, duplicates.ok=TRUE), dim(word_table))
print(result) # works fine here, but when dictionary_words is large it returns results only for the first column of word_table
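On the apply-function part of the question, here is a minimal sketch of a column-by-column lookup. It assumes exact matching is what is wanted: match() only matches whole strings, whereas pmatch() also accepts partial matches. It uses a shortened, three-column version of the sample above with the dictionary held as a plain character vector. It is not a diagnosis of the large-data behaviour, which the sample does not reproduce.

```r
# Shortened copy of the sample data (three columns only, for illustration)
word_table <- data.frame(
  word_1 = c("conflict", "", "resolved", "", "", ""),
  word_2 = c("", "one", "tricky", "one", "", "one"),
  word_3 = c("thanks", "", "", "comments", "par", ""),
  stringsAsFactors = FALSE
)

# Dictionary kept as a plain character vector (an assumption of this sketch)
dictionary_words <- c("abovementioned", "abundant", "conflict", "thanks",
                      "tricky", "one", "two", "three", "four", "resolved")

# vapply() walks over the columns of the data frame; match() returns the
# dictionary position of each word, or NA when the word is not found
result <- vapply(
  word_table,
  function(col) match(col, dictionary_words),
  integer(nrow(word_table))
)

print(result)
```

The result is an integer matrix with one column per column of word_table, so no dim<- step is needed afterwards.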
[Comments]:
-
Welcome! It is best to post your question together with a reproducible example.
-
Can you show your code? Have you tried "dim<-"(match(as.matrix(word_table), dictionary_words[,1]), dim(word_table))?
-
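Thanks vincent. It is actually hard for me to show a reproducible example because, as I mentioned, it works fine with a relatively small data frame, but with a large data frame it returns results for the first column only. Please see the sample I have edited in.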
-
You don't need data.frame(cbind(...)); data.frame(V1 = c(...)) is enough. It is also better to use stringsAsFactors = FALSE to avoid converting the columns to factor.
-
Could you post the str of the original dataset?
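The suggestions in the comments above can be put together in one small sketch: build the tables directly with data.frame(), keep the columns as character with stringsAsFactors = FALSE, check the structure with str(), and then use match() with dim<- as proposed. The data below is a shortened version of the sample in the question, and V1 is just the column name used in the comment. Whether this changes the behaviour on the original large dataset is unknown.

```r
# Build each table directly; no cbind() wrapper is needed
word_table <- data.frame(
  word_1 = c("conflict", "", "resolved"),
  word_2 = c("", "one", "tricky"),
  stringsAsFactors = FALSE
)
dictionary_words <- data.frame(
  V1 = c("abovementioned", "abundant", "conflict", "thanks", "tricky",
         "one", "two", "three", "four", "resolved"),
  stringsAsFactors = FALSE
)

# Inspect the structure: the columns should be chr, not Factor
str(word_table)
str(dictionary_words)

# Exact matching over the whole table at once; match() flattens the matrix
# column by column, so dim<- restores the original shape
positions <- match(as.matrix(word_table), dictionary_words[, 1])
result <- `dim<-`(positions, dim(word_table))

print(result)
```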
Tags: r string character map-matching