【Question title】: Count the transpositions a string needs so that it can be found in another string
【Posted】: 2015-06-08 18:59:21
【Question】:

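Here is what I am trying to do: when the term I am analysing is "apples", I want to know how many transpositions "apples" needs so that it can be found in a string.

"buy apples now" => 0 transpositions needed (apples is present).

"cheap aples online" => 1 transposition needed (apples to aples).

"find your ap ple here" => 2 transpositions needed (apples to ap ple).

"aple" => 2 transpositions needed (apples to aple).

"bananas" => 5 transpositions needed (apples to bananas).

The stringdist and adist functions do not work for this, because they tell me how many transpositions are needed to turn one whole string into the other. Anyway, here is what I have written so far: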

# build the data frame
a <- c(rep("apples", 5), rep("bananas", 3))
b <- c("buy apples now", "cheap aples online", "find your ap ple here", "aple", "bananas", "cherry and bananas", "pumpkin", "banana split")
d <- data.frame(a, b, stringsAsFactors = FALSE)
colnames(d) <- c("term", "string")

# count transpositions needed (adist compares each term with the whole string)
d$transpositions <- mapply(adist, d$term, d$string)
print(d)
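
To make the problem concrete, here is a minimal sketch of what the whole-string comparison returns. The variable names `term`, `strings` and `full` are illustrative and not part of the original post. Note that `adist` computes a generalized Levenshtein distance (insertions, deletions and substitutions), so every extra character of the sentence is counted as an edit.

```r
# Whole-string edit distance between the term and each full string.
term <- "apples"
strings <- c("buy apples now", "aple")

# adist() returns a 1 x 2 matrix here; as.vector() flattens it.
full <- as.vector(adist(term, strings))
names(full) <- strings

# "buy apples now" scores 8 because "buy " and " now" (8 characters)
# would have to be inserted, even though "apples" itself is present.
print(full)
```

This is why the score for a sentence grows with the number of surrounding words rather than with the misspelling of the term.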

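【Comments on the question】:

  • OK, thanks. Should I add it to the title as well, or is the tag enough?
  • I edited your code (in my answer) to use apples in a <- c(rep("apples",5),rep("bananas",3))
  • Oops, thanks infominer, let me correct the question too!

Tags: r string-comparison string-matching stringdist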


【Solution 1】:

You need to check for apples first, and only then count the transpositions

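a <- c(rep("apples", 5), rep("bananas", 3))
b <- c("buy apples now", "cheap aples online", "find your ap ple here", "aple", "bananas", "cherry and bananas", "pumpkin", "banana split")
d <- data.frame(a, b, stringsAsFactors = FALSE)
colnames(d) <- c("term", "string")

# check first whether each row's term occurs verbatim in its string
# (the original check was hard-coded to "apples", which missed the bananas rows)
d$found <- mapply(grepl, d$term, d$string, MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

# count transpositions needed: 0 when the term was found, whole-string adist otherwise
d$transpositions <- ifelse(d$found, 0, mapply(adist, d$term, d$string))
print(d)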

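【Comments】:

  • Hmm, I just reread your question and will have to rethink my answer. I will post it once I have worked on it. How do you want to handle sentences as opposed to single-word transpositions?
  • Thanks @infominer! Much appreciated :) grepl is useful. Indeed, the first step is to detect whether the correctly spelled term is present in the string. If it is not, I need to isolate the piece of the string that is most similar to my term, and finally compute how similar that piece is to the term. Regarding sentences versus "one word": I want to avoid "buy aple now" scoring worse than "aple" just because "buy" and "now" are extra words. What matters is how similar the "aple" part of "buy aple now" is to the term "apple".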
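
The requirement in the last comment (extra words should not worsen the score) is close to what the `partial = TRUE` argument of `utils::adist` is documented to do: the term only has to match a substring of the target. The following is a minimal sketch under that assumption, not something proposed in the thread; the names `terms`, `strings` and `partial_dist` are illustrative, and the counts have not been checked against every expected value listed in the question.

```r
# Approximate substring distance: the term only needs to match
# some substring of each string, so surrounding words are free.
terms   <- c("apples", "apples", "apples")
strings <- c("buy apples now", "buy aple now", "aple")

partial_dist <- mapply(
  function(term, string) as.vector(adist(term, string, partial = TRUE)),
  terms, strings,
  USE.NAMES = FALSE
)
names(partial_dist) <- strings

# "buy aple now" and "aple" should now receive the same score.
print(partial_dist)
```

Because `adist` counts insertions, deletions and substitutions, the result is an edit distance rather than a count of transpositions in the strict sense.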
【Solution 2】:

So, here is the dirty solution I have come up with so far:

library(stringr)  # provides str_trim()

# create a data.frame
a <- c(rep("apples", 5), rep("banana split", 3))
b <- c("buy apples now", "cheap aples online", "find your ap ple here", "aple", "bananas", "cherry and bananas", "pumpkin", "banana split")
d <- data.frame(a, b, stringsAsFactors = FALSE)
colnames(d) <- c("term", "string")

# Split each string into sequences of consecutive characters whose length
# equals the length of the term on the same row. Compute the distance of each
# sequence to the term and keep the most relevant piece of string for each row.

mostrelevantpiece <- character(nrow(d))

for (j in seq_len(nrow(d))) {
  termlength <- nchar(d$term[j])
  # at least one window, even when the string is shorter than the term
  nwindows <- max(nchar(d$string[j]) - termlength + 1, 1)
  pieces <- character(nwindows)
  piecesdist <- numeric(nwindows)
  for (i in seq_len(nwindows)) {
    addpiece <- substr(d$string[j], i, i + termlength - 1)
    pieces[i] <- str_trim(addpiece)
    piecesdist[i] <- adist(addpiece, d$term[j])
  }
  # pick the best piece once all windows of this row have been scored
  mostrelevantpiece[j] <- pieces[which.min(piecesdist)]
}

# calculate the number of transpositions needed to transform the
# "most relevant piece of string" into the term.

d$transpositionsneeded <- mapply(adist, mostrelevantpiece, d$term)
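
The same sliding-window idea can be sketched as a small base-R function. This is only an illustration: `best_window` is a hypothetical helper name, and it uses `trimws` from base R in place of `str_trim`, so no extra package is assumed.

```r
# Hypothetical helper: find the window of the string that is closest
# to the term, then score the trimmed window against the term.
best_window <- function(term, string) {
  n <- nchar(term)
  # At least one window, even when the string is shorter than the term.
  starts <- seq_len(max(nchar(string) - n + 1, 1))
  windows <- substring(string, starts, starts + n - 1)
  # Closest window (the first one on ties), with surrounding spaces removed.
  best <- trimws(windows[which.min(adist(term, windows))])
  list(piece = best, distance = as.vector(adist(best, term)))
}

result <- best_window("apples", "cheap aples online")
print(result)
```

Trimming the chosen window before the final comparison matters: a window such as " aples" carries a space that would otherwise be counted as an extra edit.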

【Comments】:
