Applying "String matching to estimate similarity" to data frame

【Question title】: Applying "String matching to estimate similarity" to data frame
【Posted】: 2023-04-02 11:38:01
【Question description】:

String matching to estimate similarity

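The code in the thread above is exactly what I am looking for, except that I can't figure out how to compare strings between columns of a data frame (the "correct" answer and the "given" answer) and then store the output of sim.per as a new column ("similarity") in the same data frame. I have tried, for example,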

df$similarity <- sim.per(df$answer, df$given) 

df$similarity <- mapply(sim.per, df$answer, df$given)

The latter also throws an error when a row is empty. Empty rows are acceptable in my dataset and should be scored as 0.

Error in str2[[1]] : subscript out of bounds

The expected output should be:

    answer                   given                              similarity
1   Best way to waste money  Instrument to waste money and time 0.6
2   Roy travels to Africa    He is in Africa                    0.25
3   I go to work                                                0

Any help would be greatly appreciated! Thanks!

A subset of the data:

df <- structure(list(trial = 1:10, answer = structure(c(9L, 2L, 4L, 7L, 1L, 5L, 3L, 6L, 8L, 10L), .Label = c("Best way to waste money", "He ran out of money, so he had to stop playing poker", "I go to work", "Lets all be unique together until we realise we are all the same", "Roy travels to Africa", "She borrowed the book from him many years ago and did not returned it yet", "She did her best to help him", "Students did not cheat on the test, for it was not the right thing to do", "The stranger officiates the meal", "We have a lot of rain in June"), class = "factor"), given = structure(c(10L, 3L, 6L, 8L, 4L, 2L, 1L, 7L, 9L, 5L), .Label = c("", "He is in Africa Roy", "He lost money because he had played poker", "Instrument to waste money and time", "It was raining in June", "People are unique until they try to fit in", "She borrowed the book from the library and forgot to return it", "She did her very best to help him out", "Students know not to cheat", "The guests ate the meal"), class = "factor")), class = "data.frame", row.names = c(NA, -10L))

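【Comments】:

Can you provide a sample subset of the data you are working with? Please use dput(sample_data) and then copy and paste the result from the console
I have edited the post to include a sample of the data

Tags: r string text-mining text-analysis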

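【Solution 1】:

Here is an example that uses tidyverse syntax to avoid a manual loop and keep things more concise, and possibly faster. In particular, the formatting step is vectorised, so only the score calculation needs to iterate.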

library(tidyverse)

df <- structure(list(trial = 1:10, answer = structure(c(9L, 2L, 4L, 7L, 1L, 5L, 3L, 6L, 8L, 10L), .Label = c("Best way to waste money", "He ran out of money, so he had to stop playing poker", "I go to work", "Lets all be unique together until we realise we are all the same", "Roy travels to Africa", "She borrowed the book from him many years ago and did not returned it yet", "She did her best to help him", "Students did not cheat on the test, for it was not the right thing to do", "The stranger officiates the meal", "We have a lot of rain in June"), class = "factor"), given = structure(c(10L, 3L, 6L, 8L, 4L, 2L, 1L, 7L, 9L, 5L), .Label = c("", "He is in Africa Roy", "He lost money because he had played poker", "Instrument to waste money and time", "It was raining in June", "People are unique until they try to fit in", "She borrowed the book from the library and forgot to return it", "She did her very best to help him out", "Students know not to cheat", "The guests ate the meal"), class = "factor")), class = "data.frame", row.names = c(NA, -10L))

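# Normalise each string, then split it into words.
# Every step is vectorised, so whole columns can be passed in.
format_str <- function(string) {
  string %>%
    str_to_lower() %>%               # ignore case
    str_remove_all("[:punct:]") %>%  # drop punctuation
    str_squish() %>%                 # trim and collapse whitespace
    str_split(" ")                   # one vector of words per string
}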

df %>%
  mutate(
    similarity = map2_dbl(
      .x = format_str(answer),
      .y = format_str(given),
      .f = ~ length(intersect(.x, .y)) / length(.x)
    )
  ) %>%
  as_tibble
#> # A tibble: 10 x 4
#>    trial answer                        given                    similarity
#>    <int> <fct>                         <fct>                         <dbl>
#>  1     1 The stranger officiates the ~ The guests ate the meal       0.4  
#>  2     2 He ran out of money, so he h~ He lost money because h~      0.333
#>  3     3 Lets all be unique together ~ People are unique until~      0.231
#>  4     4 She did her best to help him  She did her very best t~      1    
#>  5     5 Best way to waste money       Instrument to waste mon~      0.6  
#>  6     6 Roy travels to Africa         He is in Africa Roy           0.5  
#>  7     7 I go to work                  ""                            0    
#>  8     8 She borrowed the book from h~ She borrowed the book f~      0.467
#>  9     9 Students did not cheat on th~ Students know not to ch~      0.25 
#> 10    10 We have a lot of rain in June It was raining in June        0.25

Created on 2018-08-17 by the reprex package (v0.2.0).
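To make the score easier to check by hand, here is a minimal base R sketch of the calculation for a single row (row 5 of the sample data). The helper name words is hypothetical and only used for illustration; it mirrors the lower-case, strip-punctuation, split-on-space steps described in this answer.

```r
# Hypothetical helper for illustration: lower-case, drop punctuation, split on spaces
words <- function(x) strsplit(gsub("[[:punct:]]", "", tolower(x)), " ")[[1]]

answer <- words("Best way to waste money")
given  <- words("Instrument to waste money and time")

# Distinct words that occur in both sentences
shared <- intersect(answer, given)
shared                                            # "to" "waste" "money"

# The score is the share of the answer's words that also appear in the given text
score_answer <- length(shared) / length(answer)   # 3 / 5 = 0.6

# Dividing by the other sentence's length gives a different value,
# so the order of the two arguments matters
score_given <- length(shared) / length(given)     # 3 / 6 = 0.5
```

Because the denominator is the word count of the first sentence only, a given answer that contains every word of the correct answer plus extra words (row 4 of the sample data) still scores 1.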

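【Discussion】:

Just an update: I tried this code on my larger dataset, and it does not seem to work the way the for loop does
I am not getting the correct similarity calculations. I have compared the similarities produced by the two methods (and even coded the sentences by hand), and this code is consistently off. For example, a sentence that should get a score of 1 (a 100% match) is calculated as 0.9285714. Compared with the other code, the differences range from -10% to 15%
Can you give some examples of the sentences that come out wrong? This currently gives the same results as the other answer for your sample data.
Thanks for all your continued help. Here too, it was leading, trailing, and in-between whitespace that caused the inconsistency. I added the str_squish line to the code, and that fixed the problem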
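The whitespace effect mentioned in this discussion can be reproduced with a small base R sketch. The sentence below is made up for illustration, and the link to the reported 0.9285714 is only one plausible reading (13/14 happens to equal that value); the thread does not confirm which sentence produced it.

```r
# Hypothetical 13-word sentence, once clean and once with an accidental leading space
clean  <- "one two three four five six seven eight nine ten eleven twelve thirteen"
padded <- paste0(" ", clean)

tokens_clean  <- strsplit(clean, " ")[[1]]
tokens_padded <- strsplit(padded, " ")[[1]]

length(tokens_clean)    # 13
length(tokens_padded)   # 14: the leading space produces an extra empty "" token

# A perfect response now scores below 1, because the empty token inflates the denominator
score_raw <- length(intersect(tokens_padded, tokens_clean)) / length(tokens_padded)
score_raw               # 13 / 14 = 0.9285714

# Trimming the ends and splitting on runs of whitespace removes the empty token
tokens_fixed <- strsplit(trimws(padded), "\\s+")[[1]]
score_fixed  <- length(intersect(tokens_fixed, tokens_clean)) / length(tokens_fixed)
score_fixed             # 1
```

In the tidyverse version, str_squish does the same job as trimws plus splitting on whitespace runs: it trims both ends and collapses repeated internal spaces before the split.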

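【Solution 2】:

One way to do this is to use a for loop that iterates over each row of the data frame and computes the similarity percentage with the functions from the other thread.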

df <- structure(list(
  trial = 1:10,
  answer = structure(c(9L, 2L, 4L, 7L, 1L, 5L, 3L, 6L, 8L, 10L), .Label = c(
    "Best way to waste money",
    "He ran out of money, so he had to stop playing poker",
    "I go to work",
    "Lets all be unique together until we realise we are all the same",
    "Roy travels to Africa",
    "She borrowed the book from him many years ago and did not returned it yet",
    "She did her best to help him",
    "Students did not cheat on the test, for it was not the right thing to do",
    "The stranger officiates the meal",
    "We have a lot of rain in June"
  ), class = "factor"),
  given = structure(c(10L, 3L, 6L, 8L, 4L, 2L, 1L, 7L, 9L, 5L), .Label = c(
    "",
    "He is in Africa Roy",
    "He lost money because he had played poker",
    "Instrument to waste money and time",
    "It was raining in June",
    "People are unique until they try to fit in",
    "She borrowed the book from the library and forgot to return it",
    "She did her very best to help him out",
    "Students know not to cheat",
    "The guests ate the meal"
  ), class = "factor")
), class = "data.frame", row.names = c(NA, -10L))

# Drop the information that presumably isn't important (punctuation, capital
# letters), then split each string into its separate words.
# Note: this definition masks base::format() in the calling environment.
format <- function(string1) {
  lower <- tolower(string1)
  no.punct <- gsub("[[:punct:]]", "", lower)
  split <- strsplit(no.punct, split = " ")
  return(split)
}

# How similar is string 1 to string 2?
# NOTE: the order is important, i.e. sim.per(b, c) is different from sim.per(c, b)
sim.per <- function(str1, str2, ...) {
  sim <- length(intersect(str1[[1]], str2[[1]]))  # number of distinct words found in both
  total <- length(str1[[1]])                      # number of words in string 1
  per <- sim / total
  return(per)
}

df$similarity <- 0
for (i in seq_len(nrow(df))) {
  # Score a row only when both values are present; other rows keep the default of 0
  if (!is.na(df$answer[i]) && !is.na(df$given[i])) {
    df$similarity[i] <- sim.per(format(df$answer[i]), format(df$given[i]))
  }
}

df
   trial                                                                    answer                                                          given similarity
1      1                                          The stranger officiates the meal                                        The guests ate the meal  0.4000000
2      2                      He ran out of money, so he had to stop playing poker                      He lost money because he had played poker  0.3333333
3      3          Lets all be unique together until we realise we are all the same                     People are unique until they try to fit in  0.2307692
4      4                                              She did her best to help him                          She did her very best to help him out  1.0000000
5      5                                                   Best way to waste money                             Instrument to waste money and time  0.6000000
6      6                                                     Roy travels to Africa                                            He is in Africa Roy  0.5000000
7      7                                                              I go to work                                                                 0.0000000
8      8 She borrowed the book from him many years ago and did not returned it yet She borrowed the book from the library and forgot to return it  0.4666667
9      9  Students did not cheat on the test, for it was not the right thing to do                                     Students know not to cheat  0.2500000
10    10                                             We have a lot of rain in June                                         It was raining in June  0.2500000

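【Discussion】:

This code works for the first few rows, and the remaining rows get a score of 0. How can I modify the code to fix this? I have added the sample dataset I ran this code on
Sorry, I had the order wrong; see the edited answer.
Thanks! I have tried the edited code on my larger dataset (about 100 sentences from 30 participants = ~3000 data points) and for the most part it is very accurate, but some of the calculations are still wrong (best guess, 15-20% of the data). It may not come from your code but from the original link... do you have any ideas?
Without seeing the data, I can't tell why some of the calculations don't work. Can you post some of the cases that are calculated incorrectly?
I figured it out, thanks! Some of the sentences had leading/trailing whitespace (which only became apparent when I used dput() to build the sample cases). I had wrongly assumed that the code for removing all the other punctuation would also remove the whitespace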
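Following up on the whitespace issue found above, here is a hedged base R sketch of a whitespace-tolerant variant. The names split_words, sim_per_safe and df_demo are hypothetical and not part of either answer; the sketch assumes that missing or empty text should score 0, as requested in the question, and it uses mapply so no explicit loop is needed.

```r
# Hypothetical helper: normalise a string and split it into words
split_words <- function(x) {
  x <- tolower(as.character(x))     # as.character() also handles factor input
  x <- gsub("[[:punct:]]", "", x)   # drop punctuation
  x <- trimws(x)                    # drop leading and trailing whitespace
  strsplit(x, "\\s+")               # split on runs of whitespace
}

# Hypothetical scorer: share of the words in str1 that also occur in str2
sim_per_safe <- function(str1, str2) {
  if (is.na(str1) || is.na(str2)) return(0)           # missing text scores 0
  w1 <- split_words(str1)[[1]]
  w2 <- split_words(str2)[[1]]
  if (length(w1) == 0 || length(w2) == 0) return(0)   # empty text scores 0
  length(intersect(w1, w2)) / length(w1)
}

# Small made-up data frame with stray spaces and an empty response
df_demo <- data.frame(
  answer = c("  Best way to waste money ", "Roy travels to Africa", "I go to work"),
  given  = c("Instrument to  waste money and time", "He is in Africa", ""),
  stringsAsFactors = FALSE
)

# Apply the scorer row by row and store the result as a new column
df_demo$similarity <- mapply(sim_per_safe, df_demo$answer, df_demo$given,
                             USE.NAMES = FALSE)
df_demo$similarity   # 0.60 0.25 0.00
```

The three scores match the expected output shown in the question, including 0 for the empty response.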