【Question title】: How can I match strings with at least one word in common in R?
【Posted】: 2021-07-05 15:03:51
【Question description】:

Example of data frame 1:

NAME                         CITY  STATE   SURNAME
Maria Antonia Sousa          A     X       Antonia Sousa
Josep Oliveira Carlos        A     X       Oliveira Carlos
Jose Mario Augusto Farias    B     Y       Augusto Farias
Andre Gois Lucas             B     Y       Gois Lucas

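I want to create a column familyDummy in the second data frame that flags people who share at least one surname with someone in the first data frame, but only if they are from the same city and state. The same person may appear in both data frames, and I do not want to count a person as their own relative. The two data frames have different lengths.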

Example of data frame 2:

NAME                     CITY   STATE     SURNAME           familyDummy
Maria Antonia Sousa      A      X         Antonia Sousa     0
Angela Oliveira Santos   A      X         Oliveira Santos   1
Fabio Silva Carlos       B      Y         Silva Carlos      0
Luan Gois Lucas          B      Y         Gois Lucas        1

Thanks for your help.

【Question discussion】:

    Tags: r string parsing text


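    【Solution 1】:

    Here is one solution to your problem. It first splits the SURNAME column of df1 and df2 into two separate surnames so that each can be matched individually (see df1_bis and df2_bis). It then loops over every entry of df2 and, for each one, finds the df1 entries that have a different NAME (so the same person is not counted) and share at least one surname. If any df1 entries meet both conditions, it checks whether their CITY and STATE match those of the df2 entry. If they do, familyDummy is set to 1; otherwise it is set to 0.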

    library(tidyverse)
    
    # Your data
    df1 <-structure(list(NAME = c("Maria Antonia Sousa", "Josep Oliveira Carlos", 
    "Jose Mario Augusto Farias", "Andre Gois Lucas"), CITY = c("A", 
    "A", "B", "B"), STATE = c("X", "X", "Y", "Y"), SURNAME = c("Antonia Sousa", 
    "Oliveira Carlos", "Augusto Farias", "Gois Lucas")), class = "data.frame", row.names = c(NA, 
    -4L))
    
    df2 <- structure(list(NAME = c("Maria Antonia Sousa", "Angela Oliveira Santos", 
    "Fabio Silva Carlos", "Luan Gois Lucas"), CITY = c("A", "A", 
    "B", "B"), STATE = c("X", "X", "Y", "Y"), SURNAME = c("Antonia Sousa", 
    "Oliveira Santos", "Silva Carlos", "Gois Lucas"), familyDummy = c(0L, 
    1L, 0L, 1L)), class = "data.frame", row.names = c(NA, -4L))
    
    # Divide surnames
    df1_bis <- df1 %>%
      # Divide SURNAME into two surnames to check independently for each single surname
      # ([A-Za-z] rather than [A-z], which would also match characters such as "[" and "_")
      mutate(surname1 = str_extract(SURNAME,"[A-Za-z]+(?=\\s)"),
             surname2 = str_extract(SURNAME,"(?<=\\s)[A-Za-z]+"))
    
    df2_bis <- df2 %>%
      # Divide SURNAME into two surnames to check independently for each single surname
      mutate(surname1 = str_extract(SURNAME,"[A-Za-z]+(?=\\s)"),
             surname2 = str_extract(SURNAME,"(?<=\\s)[A-Za-z]+"))
    
    df2 %>%
      # Add the result as another column
      # Use map_dbl to cycle over each row in df2 and return a numeric vector (map would return a list)
      mutate(familyDummy = map_dbl(seq_len(nrow(df2_bis)), function(i){
        # Compare this df2 NAME with every NAME in df1: TRUE where the df1 NAME is not found in it
        dif_name = str_detect(df2_bis$NAME[i], df1_bis$NAME, negate = TRUE)
    
        # Check if either surname of this df2 row matches either surname in df1. If so, assign 1, if not, 0.
        surname_same = ifelse(str_detect(df2_bis$surname1[i], df1_bis$surname1) |
                                str_detect(df2_bis$surname1[i], df1_bis$surname2) |
                                str_detect(df2_bis$surname2[i], df1_bis$surname1) |
                                str_detect(df2_bis$surname2[i], df1_bis$surname2), 1, 0)
    
        # Get the indices in df1 of the cases that meet the two previous criteria
        temp <- which(dif_name == 1 & surname_same == 1)
    
        # Check if there are cases where at least one entry matches the two criteria
        if(length(temp) >= 1){
          # Check that city and state both match within the same df1 row
          # any() is used because there might be more than 1 match
          familyDummy = ifelse(any(df1_bis$CITY[temp] == df2_bis$CITY[i] & df1_bis$STATE[temp] == df2_bis$STATE[i]), 1, 0)
        }else{ # If no case matches the previous two criteria, return 0
          familyDummy = 0
        }
        return(familyDummy)
        }))
    
    #                    NAME CITY STATE         SURNAME familyDummy
    #1    Maria Antonia Sousa    A     X   Antonia Sousa           0
    #2 Angela Oliveira Santos    A     X Oliveira Santos           1
    #3     Fabio Silva Carlos    B     Y    Silva Carlos           0
    #4        Luan Gois Lucas    B     Y      Gois Lucas           1
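
As a possible alternative to the row-by-row loop above, the same rule can be sketched in base R by splitting each SURNAME on whitespace and intersecting the resulting word sets. This is only a sketch under the question's stated rules: `flag_family`, `new`, and `ref` are hypothetical names that do not appear in the original answer, and it assumes the columns are called NAME, CITY, STATE, and SURNAME as in the example data.

```r
# Hypothetical helper (not from the original answer): flags rows of `new`
# that share at least one surname word with a different person in `ref`
# who lives in the same CITY and STATE.
flag_family <- function(new, ref) {
  # Split each SURNAME into its words; works for any number of surnames
  new_words <- strsplit(trimws(new$SURNAME), "\\s+")
  ref_words <- strsplit(trimws(ref$SURNAME), "\\s+")

  vapply(seq_len(nrow(new)), function(i) {
    # Candidate relatives: same city and state, but a different full name
    cand <- which(ref$CITY == new$CITY[i] &
                    ref$STATE == new$STATE[i] &
                    ref$NAME != new$NAME[i])
    # TRUE for each candidate that shares at least one surname word
    shares <- vapply(ref_words[cand],
                     function(w) length(intersect(w, new_words[[i]])) > 0,
                     logical(1))
    as.integer(any(shares))
  }, integer(1))
}

# Example data from the question
df1 <- data.frame(
  NAME    = c("Maria Antonia Sousa", "Josep Oliveira Carlos",
              "Jose Mario Augusto Farias", "Andre Gois Lucas"),
  CITY    = c("A", "A", "B", "B"),
  STATE   = c("X", "X", "Y", "Y"),
  SURNAME = c("Antonia Sousa", "Oliveira Carlos", "Augusto Farias", "Gois Lucas")
)
df2 <- data.frame(
  NAME    = c("Maria Antonia Sousa", "Angela Oliveira Santos",
              "Fabio Silva Carlos", "Luan Gois Lucas"),
  CITY    = c("A", "A", "B", "B"),
  STATE   = c("X", "X", "Y", "Y"),
  SURNAME = c("Antonia Sousa", "Oliveira Santos", "Silva Carlos", "Gois Lucas")
)

df2$familyDummy <- flag_family(df2, df1)
df2$familyDummy
# 0 1 0 1
```

Because the surnames are split on whitespace rather than matched with a letter class, this version also copes with accented characters and with more than two surnames. It compares whole words exactly, so "Carlos" and "Carlo" would not count as a match.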
    

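    【Discussion】:

    • I'm trying to run this code on two other dfs, but I get an error: Error: Problem with mutate() input familyDummy. x Input familyDummy can't be recycled to size 2. i Input familyDummy is map(...). i Input familyDummy must be size 2 or 1, not 10825. i The error occurred in group 1: CITY = "X", STATE = "Y". Could you help me solve this? Thanks in advance.
    • Hmm, it should work fine with the other two dfs. Try checking whether there are differences between the first two dfs and the second two (e.g. differences in column names).
    • V. Solózano, I was wondering if you could help me adapt this code to a different situation. I now have five different dfs, one per year (2000, 2004, 2008, 2012, and 2016), and I decided to join them together, so I now have an unbalanced panel. I would like the variable "familyDummy" to identify, for each person in each city, relatives from previous years. I also now have a unique code that identifies each city.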
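
The recycling error quoted in the first comment mentions "group 1: CITY = "X", STATE = "Y"", which suggests that the data frame was still grouped (for example by an earlier group_by(CITY, STATE)) when mutate() was called. Inside a grouped mutate(), each result has to match the size of the current group, so a vector as long as the whole data frame cannot be assigned. A minimal sketch of that situation follows, using made-up data: `other_df`, `flat_df`, and `row_id` are hypothetical names, and whether this matches the commenter's actual data is an assumption.

```r
library(dplyr)

# Toy stand-in for the commenter's data (hypothetical values)
other_df <- data.frame(
  NAME  = c("Ana Silva Costa", "Rui Costa Lima", "Eva Reis Melo"),
  CITY  = c("X", "X", "Z"),
  STATE = c("Y", "Y", "W")
) %>%
  group_by(CITY, STATE)  # grouping left over from an earlier step

# A grouped data frame makes mutate() work one group at a time
is_grouped_df(other_df)

# Dropping the grouping first lets a full-length vector be assigned
flat_df <- other_df %>%
  ungroup() %>%
  mutate(row_id = seq_len(nrow(other_df)))

is_grouped_df(flat_df)
```

If this is the cause, calling ungroup() on the data frame before running the solution's mutate() step should remove the error.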