【Question title】:stringr package using str_detect - Search for one word and exclude word
【Posted】:2021-12-14 21:12:06
【Question】:

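I have a sample project that requires searching strings with the stringr package. In the example, to eliminate other case spellings, I start with str_to_lower(example$remarks), which makes the remarks all lowercase. The remarks column describes residential properties.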

I need to search for the word "shop". However, the word "shopping" also appears in the remarks column, and I do not want that word.

Some observations: a) contain only the word "shop"; b) contain only the word "shopping"; c) contain neither "shop" nor "shopping"; d) contain both "shop" and "shopping".

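When using str_detect(), I want it to return TRUE when it detects the word "shop", but not when it detects the string "shop" inside the word "shopping". Currently, if I run str_detect(example$remarks, "shop"), I get TRUE for both "shop" and "shopping". In effect, I only want TRUE for the 4-character string "shop"; if the characters "shop" are followed by any other characters, as in shop(ping), I want the code to skip that match rather than report it as TRUE.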

Also, if a remark contains both the words "shop" and "shopping", I want the result to be TRUE, based on detecting "shop" only and not "shopping".

Ultimately, I am hoping for one line of code using str_detect() that gives me the following results:

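  1. If the remark contains only the word "shop" = TRUE
  2. If the remark contains only the word "shopping" = FALSE
  3. If the remark contains neither "shop" nor "shopping" = FALSE
  4. If the remark contains both "shop" and "shopping" = TRUE, because the 4-character string "shop" is detected on its own, not because of the word "shopping"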

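I need to keep all observations in the data set and cannot exclude any of them, because I need to create a new column, which I have labeled shop_YN, that gives "Yes" for observations containing the standalone 4-character string "shop". Once I have the correct str_detect() code, I plan to wrap the result in mutate() and if_else(), as follows (except that I do not know what pattern to use in str_detect() to get the result I need):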

shop_YN <- example %>% mutate(shop_YN = if_else(str_detect(example$remarks, ), "Yes", "No"))

Here is a sample of the data, produced with dput():

structure(list(price = c(195000, 213000, 215000, 240000, 241000, 
                         250000, 255000, 256500, 260000, 263500, 265000, 277000, 280000, 
                         280000, 150000), remarks = c("large home with a 1200 sf shop. great location close to shopping.", 
                                                      "updated home close to shopping & schools.", "nice location. 2br home with updating.", 
                                                      "huge shop on property!", "close to shopping.", "updated, clean, great location, garage.", 
                                                      "close to shopping and massive shop on property.", "updated home near shopping, schools, restaurants.", 
                                                      "large home with updated interior.", "close to schools, updated, stick-built shop 1500sf.", 
                                                      "home and shop.", "near schools, shopping, restaurants. partially updated home.", 
                                                      "located close to shopping. high quality home with shop in backyard.", 
                                                      "brick 2-story. lots of shopping near by. detached garage and large shop in backyard.", 
                                                      "fixer! needs work.")), row.names = c(NA, -15L), class = c("tbl_df", 
                                                                                                                 "tbl", "data.frame"))

【Comments】:

Tags: r string stringr


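【Solution 1】:

You are probably looking for word boundaries (\\b) here. Wrap the desired pattern between two word boundaries so that it matches only the whole word, not part of a longer word.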

library(dplyr)
library(stringr)

df %>% mutate(shop_YN = str_detect(remarks, '\\bshop\\b'))

# A tibble: 15 × 3
    price remarks                                                                          shop_YN
    <dbl> <chr>                                                                            <lgl>  
 1 195000 large home with a 1200 sf shop. great location close to shopping.                TRUE   
 2 213000 updated home close to shopping & schools.                                        FALSE  
 3 215000 nice location. 2br home with updating.                                           FALSE  
 4 240000 huge shop on property!                                                           TRUE   
 5 241000 close to shopping.                                                               FALSE  
 6 250000 updated, clean, great location, garage.                                          FALSE  
 7 255000 close to shopping and massive shop on property.                                  TRUE   
 8 256500 updated home near shopping, schools, restaurants.                                FALSE  
 9 260000 large home with updated interior.                                                FALSE  
10 263500 close to schools, updated, stick-built shop 1500sf.                              TRUE   
11 265000 home and shop.                                                                   TRUE   
12 277000 near schools, shopping, restaurants. partially updated home.                     FALSE  
13 280000 located close to shopping. high quality home with shop in backyard.              TRUE   
14 280000 brick 2-story. lots of shopping near by. detached garage and large shop in back… TRUE   
15 150000 fixer! needs work.                                                               FALSE
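
As a quick check of the four cases listed in the question, the following minimal sketch compares the plain pattern with the word-boundary pattern. The vector name cases is hypothetical; its four strings are remarks taken from the sample data.

```r
library(stringr)

# Four remarks from the sample data, one per case in the question:
# a) "shop" only, b) "shopping" only, c) neither, d) both
cases <- c(
  "huge shop on property!",
  "close to shopping.",
  "fixer! needs work.",
  "close to shopping and massive shop on property."
)

# Plain pattern: also matches the "shop" inside "shopping"
plain <- str_detect(cases, "shop")

# Word-boundary pattern: matches "shop" only as a whole word
whole <- str_detect(cases, "\\bshop\\b")

print(plain)  # TRUE  TRUE FALSE  TRUE
print(whole)  # TRUE FALSE FALSE  TRUE
```

Because of the boundaries, this pattern also skips words such as "shops" or "workshop"; if those should count, the pattern would need to be adjusted.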

If you want Yes/No instead of a logical shop_YN, just pipe the output of str_detect into ifelse:

df %>% mutate(shop_YN = str_detect(remarks, '\\bshop\\b') %>% ifelse('Yes', 'No'))
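
The same idea can be written in the form the question sketched, with mutate() and if_else(). This is a minimal sketch, assuming the data frame is called example as in the question and shortened here to four rows of the sample data. Inside mutate() the column can be referenced as remarks rather than example$remarks.

```r
library(dplyr)
library(stringr)

# Shortened stand-in for the question's data (four of the fifteen rows)
example <- tibble(
  price = c(240000, 241000, 150000, 255000),
  remarks = c(
    "huge shop on property!",
    "close to shopping.",
    "fixer! needs work.",
    "close to shopping and massive shop on property."
  )
)

# Keep every row; add a Yes/No column based on the whole word "shop"
example <- example %>%
  mutate(shop_YN = if_else(str_detect(remarks, "\\bshop\\b"), "Yes", "No"))

print(example$shop_YN)  # "Yes" "No" "No" "Yes"
```

All rows stay in the data set; only the new shop_YN column is added.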

【Comments】:

    【Solution 2】:

    We can also use grepl instead of str_detect:

    df %>% 
      mutate(check = grepl("\\bshop\\b", remarks))
    
        price remarks                                                                              check
        <dbl> <chr>                                                                                <lgl>
     1 195000 large home with a 1200 sf shop. great location close to shopping.                    TRUE 
     2 213000 updated home close to shopping & schools.                                            FALSE
     3 215000 nice location. 2br home with updating.                                               FALSE
     4 240000 huge shop on property!                                                               TRUE 
     5 241000 close to shopping.                                                                   FALSE
     6 250000 updated, clean, great location, garage.                                              FALSE
     7 255000 close to shopping and massive shop on property.                                      TRUE 
     8 256500 updated home near shopping, schools, restaurants.                                    FALSE
     9 260000 large home with updated interior.                                                    FALSE
    10 263500 close to schools, updated, stick-built shop 1500sf.                                  TRUE 
    11 265000 home and shop.                                                                       TRUE 
    12 277000 near schools, shopping, restaurants. partially updated home.                         FALSE
    13 280000 located close to shopping. high quality home with shop in backyard.                  TRUE 
    14 280000 brick 2-story. lots of shopping near by. detached garage and large shop in backyard. TRUE 
    15 150000 fixer! needs work.                                                                   FALSE
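
For completeness, here is a base-R-only sketch of the Yes/No column, assuming no packages are loaded. The data frame name example follows the question, and the four rows are taken from the sample data.

```r
# Shortened stand-in for the question's data, base R only
example <- data.frame(
  remarks = c(
    "home and shop.",
    "near schools, shopping, restaurants. partially updated home.",
    "large home with updated interior.",
    "located close to shopping. high quality home with shop in backyard."
  ),
  stringsAsFactors = FALSE
)

# grepl() returns a logical vector; ifelse() maps it to "Yes"/"No"
example$shop_YN <- ifelse(grepl("\\bshop\\b", example$remarks), "Yes", "No")

print(example$shop_YN)  # "Yes" "No" "No" "Yes"
```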
    

    【Comments】:

    • Thank you for this alternative as well! I have not used grepl yet, but I will explore this option too.