【Question title】: Replace quanteda tokens through regex
【Posted】: 2021-03-16 08:19:34
【Question】:

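I want to explicitly replace specific tokens in an object of the quanteda package's tokens class. I have not been able to replicate the standard approach that works with stringr.

The goal is to replace every token of the form "XXXof" with two tokens of the form c("XXX", "of").

Please see the minimal example below: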

suppressPackageStartupMessages(library(quanteda))
library(stringr)

text = "It was a beautiful day down to the coastof California."

# I would solve this with stringr as follows: 
text_stringr = str_replace( text, "(^.*?)(of)", "\\1 \\2" )
text_stringr
#> [1] "It was a beautiful day down to the coast of California."

# I fail to find a similar solution with quanteda that works on objects of class tokens
tok = tokens( text )

# I want to replace "coastof" with the two tokens "coast" and "of"
tokens_replace( tok, "(^.*?)(of)", "\\1 \\2", valuetype = "regex" )
#> Tokens consisting of 1 document.
#> text1 :
#>  [1] "It"         "was"        "a"          "beautiful"  "day"       
#>  [6] "down"       "to"         "the"        "\\1 \\2"    "California"
#> [11] "."

Is there a workaround?

Created on 2021-03-16 by the reprex package (v1.0.0)

【Discussion】:

    Tags: r regex quanteda


    【Solution 1】:

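    You can take a hybrid approach: build a list of the words that need to be split together with their split forms, then use tokens_replace() to perform the replacement. The advantage is that you can curate the list before applying it, so you can check that you have not picked up replacements you would not want to apply.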

    suppressPackageStartupMessages(library("quanteda"))
    
    toks <- tokens("It was a beautiful day down to the coastof California.")
    
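    # Collect the tokens that match the pattern; these are the keys to replace.
    keys <- unique(as.character(tokens_select(toks, "(^.*?)(of)", valuetype = "regex")))
    # Insert a space before "of", then split each key into its two tokens.
    # (Written as a nested call so that no pipe operator needs to be attached.)
    vals <- strsplit(stringr::str_replace(keys, "(^.*?)(of)", "\\1 \\2"), " ")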
    
    keys
    ## [1] "coastof"
    vals
    ## [[1]]
    ## [1] "coast" "of"
    
    tokens_replace(toks, keys, vals)
    ## Tokens consisting of 1 document.
    ## text1 :
    ##  [1] "It"         "was"        "a"          "beautiful"  "day"       
    ##  [6] "down"       "to"         "the"        "coast"      "of"        
    ## [11] "California" "."
    

    Created on 2021-03-16 by the reprex package (v1.0.0)
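The curation step mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the `candidates` vector and the `keep_intact` exclusion list are made up for the example, and the stricter pattern `^(.+)of$` (which only matches a trailing "of" with at least one character before it) is a substitute for the pattern used in the answer, not part of it. The key/value construction uses base R only; the final `tokens_replace()` call mirrors the one in the answer and runs only if quanteda is installed.

```r
# Hypothetical candidate tokens; in practice these could come from
# as.character(tokens_select(toks, "of$", valuetype = "regex")).
candidates <- c("coastof", "of", "proof", "kindof", "coastof")

# Hypothetical list of real words ending in "of" that must not be split.
keep_intact <- c("of", "proof", "roof", "hoof", "thereof", "whereof")

# Deduplicate the candidates and drop the protected words.
keys <- setdiff(unique(candidates), keep_intact)

# Split each key into its stem and the trailing "of".
# "^(.+)of$" requires at least one character before the final "of".
vals <- strsplit(sub("^(.+)of$", "\\1 of", keys), " ", fixed = TRUE)

# Inspect the pairs before applying them.
print(keys)
print(vals)

# Apply the curated replacements, assuming quanteda is available.
if (requireNamespace("quanteda", quietly = TRUE)) {
  toks <- quanteda::tokens("It was a beautiful day down to the coastof California.")
  print(quanteda::tokens_replace(toks, keys, vals))
}
```

Keeping the exclusion list explicit makes it easy to review which words are deliberately left alone, and keys that never occur in the tokens object are simply not matched.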

    【Comments】:
