【Question title】: How to create interactions with quanteda?
【Posted】: 2021-06-15 22:46:08
【Question】:

Consider the following example:

library(quanteda)
library(tidyverse)

tibble(text = c('the dog is growing tall',
                'the grass is growing as well')) %>% 
  corpus() %>% dfm()
Document-feature matrix of: 2 documents, 8 features (31.2% sparse).
       features
docs    the dog is growing tall grass as well
  text1   1   1  1       1    1     0  0    0
  text2   1   0  1       1    0     1  1    1

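I would like to create interactions between dog and the other tokens in each sentence. That is, create the features the-dog, is-dog, growing-dog and tall-dog, and add them to the dfm (on top of the features we already have).

For example, the-dog equals 1 if the and dog both appear in the sentence, and 0 otherwise. So the-dog is 1 for the first sentence and 0 for the second.

Note that I only create interaction terms when dog appears in the sentence, so dog-grass is not needed here.

How can I do this efficiently in quanteda?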

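【Comments】:

  • What format of output do you want? A matrix of documents by words co-occurring with dog, without counting the other features?
  • Thanks @KenBenoit. I think a dfm would be great. So in our example the dfm would have the columns the, dog, is, growing, tall, grass, as, well, plus the-dog, is-dog, growing-dog, tall-dog. I was thinking these variables could be created at the tokens() level, but I am not sure how (tokens_skipgram() would create many irrelevant interactions here).
  • So the rule is: if a sentence (i.e. a quanteda document) contains dog, interact every token in that sentence with dog.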

Tags: r quanteda


【Solution 1】:
library("quanteda")
## Package version: 2.1.2

toks <- tokens(c(
  "the dog is growing tall",
  "the grass is growing as well"
))

# now keep just tokens co-occurring with "dog"
toks_dog <- tokens_select(toks, "dog", window = 1e5)

# create the dfm and label other terms as interactions with dog
dfmat_dog <- dfm(toks_dog) %>%
  dfm_remove("dog")
colnames(dfmat_dog) <- paste(featnames(dfmat_dog), "dog", sep = "-")
dfmat_dog
## Document-feature matrix of: 2 documents, 4 features (50.00% sparse) and 0 docvars.
##        features
## docs    the-dog is-dog growing-dog tall-dog
##   text1       1      1           1        1
##   text2       0      0           0        0

# combine with other features
print(cbind(dfm(toks), dfmat_dog), max_nfeat = -1)
## Document-feature matrix of: 2 documents, 12 features (37.50% sparse) and 0 docvars.
##        features
## docs    the dog is growing tall grass as well the-dog is-dog growing-dog
##   text1   1   1  1       1    1     0  0    0       1      1           1
##   text2   1   0  1       1    0     1  1    1       0      0           0
##        features
## docs    tall-dog
##   text1        1
##   text2        0

Created on 2021-03-18 by the reprex package (v1.0.0)
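
The steps above can be wrapped into a small helper so that the target word is a parameter. This is only a sketch: the function name add_interactions() and its arguments are hypothetical (they are not part of quanteda), and it assumes the target is a single word matched with quanteda's default glob, case-insensitive matching. The early return covers the case where the target never occurs, since paste() on an empty feature vector would otherwise produce a stray "-dog" label.

```r
library("quanteda")

# Hypothetical helper, not a quanteda function: builds "<feature>-<target>"
# interaction columns for documents that contain `target` and appends them
# to the ordinary dfm.
add_interactions <- function(toks, target) {
  dfmat_all <- dfm(toks)

  # keep only tokens from documents in which `target` occurs
  toks_target <- tokens_select(toks, target, window = 1e5)
  dfmat_target <- dfm_remove(dfm(toks_target), target)

  # nothing co-occurs with the target: return the plain dfm
  if (nfeat(dfmat_target) == 0) {
    return(dfmat_all)
  }

  colnames(dfmat_target) <- paste(featnames(dfmat_target), target, sep = "-")
  cbind(dfmat_all, dfmat_target)
}

toks <- tokens(c(
  "the dog is growing tall",
  "the grass is growing as well"
))

dfmat_inter <- add_interactions(toks, "dog")
print(dfmat_inter, max_nfeat = -1)
```

Like the code above, this counts co-occurring tokens rather than producing strict 0/1 indicators; if indicators are needed, dfm_weight(dfmat_inter, scheme = "boolean") is one option.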

【Comments】:

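  • I understand that if the two dfms in the cbind call contain overlapping column names (i.e. words), this can cause problems later. Can we cbind dfms and avoid the duplicates?
  • Yes, just call dfm_compress() on the cbind-ed dfm and it will combine the duplicated features.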
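
A minimal sketch of the dfm_compress() suggestion. To force an overlap, it cbinds a dfm with a copy of itself, which is purely an illustration and not something from the thread. Note that dfm_compress() sums the counts of identically named features, so merged columns hold combined counts.

```r
library("quanteda")

toks <- tokens(c(
  "the dog is growing tall",
  "the grass is growing as well"
))

dfmat_a <- dfm(toks)
dfmat_b <- dfm(toks)  # deliberately shares every feature name with dfmat_a

# cbind() keeps both copies of each overlapping feature
# (quanteda may warn about the overlap, hence suppressWarnings())
dfmat_dup <- suppressWarnings(cbind(dfmat_a, dfmat_b))

# merge identically named columns by summing their counts
dfmat_merged <- dfm_compress(dfmat_dup, margin = "features")

print(dfmat_merged, max_nfeat = -1)
```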