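The conversion of the dfm is dropping empty "documents", which you probably have because features were removed through frequency trimming or pattern matching (e.g. removing stopwords). LDA cannot handle empty documents, so by default they are removed when converting to the LDA formats ("topicmodels", "stm", etc.).
As of v1.5, convert() has an option called omit_empty, which defaults to TRUE. Set it to FALSE if you want to keep the zero-feature documents.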
library("quanteda")
## Package version: 1.5.1
txt <- c("one two three", "and or but", "four five")
dfmat <- tokens(txt) %>%
    tokens_remove(stopwords("en")) %>%
    dfm()
dfmat
## Document-feature matrix of: 3 documents, 5 features (66.7% sparse).
## 3 x 5 sparse Matrix of class "dfm"
##        features
## docs    one two three four five
##   text1   1   1     1    0    0
##   text2   0   0     0    0    0
##   text3   0   0     0    1    1
Here is the difference that setting omit_empty = FALSE makes:
# with and without the empty documents
convert(dfmat, to = "topicmodels")
## <<DocumentTermMatrix (documents: 2, terms: 5)>>
## Non-/sparse entries: 5/5
## Sparsity : 50%
## Maximal term length: 5
## Weighting : term frequency (tf)
convert(dfmat, to = "topicmodels", omit_empty = FALSE)
## <<DocumentTermMatrix (documents: 3, terms: 5)>>
## Non-/sparse entries: 5/10
## Sparsity : 67%
## Maximal term length: 5
## Weighting : term frequency (tf)
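Before converting, it can help to check which documents would be dropped. The following is a minimal sketch, assuming the same toy texts as above; the name empty_docs is only illustrative, and the pipe is written out as separate steps so the block does not depend on `%>%` being available.

```r
library("quanteda")

# Rebuild the toy dfm: "and or but" becomes empty once stopwords are removed
txt <- c("one two three", "and or but", "four five")
toks <- tokens(txt)
toks <- tokens_remove(toks, stopwords("en"))
dfmat <- dfm(toks)

# ntoken() returns the token count per document;
# documents with a count of zero are the ones convert() omits by default
empty_docs <- docnames(dfmat)[ntoken(dfmat) == 0]
print(empty_docs)
```

For these texts, only the second document (text2) should be reported as empty.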
Finally, if you want to subset a dfm to remove the empty documents, just use dfm_subset(). The second argument is coerced to a logical, which is TRUE when ntoken(dfmat) > 0 and FALSE when it is 0.
# subset dfm to remove the empty documents
dfm_subset(dfmat, ntoken(dfmat))
## Document-feature matrix of: 2 documents, 5 features (50.0% sparse).
## 2 x 5 sparse Matrix of class "dfm"
##        features
## docs    one two three four five
##   text1   1   1     1    0    0
##   text3   0   0     0    1    1
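The same subset can also be written with an explicit logical condition, which may be easier to read than relying on coercion. This is a sketch under the same assumptions as the example above; dfmat_nonempty is an illustrative name.

```r
library("quanteda")

# Rebuild the toy dfm with one document that ends up empty
txt <- c("one two three", "and or but", "four five")
toks <- tokens_remove(tokens(txt), stopwords("en"))
dfmat <- dfm(toks)

# Keep only documents that still contain at least one token
dfmat_nonempty <- dfm_subset(dfmat, ntoken(dfmat) > 0)

# The remaining document names and count
print(docnames(dfmat_nonempty))
print(ndoc(dfmat_nonempty))
```

Spelling out the comparison makes the intent visible at the call site and avoids depending on how a count vector is coerced to logical.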