【Posted】:2017-01-09 16:20:58
【Problem description】:
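I am trying to create trigrams and bigrams from a large (1 GB) text file using the 'quanteda' package in the R programming environment. If I run my code (shown below) in one go, R hangs (at line 3 - myCorpus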
folder.dataset.english <- 'final/corpus'
myCorpus <- corpus(x=textfile(list.files(path = folder.dataset.english, pattern = "\\.txt$", full.names = TRUE, recursive = FALSE))) # build the corpus
myCorpus <- toLower(myCorpus, keepAcronyms = TRUE)
# bigrams
bigrams <- dfm(myCorpus, ngrams = 2, verbose = TRUE, toLower = TRUE,
               removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
               removeTwitter = TRUE, stem = FALSE)
# total count per bigram, most frequent first
bigrams_freq <- sort(colSums(bigrams), decreasing = TRUE)
bigrams <- data.frame(names = names(bigrams_freq), freq = bigrams_freq, stringsAsFactors = FALSE)
# split "w1_w2" into its first and last word
bigrams$first <- sapply(strsplit(bigrams$names, "_"), "[[", 1)
bigrams$last <- sapply(strsplit(bigrams$names, "_"), "[[", 2)
rownames(bigrams) <- NULL
# frequency of frequencies
bigrams.freq.freq <- table(bigrams$freq)
saveRDS(bigrams, "dictionaries/bigrams.rds")
# trigrams
trigrams <- dfm(myCorpus, ngrams = 3, verbose = TRUE, toLower = TRUE,
                removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
                removeTwitter = TRUE, stem = FALSE)
# total count per trigram, most frequent first
trigrams_freq <- sort(colSums(trigrams), decreasing = TRUE)
trigrams <- data.frame(names = names(trigrams_freq), freq = trigrams_freq, stringsAsFactors = FALSE)
# split "w1_w2_w3" once, then keep "w1_w2" as the prefix and "w3" as the last word
trigram_parts <- strsplit(trigrams$names, "_")
trigrams$first <- paste(sapply(trigram_parts, "[[", 1), sapply(trigram_parts, "[[", 2), sep = "_")
trigrams$last <- sapply(trigram_parts, "[[", 3)
rownames(trigrams) <- NULL
saveRDS(trigrams, "dictionaries/trigrams.rds")
【Discussion】:
-
At what point does your code hang?
-
It hangs at myCorpus
-
Does your code run successfully on a smaller dataset?
-
Yes, I ran it successfully on a small dataset
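
Since the code reportedly works on a small dataset, one way to narrow down the hang might be to process the large file in chunks rather than loading it all at once. The following is only a minimal base R sketch under stated assumptions, not the quanteda approach from the question: `count_bigrams_chunked`, `chunk_size`, and the simple letters-only cleaning rule are hypothetical choices made for illustration, and the cleaning only approximates what `removeNumbers` and `removePunct` do.

```r
# Hypothetical helper (not part of quanteda): count bigrams in a text file
# by reading it chunk_size lines at a time, so the whole file never has to
# be held in memory at once.
count_bigrams_chunked <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  totals <- numeric(0)  # named vector: bigram -> running count

  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0) break  # end of file

    # Assumption: lowercase and keep only ASCII letters, apostrophes, spaces.
    clean <- gsub("[^a-z' ]+", " ", tolower(lines))
    tokens <- strsplit(trimws(clean), " +")

    # Bigrams never cross a line boundary in this sketch.
    grams <- unlist(lapply(tokens, function(tok) {
      if (length(tok) < 2) return(character(0))
      paste(tok[-length(tok)], tok[-1], sep = "_")
    }))
    if (length(grams) == 0) next

    # Merge this chunk's counts into the running totals.
    merged <- c(totals, setNames(rep(1, length(grams)), grams))
    totals <- vapply(split(merged, names(merged)), sum, numeric(1))
  }

  sort(totals, decreasing = TRUE)
}
```

Calling `count_bigrams_chunked("final/corpus/some_file.txt")` (the file name here is a placeholder) would return a named vector of counts that could be turned into a data frame in the same way as `bigrams_freq` above. The table of unique bigrams still grows with the corpus, so this sketch only avoids holding the raw text in memory; it is not tuned for speed.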