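【Posted】: 2017-08-10 17:17:45
【Question】:
I have existing code that builds a table of all the bigrams in a document, but it strips out apostrophes. How can I adjust this code so that contractions such as "I've" are treated as a single word?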
text1 = scan(file.choose(), what="character",sep="\n")
text1 <- tolower(text1)
tokens <- unlist(strsplit(text1, "[^a-z]+"))
tokens <- tokens[tokens != ""]
tokens2 <- c(tokens[-1], ".")
bigrams <- paste(tokens, tokens2)
freq <- sort(table(bigrams), decreasing=T)
write.csv(file = "bigram count.csv" , x=freq, row.names = FALSE)
For example, the phrase "I've had fun" should produce 'i've had' and 'had fun'.
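One possible adjustment, offered as a sketch rather than a definitive fix: the split pattern `[^a-z]+` treats the apostrophe as a separator, so adding `'` to the character class keeps contractions intact. The version below assumes the text uses the straight apostrophe `'` (a curly `’` would still be split), uses a hard-coded sample string in place of `file.choose()`, and pairs each token only with its successor, so it does not produce the trailing "fun ." bigram that the original code generates.

```r
# Sketch: keep apostrophes inside tokens by allowing ' in the token class.
# Assumption: the sample string below stands in for the file read via scan().
text1 <- "I've had fun"
text1 <- tolower(text1)

# Split on runs of anything that is not a lowercase letter or an apostrophe.
tokens <- unlist(strsplit(text1, "[^a-z']+"))

# Drop apostrophes left at the edges of a token (e.g. from quoted words),
# then drop any empty strings.
tokens <- gsub("^'+|'+$", "", tokens)
tokens <- tokens[tokens != ""]

# Pair each token with the one that follows it.
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
freq <- sort(table(bigrams), decreasing = TRUE)
print(bigrams)
```

The edge-trimming step is optional; it only matters if the text contains words wrapped in single quotes, which would otherwise be counted as different tokens from their unquoted forms.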
【Discussion】:
Tags: r text text-analysis