【Question title】: R dynamic stop word list with terms of frequency one
【Posted】: 2015-09-23 23:57:17
【Question】:

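I am working on a text mining task and am now stuck. The following is based on Zhao's Text Mining with Twitter. I cannot get it to work; maybe one of you has a good idea?

Goal: I want to remove all terms with a count of one from the corpus, instead of using a stop word list.

What I have done so far: I have downloaded the tweets and converted them into a data frame.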

tf1 <- Corpus(VectorSource(tweets.df$text))


tf1 <- tm_map(tf1, content_transformer(tolower))


removeUser <- function(x) gsub("@[[:alnum:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeUser))


removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeNumPunct))


removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
tf1 <- tm_map(tf1, content_transformer(removeURL))

tf1 <- tm_map(tf1, stripWhitespace)


# Using TermDocumentMatrix to find terms with count 1; I don't know any other way
tdmtf1 <- TermDocumentMatrix(tf1, control = list(wordLengths = c(1, Inf)))

ones <- findFreqTerms(tdmtf1, lowfreq = 1, highfreq = 1)

tf1Copy <- tf1

tf1List <- setdiff(tf1Copy, ones)


tf1CList <- paste(unlist(tf1List),sep="", collapse=" ")

tf1Copy <- tm_map(tf1Copy, removeWords, tf1CList)

tdmtf1Test <- TermDocumentMatrix(tf1Copy, control = list(wordLengths = c(1, Inf)))

#Just to test success...
ones2 <- findFreqTerms(tdmtf1Test, lowfreq = 1, highfreq = 1)
(ones2)

Error:

gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE)), : invalid regular expression '(*UCP)\b(senior data scientist global strategic firm
25.0010230541229 48 17 6 6 115 1 186 0 1 en kdnuggets poll primary programming language for analytics data mining data science
25.0020229816437 48 17 6 6 115 1 186 0 2 en iapa canberra seminar mining the internet of everything official statistics in the information age june
25.0020229816437 48 17 6 6 115 1 186 0 3 handling and processing strings in an ebook pdf format pages
25.0020229816437 48 17 6 6 115 1 186 0 4 webinar getting your data into r by hadley wickham am edt june th
25.0020229816437 48 17 6 6 115 1 186 0 5 before loading the rdmtweets dataset please run librarytwitter to load required package
25.0020229816437 48 17 6 6 115 1 186 0 6 an infographic on sas vs r vs python data science via
25.0020229816437 48 17 6 6 115 1 186 0 7 en r is again the kdnuggets poll on top analytics data mining science software
25.0020229816437 48 17 6 6 115 1 186 0 8 i will run

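In addition:

Warning message: In gsub(sprintf("(*UCP)\b(%s)\b", paste(sort(words, decreasing = TRUE), : PCRE pattern compilation error 'regular expression is too large' at ''

PS: Sorry for the bad formatting of the error at the end; I could not fix it.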

【Comments】:

Tags: r text-mining tm stop-words


【Solution 1】:

Here is one way to remove all terms with a count of one from the corpus:

library(tm)
mytweets <- c("This is a doc", "This is another doc")

corp <- Corpus(VectorSource(mytweets))
inspect(corp)
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# This is a doc
# 
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
#   This is another doc
##            ^^^ 

dtm <- DocumentTermMatrix(corp)
inspect(dtm)
# Terms
# Docs another doc this
# 1       0   1    1
# 2       1   1    1

(stopwords <- findFreqTerms(dtm, 1, 1))
# [1] "another"

corp <- tm_map(corp, removeWords, stopwords)
inspect(corp)
# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# This is a doc
# 
# [[2]]
# <<PlainTextDocument (metadata: 7)>>
# This is  doc
##        ^ 'another' is gone

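(As a side note: the tokens 'a' and 'is' from 'This is a...' are also missing from the document-term matrix, because DocumentTermMatrix by default drops tokens shorter than 3 characters. They remain in the corpus itself.)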
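The same idea can also be sketched in base R, without tm. This is only an illustration under a few assumptions: the texts are already lowercased and cleaned, terms are separated by whitespace, and `remove_singletons` is a hypothetical helper name, not a tm function. Because it filters tokens by membership instead of joining all words into one pattern, it does not build the large regular expression that triggers the "regular expression is too large" error shown in the question.

```r
# Hypothetical helper (not part of tm): drop every term that occurs
# exactly once across all documents, without building one large regex.
remove_singletons <- function(texts) {
  # Split each document on runs of whitespace
  tokens <- strsplit(trimws(texts), "[[:space:]]+")
  # Count every term across the whole collection
  counts <- table(unlist(tokens))
  singletons <- names(counts)[counts == 1]
  # Keep only tokens that are not singletons, then re-join each document
  vapply(tokens,
         function(tok) paste(tok[!tok %in% singletons], collapse = " "),
         character(1))
}

docs <- c("this is a doc", "this is another doc")
cleaned <- remove_singletons(docs)
print(cleaned)
```

Unlike the DocumentTermMatrix route, this sketch counts tokens of every length, so the single-occurrence 'a' is removed along with 'another'.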

【Comments】:

    【Solution 2】:

    Here is a simpler way, using the dfm() and trim() functions from the quanteda package:

    require(quanteda)
    
    mydfm <- dfm(c("This is a doc", "This is another doc"), verbose = FALSE)
    mydfm
    ## Document-feature matrix of: 2 documents, 5 features.
    ## 2 x 5 sparse Matrix of class "dfmSparse"
    ## features
    ## docs    a another doc is this
    ## text1 1       0   1  1    1
    ## text2 0       1   1  1    1
    
    trim(mydfm, minCount = 2)
    ## Features occurring less than 2 times: 2 
    ## Document-feature matrix of: 2 documents, 3 features.
    ## 2 x 3 sparse Matrix of class "dfmSparse"
    ## features
    ## docs    doc is this
    ## text1   1  1    1
    ## text2   1  1    1
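
The calls above reflect the quanteda API at the time of the answer. As far as I know, later quanteda releases replaced trim() with dfm_trim() and expect dfm() to be given a tokens object, so a rough equivalent might look like the sketch below. The argument name `min_termfreq` is taken from those later releases and should be checked against the installed version.

```r
library(quanteda)

# Tokenize first; recent quanteda versions build a dfm from tokens
toks <- tokens(c("This is a doc", "This is another doc"))
mydfm <- dfm(toks)

# Keep only features occurring at least twice across all documents
trimmed <- dfm_trim(mydfm, min_termfreq = 2)
print(featnames(trimmed))
```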
    

    【Comments】:
