How to find the frequency of a tag or a word in R: answers

【Question title】: How to find the frequency of a tag or a word in R
【Posted】: 2020-10-20 16:12:54
【Question description】:

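I am working with a Stack Overflow data dump .csv file, and I need to find:

The 8 most frequent tags in the dataset. To do this, I look at the set of tags associated with each row in the data1.PostTypeId column. The frequency of a tag equals the number of questions tagged with it (that is, the number of rows that contain the tag).

Note 1: the file is very large, more than a million rows.

Note 2: I am a beginner in R, so I need the simplest approach. I tried the table function, but I only get a list of tags and cannot work out which ones are at the top.

Here is an example of the table I am using:

For example, "java" has the highest frequency (because it appears in the largest number of rows).

Then the tag "python-3.x" has the second-highest frequency (because it appears in the second-largest number of rows). So basically I need to check the second column of the table and find the top 8 tags in it.

And so on…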

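【Comments on the question】:

Please provide a reproducible example and the expected result.
I added an example.
Where? I don't see it. Please follow the link I gave you to make a reproducible example.
"Reproducible" is the key word here, @user8863554.

Tags: r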

【Solution 1】:

Using base R, with the (optional) magrittr pipe for readability:

library(magrittr)
# Make a vector of all the tags present in data
tags_sep <- tags %>%
  strsplit("><") %>%
  unlist()
# Clean out the remaining < and >
tags_sep <- gsub("<|>", "", tags_sep)
# Frequency table sorted
tags_table <- tags_sep %>%
  table() %>%
  sort(decreasing = TRUE)
# Print the top 10 tags
tags_table[1:10]

               java             android          amazon-ec2 amazon-web-services android-mediaplayer
                  4                   2                   1                   1                   1
              antlr              antlr4        apache-kafka              appium             asp.net
                  1                   1                   1                   1                   1

Data

tags <- c(
  "<java><android><selenium><appium>",
  "<java><javafx><javafx-2>",
  "<apache-kafka>",
  "<java><spring><eclipse><gradle><spring-boot>",
  "<c><stm32><led>",
  "<asp.net>",
  "<python-3.x><python-2.x>",
  "<http><server><localhost><ngrok>",
  "<java><android><audio><android-mediaplayer>",
  "<antlr><antlr4>",
  "<ios><firebase><swift3><push-notification>",
  "<amazon-web-services><amazon-ec2><terraform>",
  "<xamarin.forms>",
  "<gnuplot>",
  "<rx-java><rx-android><rx-binding>",
  "<vim><vim-plugin><syntastic>",
  "<plot><quantile>",
  "<node.js><express-handlebars>",
  "<php><html>"
)
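
Since the question asks for the top 8 tags, the same base-R steps can be wrapped in a small function. The following is only a sketch: the helper name `top_tags` and the `example_tags` vector are made up for illustration, and it assumes the tags arrive as a character vector in the `<a><b>` format shown above.

```r
# Hypothetical helper: count how many rows carry each tag
# and return the n most frequent ones.
top_tags <- function(tags, n = 8) {
  # Extract every run of characters that sits between "<" and ">"
  matches <- regmatches(tags, gregexpr("[^<>]+", tags))
  # Count a tag at most once per row, so the count equals the number of rows
  per_row <- lapply(matches, unique)
  counts <- sort(table(unlist(per_row)), decreasing = TRUE)
  head(counts, n)
}

# Small illustrative input (not taken from the real data dump)
example_tags <- c("<java><android>", "<java><javafx>", "<python-3.x>", "<java>")
res <- top_tags(example_tags, n = 2)
res
```

With the real file, `tags` would be the tag column of the data frame read from the CSV. Tags that tie on frequency may come out in any order, so it can be worth printing a few more than 8 to see whether the cutoff falls inside a tie.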

【Discussion】:

【Solution 2】:

If I understand correctly, this should solve your problem:

library(stringr)
library(data.table)

# some dummy data
dat = data.table(id = 1:3, tags = c("<java><android><selenium>",
                                    "<java><javafx>",
                                    "<apache><android>"))

tags = apply(str_split(dat$tags, pattern = "><", simplify = TRUE),
             2, function(x) str_replace_all(x, "<|>", "")) # one tag per column, with every < and > removed

foo = cbind(dat[, .(id)], tags) # add the separated tags to the data
foo[foo==""] = NA # substitute empty strings with NA
foo = melt(foo, id.vars = "id", na.rm = TRUE) # transform to long format and drop the NA padding
foo = foo[, .N, by = value] # calculate frequency
head(foo[order(-N)], n = 2) # sort by descending frequency; change "n" to the number of tags you want

     value N
1:    java 2
2: android 2
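
For a file with more than a million rows, it may be simpler to skip the wide intermediate table and go straight to one row per (id, tag). This is a minimal sketch of that idea, not the answer author's method. It reuses the dummy data above; the names `long` and `freq` are illustrative.

```r
library(data.table)

# Same dummy data as above
dat <- data.table(id = 1:3, tags = c("<java><android><selenium>",
                                     "<java><javafx>",
                                     "<apache><android>"))

# One row per (id, tag): strip the outer "<" and ">", then split on "><"
long <- dat[, .(tag = unlist(strsplit(gsub("^<|>$", "", tags), "><", fixed = TRUE))),
            by = id]

# Number of rows per tag, sorted by descending frequency
freq <- long[, .N, by = tag][order(-N)]

# The 8 most frequent tags (fewer if the data has fewer distinct tags)
head(freq, 8)
```

Because each input row is split directly into its own tags, no empty padding cells are created and there is nothing to convert to NA afterwards.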

【Discussion】: