Searching 1 million lines of a CSV file in R for a group of 20 words [closed]

【Question title】: Search for a group of 20 words in each of 1 million lines of a CSV file in R [closed]
【Posted】: 2021-10-24 23:32:54
【Question description】:

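I am trying to search a CSV file containing 1 million lines of text for 20 to 30 words in R.

I have stored the words as keys and assigned a value to each one. I want to find every line that contains any of these words, and add a column holding the accumulated score for that line.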

word <- c("U.S. Capital", "Biden", "Congress", "Marines", "Senate", "Santa")

value <- c(-0.5, -0.6, -0.4, -0.2, -0.4, -0.03)

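【Question comments】:

Are you trying to operate on the raw file, or are you reading it into R? Can you give a sample (a dozen lines is enough) of what the file and/or the imported data looks like, along with the expected output for that sample? (It would help to vary the sample data so that some lines match and some do not.)

Tags: r key detect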

【Solution 1】:

Welcome to StackOverflow! If you add more detail I can refine my answer, but here is something to get you started.

library(data.table)

## Load your csv file
#search_in <- fread("path/to/file.csv")

## In lieu of a csv, create a table of example text values to search within
search_in <- data.table(text=c(
  "Visit the U.S. Capital and see Congress in action",
  "Santa Clause is (a) real (movie)",
  "The Marines were founded in 1775",
  "What does the fox say?",
  "The United States Senate is the upper chamber of the United States Congress"))

## Create a table of your search terms and the corresponding values
search_for <- data.table(
  word=c("U.S. Capital", "Biden", "Congress", "Marines", "Senate", "Santa"),
  value=c(-0.5, -0.6, -0.4, -0.2, -0.4, -0.03))

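## Cross join every text line with every search term (via a constant id column),
## flag the pairs where the term occurs in the text, then collapse the matching
## pairs to one row per text with the matched words and the summed value.
## Note: %like% treats each word as a regular expression, not a literal string.
search_res <- merge(search_in[, id:=1L], search_for[, id:=1L], by="id", allow.cartesian=TRUE)[, 
  match:=text %like% word, by=.(text, word, value)][
    match==TRUE, .(words=paste(sort(word), collapse=", "), value=sum(value)), by=text]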

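## Drop the helper id column and left-join the results back onto the original text.
## merge() for data.table takes `by`, not `on`.
search_res <- merge(search_in[, -"id"], search_res, by="text", all.x=TRUE)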
search_res

## Example output (row order may differ, since merge() sorts by the join column by default):
##                                                                          text                  words value
##1:                           Visit the U.S. Capital and see Congress in action Congress, U.S. Capital -0.90
##2:                                            Santa Clause is (a) real (movie)                  Santa -0.03
##3:                                            The Marines were founded in 1775                Marines -0.20
##4: The United States Senate is the upper chamber of the United States Congress       Congress, Senate -0.80
##5:                                                      What does the fox say?                   <NA>    NA

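The first statement that creates search_res joins every row of search_in with every row of search_for, adds a column indicating whether the search term matches the text column, subsets to the matching rows, and then sums the values for each text.

The final merge() call joins the original search_in back onto the results, so you can also see the text lines that matched no keyword.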
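
One caveat on the matching step: %like% matches by regular expression, so each dot in a term such as "U.S. Capital" matches any character. The sketch below contrasts that with literal matching through grepl(..., fixed = TRUE). The two example strings are made up for illustration and are not from the question, and whether literal matching is what is wanted here is an assumption.

```r
# Two illustrative strings (hypothetical, not from the question's data).
texts <- c("Visit the U.S. Capital", "Visit the UXSY Capital")

# Default: the pattern is a regular expression, so "." matches any character.
regex_hits <- grepl("U.S. Capital", texts)

# fixed = TRUE: the pattern is taken literally, dots included.
fixed_hits <- grepl("U.S. Capital", texts, fixed = TRUE)

print(regex_hits)
print(fixed_hits)
```

The regular-expression version also flags the second string, while the literal version flags only the first.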

Depending on the size of your data, this may be sufficient. If you are on Linux or macOS, you could also look into grep or a similar bash solution.
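
If the cross join becomes too large (1 million lines times 30 words gives 30 million intermediate rows), looping over the search words is one possible alternative. The following is a minimal sketch in base R, not the method used in the answer: it assumes literal, case-sensitive matching is acceptable, and unlike the merge above it gives unmatched lines a score of 0 rather than NA. The names text, score, and result are illustrative.

```r
# Example data mirroring the answer above; in practice `text` would be the
# text column read from the CSV file.
text <- c(
  "Visit the U.S. Capital and see Congress in action",
  "Santa Clause is (a) real (movie)",
  "The Marines were founded in 1775",
  "What does the fox say?",
  "The United States Senate is the upper chamber of the United States Congress")

word  <- c("U.S. Capital", "Biden", "Congress", "Marines", "Senate", "Santa")
value <- c(-0.5, -0.6, -0.4, -0.2, -0.4, -0.03)

# Running total per line; lines with no match stay at 0.
score <- numeric(length(text))

# One vectorised pass over all lines for each search word.
for (i in seq_along(word)) {
  hit <- grepl(word[i], text, fixed = TRUE)  # literal, case-sensitive match
  score[hit] <- score[hit] + value[i]
}

result <- data.frame(text = text, score = score)
print(result$score)
```

This keeps memory use proportional to the number of lines, at the cost of one scan of the text per search word.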

【Comments】:

Thanks for your help. I will try the solution and get back to you.