How to get the first 10 words in a string in R?

【Question title】: How to get the first 10 words in a string in R?
【Posted】: 2014-01-12 22:13:11
【Question】:

I have a string in R:

x <- "The length of the word is going to be of nice use to me"

I want the first 10 words of the string above.

For example, I have a CSV file in the following format:

Keyword,City(Column Header)
The length of the string should not be more than 10,New York
The Keyword should be of specific length,Los Angeles
This is an experimental basis program string,Seattle
Please help me with getting only the first ten words,Boston

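I only want to take the first 10 words from the "Keyword" column of each row and write them to a CSV file. Please help me with this.

【Discussion】:

Tags: r csv

【Solution 1】:

A regular expression (regex) answer using \w (word character) and its negation \W: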

gsub("^((\\w+\\W+){9}\\w+).*$","\\1",x)

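^ Start of the string (zero-width)
((\\w+\\W+){9}\\w+) Ten words separated by non-word characters.
1. (\\w+\\W+){9} A word followed by non-word characters, 9 times
  1. \\w+ One or more word characters (i.e. a word)
  2. \\W+ One or more non-word characters (i.e. whitespace)
  3. {9} Nine repetitions
2. \\w+ The tenth word
.* Anything else, including any words that follow
$ End of the string (zero-width)
\\1 When this pattern matches, replace the match with the first captured group (the 10 words)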
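
As a quick check of the breakdown above, here is a minimal sketch that applies the pattern to the string from the question. The second input, `short`, is an assumption added for illustration: it has fewer than ten words, so the pattern does not match and the string comes back unchanged.

```r
x <- "The length of the word is going to be of nice use to me"
pattern <- "^((\\w+\\W+){9}\\w+).*$"

# Group 1 captures the first ten words; everything after them is dropped.
first_ten <- gsub(pattern, "\\1", x)
print(first_ten)
# [1] "The length of the word is going to be of"

# Hypothetical input with fewer than ten words: no match, so nothing is replaced.
short <- "only three words"
short_result <- gsub(pattern, "\\1", short)
print(short_result)
# [1] "only three words"
```

Because the pattern is anchored with ^, it can match at most once per string, so sub() would behave the same way here.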

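【Discussion】:

Works perfectly for me. How can I learn to understand regular expressions for future use?
Why not just gsub("^((\\w+\\W+){10}).*","\\1",x)?
@thelatemail That would include the trailing whitespace (if present), although the suggested approach would also work when the string ends in trailing whitespace but has no more than 10 words in total.
But changing the regex to "^((\\w+\\W+){0,9}\\w+).*" would also fix that.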
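
The three patterns mentioned in these comments can be compared side by side. This is a rough sketch rather than a definitive test: the variable names and the short input with a trailing space are assumptions made for illustration.

```r
x <- "The length of the word is going to be of nice use to me"

p_answer <- "^((\\w+\\W+){9}\\w+).*$"   # pattern from the answer
p_ten    <- "^((\\w+\\W+){10}).*"       # variant proposed in the comment
p_flex   <- "^((\\w+\\W+){0,9}\\w+).*"  # variant that also accepts fewer than ten words

r_answer <- gsub(p_answer, "\\1", x)
r_ten    <- gsub(p_ten, "\\1", x)
r_flex   <- gsub(p_flex, "\\1", x)

# The {10} variant keeps the separator that follows the tenth word,
# so its result is one character longer than the answer's result.
print(nchar(r_ten) - nchar(r_answer))

# Hypothetical short input that ends in a space.
short <- "only three words "
short_answer <- gsub(p_answer, "\\1", short)  # no match: returned unchanged
short_flex   <- gsub(p_flex, "\\1", short)    # match: trailing space is dropped
print(c(short_answer, short_flex))
```

On the long string the answer's pattern and the {0,9} variant should agree; they differ only for inputs with fewer than ten words.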

【Solution 2】:

How about using the word function from Hadley Wickham's stringr package?

library(stringr)
word(string = x, start = 1, end = 10, sep = fixed(" "))

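【Discussion】:

【Solution 3】:

Here is a small function that splits the string into words, takes the first ten as a subset, and then pastes them back together.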

string_fun <- function(x) {
  # Trim the string, split it on runs of whitespace, and keep at most ten words
  ul = head(unlist(strsplit(trimws(x), split = "\\s+")), 10)
  paste(ul,collapse=" ")
}

string_fun(x)

df <- read.table(text = "Keyword,City(Column Header)
The length of the string should not be more than 10 is or are in,New York
The Keyword should be of specific length is or are in,Los Angeles
This is an experimental basis program string is or are in,Seattle
Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE, stringsAsFactors = FALSE)

Using apply (every keyword here has at least ten words, so the function never reaches the second column):

df$Keyword <- apply(df[,1:2], 1, string_fun)

Edit: this is probably a more general way to use the function.

df$Keyword <- unlist(lapply(df[,1], string_fun))

print(df)
#                                                  Keyword City.Column.Header.
# 1       The length of the string should not be more than            New York
# 2     The Keyword should be of specific length is or are         Los Angeles
# 3 This is an experimental basis program string is or are             Seattle
# 4   Please help me with getting only the first ten words              Boston
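
The question also asks for the result to be written back to a CSV file, which the code above does not show. The following is a minimal sketch of that round trip, under stated assumptions: the helper name `first_n_words`, the temporary file paths, and the simplified "City" header are all hypothetical and stand in for the real file.

```r
# Hypothetical helper: keep at most the first n whitespace-separated words.
first_n_words <- function(s, n = 10) {
  words <- unlist(strsplit(trimws(s), "\\s+"))
  paste(head(words, n), collapse = " ")
}

# Stand-in for the real input: write two sample rows to a temporary CSV file.
in_file  <- tempfile(fileext = ".csv")
out_file <- tempfile(fileext = ".csv")
writeLines(c(
  "Keyword,City",
  "The length of the string should not be more than 10 is or are in,New York",
  "Please help me with getting only the first ten words,Boston"
), in_file)

# Read, truncate the Keyword column row by row, and write the result back out.
df <- read.csv(in_file, stringsAsFactors = FALSE)
df$Keyword <- vapply(df$Keyword, first_n_words, character(1), USE.NAMES = FALSE)
write.csv(df, out_file, row.names = FALSE)

print(readLines(out_file))
```

vapply is used instead of apply so that only the Keyword column is ever passed to the helper, which keeps the City values out of the word count even when a keyword is short.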

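【Discussion】:

The second line of the function can be simplified to: paste(ul,collapse=" ")
Which library does unlist come from in R? @Martin Bel
unlist() is in base, so there is nothing to load! Read the documentation with ?unlist

【Solution 4】: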

x <- "The length of the word is going to be of nice use to me"
head(strsplit(x, split = "\ "), 10)

【Discussion】:

Right idea, but not quite correct. Try head(unlist(strsplit(x, split = "\\s+")),10)
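
To illustrate why the unlist() step in that comment matters, here is a small sketch. The variable names `pieces` and `words` are chosen for illustration only.

```r
x <- "The length of the word is going to be of nice use to me"

# strsplit() returns a list with one element per input string,
# so head() on it keeps the whole length-one list and truncates nothing.
pieces <- strsplit(x, split = "\\s+")
print(length(head(pieces, 10)))
# [1] 1

# Unlisting first gives a character vector of words that head() can truncate.
words <- head(unlist(pieces), 10)
print(length(words))
# [1] 10

# Paste the words back together to get a single string again.
print(paste(words, collapse = " "))
# [1] "The length of the word is going to be of"
```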