Answers: word length manipulation in a data frame

【Question title】: Word length manipulation in data frame
【Posted】: 2020-06-20 16:07:44
【Question description】:

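I am printing a data frame that should list the words, their lengths, and their frequencies for any simple text document. I have everything set up, but 1) Len does not count the number of characters, and I am not sure what it is actually counting; 2) I need to reorder the word list from the longest word to the shortest so that I can print the list at the end.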

file <- c(scan("a.txt",character()))
file <- as.data.frame(table(file))

Freq <- file$Freq
Word <- file$file
Len <- sapply(c(Word),nchar)

A plain a.txt file containing the following:

the the the bus ran over two two people and when

prints

    Word Len Freq
1    and   1    1
2    bus   1    1
3   over   1    1
4 people   1    1
5    ran   1    1
6    the   1    3
7    two   1    2
8   when   1    1

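Len should be the number of letters in each word, but here it is always 1. In longer tests it sometimes reports 2, so I am not sure what it is counting. After that, it prints: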

[1] and    bus    over   people ran    the    two    when  
Levels: and bus over people ran the two when

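I am trying to print the full list of words ordered from longest to shortest. I should be able to sort the words by Len, but I cannot get sapply to work properly.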

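【Question comments】:

Could you add dput(head(file)) to the question, just the first few rows, along with the expected output?
Is Word a vector? If so, you should be able to call nchar on it directly.

Tags: r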

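【Solution 1】:

You can convert file$file (which is a factor here) to character with as.character() and then count its characters with a plain nchar() call, without sapply(), because nchar() is vectorized.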

file <- c(scan("a.txt",character()))
file <- as.data.frame(table(file))

Freq <- file$Freq
Word <- as.character(file$file)

Len <- nchar(Word)

x <- data.frame(Word, Len, Freq)
print(x)
print(Word[order(Len, decreasing = TRUE)])

The ordering is done with order().

Result:

print(x)
#     Word Len Freq
# 1    and   3    1
# 2    bus   3    1
# 3   over   4    1
# 4 people   6    1
# 5    ran   3    1
# 6    the   3    3
# 7    two   3    2
# 8   when   4    1

print(Word[order(Len, decreasing = TRUE)])
# [1] "people" "over"   "when"   "and"    "bus"    "ran"    "the"    "two"
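A likely explanation for the Len values of 1 in the question, assuming an R version before 4.1: there, c() applied to a factor returned the factor's underlying integer codes, so nchar() counted the digits of 1, 2, 3, ... rather than the letters of each word. The sketch below reproduces that effect with as.integer(), which behaves the same way in every R version. The word vector is a hypothetical stand-in for the contents of a.txt.

```r
# Hypothetical stand-in for the words read from a.txt
words <- c("the", "the", "the", "bus", "ran", "over",
           "two", "two", "people", "and", "when")
tab <- as.data.frame(table(words))

class(tab$words)                  # "factor": the default for as.data.frame() on a table

# Integer codes stored behind the factor levels (1 to 8 here)
codes <- as.integer(tab$words)
nchar(codes)                      # number of digits in each code, so all 1

# Converting to character first gives the real word lengths
lens <- nchar(as.character(tab$words))
lens
```

This would also account for the occasional 2 in longer texts: once a file has ten or more distinct words, some codes have two digits.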

【Comments】:

Is there a way to sort the data frame by Len as well?
@user12027316 Yes, try x[order(Len, decreasing = T), ]
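The one-liner in the comment works as long as the standalone Len vector is still in the workspace. A slightly more defensive sketch refers to the column inside the data frame and breaks ties alphabetically. The data frame below is typed in by hand from the result shown in this answer, and the name sorted is only illustrative.

```r
# Data frame rebuilt by hand from the printed result above
x <- data.frame(
  Word = c("and", "bus", "over", "people", "ran", "the", "two", "when"),
  Len  = c(3L, 3L, 4L, 6L, 3L, 3L, 3L, 4L),
  Freq = c(1L, 1L, 1L, 1L, 1L, 3L, 2L, 1L),
  stringsAsFactors = FALSE
)

# Longest words first; words of equal length in alphabetical order
sorted <- x[order(-x$Len, x$Word), ]
row.names(sorted) <- NULL         # renumber the rows 1..n after sorting
print(sorted)
```

Negating x$Len sorts that key in decreasing order while leaving the second key, x$Word, in increasing order.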

【Solution 2】:

Using text generated by Lorem Ipsum, this sequence of instructions does what the question asks.

Word <- scan(file = 'a.txt', what = character())

Word <- gsub('[[:punct:]]', '', Word)    # remove punctuation characters
Word <- tolower(Word)                    # all characters lower case
tbl <- table(Word)                       # now get their frequencies
Len <- nchar(names(tbl))                 # the words are the table's names
x <- as.data.frame(tbl)                  # to data.frame
x$Len <- Len                             # assign the lengths column

The data is now in lexicographic order. If the class of x$Word is "factor", pass the argument stringsAsFactors = FALSE in the call to as.data.frame.

Finally, sort by Len and assign new row numbers.

x <- x[order(x$Len, decreasing = TRUE), ]
row.names(x) <- NULL
head(x)
#          Word Freq Len
#1 sollicitudin    3  12
#2 pellentesque    4  12
#3  ullamcorper    5  11
#4  suspendisse    1  11
#5  scelerisque    2  11
#6  consectetur    2  11
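For readers without an a.txt at hand, the same steps can be tried on an inline string. This is only a sketch: the sample sentence is made up, and scan() reads it through its text argument instead of a file. It also passes stringsAsFactors = FALSE, as suggested in this answer, so that x$Word is a character column and nchar() can be applied to it directly.

```r
# Made-up sample text in place of a.txt
txt <- "The bus, the bus! The people ran over."

Word <- scan(text = txt, what = character(), quiet = TRUE)
Word <- tolower(gsub("[[:punct:]]", "", Word))   # strip punctuation, lower-case

# Keep the words as character strings rather than a factor
x <- as.data.frame(table(Word), stringsAsFactors = FALSE)
x$Len <- nchar(x$Word)

# Longest words first, then renumber the rows
x <- x[order(x$Len, decreasing = TRUE), ]
row.names(x) <- NULL
print(x)
```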

【Comments】:

【Solution 3】:

I do not have your data, but you could do something like this. $ extracts data by name, so file$Freq gets the column Freq from the data.frame file.

file$Len <- nchar(as.character(file$file))  # file$file is a factor here, so convert it first

x <- file[,c('file', 'Len', 'Freq')]
names(x) <- c('Word', 'Len', 'Freq')

【Comments】:

【Solution 4】:

length() counts the number of elements in a vector. For example:

x <- c("apple", "pie", "math", "this is sentence")
x
[1] "apple"            "pie"              "math"             "this is sentence"
length(x)
[1] 4

x is a character vector of length 4 (it has 4 elements). To count the characters in each element of a character vector, use nchar():

nchar(x)
[1]  5  3  4 16

As you can see, nchar() is vectorized: it counts the characters (not just letters) in each element of the character vector.
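Because nchar() counts every character, spaces and punctuation are included in the totals. If only letters should count, one option (a sketch, not something the original answer proposes) is to remove everything else with gsub() first. The vector and the name letters_only are illustrative.

```r
x <- c("two words", "don't", "bus")

nchar(x)                                   # 9 5 3: the space and apostrophe are counted

# Drop every character that is not a letter, then count what is left
letters_only <- gsub("[^[:alpha:]]", "", x)
nchar(letters_only)                        # 8 4 3
```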

【Comments】: