R - change column content with function to new content

【Question title】: R - change column content with function to new content
【Posted】: 2020-11-19 17:55:32
【Question description】:

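Well, my problem is that I have a data frame with a column of species names like this:

Genus1 species1
Genus1 species2
Genus2 species3

I simply want to truncate the genus name like this:

G. species1
G. species2
G. species3

My problem is not the string handling. I just can't figure out how to simply apply my function, which looks like this: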

truncateGenusName <- function(x){
  genus <- str_sub(x, 1, 1) 
  posOfSpace <- str_locate(x, " ")[1] 
  epitheton <- str_sub(x, posOfSpace, str_length(x)) 
  paste0(genus, '. ', epitheton)
}

I tried to call the function like this:
mutate(data, species = replace (species, TRUE, truncateGenusName(species)))

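The problem with this call is that what gets passed to the function is not a single string but a list.

I suspect my approach is too complicated. Maybe someone knows a shortcut?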
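
A note on the call above: inside mutate() the function receives the whole column as a character vector, and `str_locate(x, " ")[1]` picks only the first cell of the matrix that str_locate() returns, so the space position of the first name is reused for every row. The following is only a sketch of how the function could be vectorized, assuming stringr is loaded; the name `truncate_genus_name` and the sample vector are illustrative, not part of the original post.

```r
library(stringr)

# Vectorized sketch: works on a whole character vector at once.
truncate_genus_name <- function(x) {
  genus <- str_sub(x, 1, 1)
  # str_locate() returns a matrix with one row per string;
  # take the full "start" column instead of a single cell.
  pos_of_space <- str_locate(x, " ")[, "start"]
  # Start one character after the space so the space is not duplicated.
  epithet <- str_sub(x, pos_of_space + 1)
  paste0(genus, ". ", epithet)
}

# Illustrative input with genus names of different lengths.
species <- c("Genus1 species1", "Genus1 species2", "Longergenus2 species3")
truncated <- truncate_genus_name(species)
truncated
#> [1] "G. species1" "G. species2" "L. species3"
```

With a function like this, the replace() wrapper should not be needed; a call such as `mutate(data, species = truncate_genus_name(species))` would overwrite the column in place.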

【Question discussion】:

Tags: r dplyr

【Solution 1】:

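Here is a solution using dplyr, tidyr and stringr.

separate from the tidyr package is used to split the names column into two new columns, genus and species.

str_trunc from the stringr package truncates the strings in the genus column. Note that by default it appends a three-dot ellipsis to indicate that the string has been truncated. Using ellipsis = "." overrides the default so that only a single dot is shown.

paste has a default of sep = " ", so I think it is preferable to paste0 here.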
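
To make the two points about str_trunc and paste concrete, here is a small sketch on single strings. The variable names are illustrative only.

```r
library(stringr)

# Default ellipsis is "...", which counts toward the total width.
default_ellipsis <- str_trunc("Genus1", 4)
default_ellipsis
#> [1] "G..."

# A one-character ellipsis leaves room for the first letter at width 2.
short_ellipsis <- str_trunc("Genus1", 2, ellipsis = ".")
short_ellipsis
#> [1] "G."

# paste() inserts a space by default; paste0() does not.
with_space <- paste(short_ellipsis, "species1")
without_space <- paste0(short_ellipsis, "species1")
with_space
#> [1] "G. species1"
without_space
#> [1] "G.species1"
```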

library(tibble) # Used to create a reproducible example. 
library(dplyr)
library(tidyr)
library(stringr)

# create a reproducible example dataset using the tibble package. 

df <- tibble(
  names = c("Genus1 species1",
            "Genus1 species2",
            "Genus1 species3")
)


# code to format the names column. 

df %>% 
  separate(names, c("genus", "species"), sep = " ") %>% 
  mutate(genus = str_trunc(genus, 2, ellipsis = ".")) %>% 
  mutate(
    genus_species = paste(genus, species)
  ) %>% 
  rename(names = genus_species) %>%
  select(names)

# output table

#> # A tibble: 3 x 1
#>   names      
#>   <chr>      
#> 1 G. species1
#> 2 G. species2
#> 3 G. species3

Created on 2020-11-19 by the reprex package (v0.3.0)

Original table:

#> # A tibble: 3 x 1
#>   names          
#>   <chr>          
#> 1 Genus1 species1
#> 2 Genus1 species2
#> 3 Genus1 species3

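【Discussion】:

OK, this is almost what I want, but I'm not sure whether it can also work without creating new columns? I would just like to reduce overhead and simply reuse the column.
@basto - I have updated my solution. Depending on your production dataset, you may need to add select(names, everything()).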
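
As a possible variation on reusing the column, tidyr's unite() can glue the two helper columns back together and drops them by default, so no rename() or select() step is needed. This is only a sketch and not the answerer's code; the `site` column is a hypothetical extra column added to show that other columns are left alone.

```r
library(dplyr)
library(tidyr)
library(stringr)

df <- tibble(
  names = c("Genus1 species1", "Genus1 species2", "Genus2 species3"),
  site  = c("A", "B", "C")  # hypothetical extra column
)

result <- df %>%
  separate(names, c("genus", "species"), sep = " ") %>%
  mutate(genus = str_trunc(genus, 2, ellipsis = ".")) %>%
  # unite() removes genus and species and writes the combined
  # value into a column called names.
  unite("names", genus, species, sep = " ")

result$names
#> [1] "G. species1" "G. species2" "G. species3"
```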

【Solution 2】:

A simple one-liner in base R:

dat <- read.table(text = "
    Genus1 species1
    Genus1 species2
    Genus2 species3
")

dat$V3 <- paste0(gsub("(?<=^.{1}).*", "", dat$V1, perl = TRUE), ". ", dat$V2)

dat

#       V1       V2          V3
# 1 Genus1 species1 G. species1
# 2 Genus1 species2 G. species2
# 3 Genus2 species3 G. species3

Because I misunderstood the OP, here is a new one-liner solution (still just base R):

dat <- read.table(text = "
    Genus1 species1
    Genus1 species2
    Genus2 species3
")

dat$V3 <- paste0(dat$V1, " ", dat$V2)
dat <- dat[, 3, drop = FALSE]
dat

#                V3
# 1 Genus1 species1
# 2 Genus1 species2
# 3 Genus2 species3

#One-liner below:
dat$V3 <- gsub("(?<=^.{1})[A-Za-z0-9]+(?= )", ".", dat$V3, perl = TRUE)

dat
#            V3
# 1 G. species1
# 2 G. species2
# 3 G. species3

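【Discussion】:

I find regular expressions hard to read, which is why I try to avoid them. Also, this solution won't work if the genus and species names are in the same column, which is quite common in biological data.
I think I misunderstood your post. I see now that what I assumed were separate columns is actually a space-separated string in a single column. Given that, please see the new solution edited into my answer. Unfortunately, I'm afraid it still uses a regular expression.
Yes, now it probably solves the problem, but I still prefer answers without regex. Anyway, thanks a lot for your help!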
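
For readers who share the preference for avoiding regular expressions, a base R sketch along the following lines might do. It assumes every name contains exactly one genus word followed by a space; `fixed = TRUE` makes regexpr() search for a literal space rather than a pattern. The data frame and the variable names are illustrative only.

```r
dat <- data.frame(
  names = c("Genus1 species1", "Genus1 species2", "Genus2 species3"),
  stringsAsFactors = FALSE
)

# Position of the first literal space in each name (no regex involved).
space_pos <- as.integer(regexpr(" ", dat$names, fixed = TRUE))

# First letter of the genus, a dot and a space, then everything after the space.
dat$names <- paste0(
  substr(dat$names, 1, 1),
  ". ",
  substring(dat$names, space_pos + 1)
)

dat$names
#> [1] "G. species1" "G. species2" "G. species3"
```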