【Question title】: Converting multiple text values in one column into separate columns
【Posted】: 2022-01-14 04:37:19
【Question】:

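I have a large df from a survey platform. Each column holds one question (as the header) together with all the potential answers. I am trying to split each answer column into a set of new columns (with the answer names as the new column names).

After that, I want each new column to indicate (with a "1") whether that value appears in the original column, so the data is easier to work with.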

     df <- data.frame("name" = c("John", "mark", "bell", "elsa"),
                      "what do you like to eat" = c("apple", "fries apple", "peach", "bread"))

Original data

name  what.do.you.like.to.eat
John  apple
mark  fries apple
bell  peach
elsa  bread

I am using the code below, which works, but I am sure there must be a more efficient/simpler way, since I have more than 50 such columns.

    library(tidyr)  # separate() and the %>% pipe

    df <- df %>%
      separate(what.do.you.like.to.eat, c("apple", "fries", "peach", "bread", NA), remove = FALSE)
    # Blank out the new columns, then write "1" where the answer text contains the keyword
    df[, 3:6] <- ""
    {
      df[, 3] <- with(df, ifelse(grepl("apple", df$what.do.you.like.to.eat, ignore.case = TRUE),
                                 paste("1", df[, 3]),
                                 paste("", df[, 3])))
      df[, 4] <- with(df, ifelse(grepl("fries", df$what.do.you.like.to.eat, ignore.case = TRUE),
                                 paste("1", df[, 4]),
                                 paste("", df[, 4])))
      df[, 5] <- with(df, ifelse(grepl("peach", df$what.do.you.like.to.eat, ignore.case = TRUE),
                                 paste("1", df[, 5]),
                                 paste("", df[, 5])))
      df[, 6] <- with(df, ifelse(grepl("bread", df$what.do.you.like.to.eat, ignore.case = TRUE),
                                 paste("1", df[, 6]),
                                 paste("", df[, 6])))
    }

Desired output

name  what.do.you.like.to.eat  apple  fries  peach  bread
John  apple                    1
mark  fries apple              1      1
bell  peach                                  1
elsa  bread                                         1

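【Comments】:

  • Honestly, I find this hard to follow; it is a bit vague. Could you provide a data sample and a sample of the expected result?
  • Just did. I had posted it without the example by mistake.

Tags: r dataframe nlp tidyr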


【Solution 1】:

OK, I have put something together; let me know whether it works for you:

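library(stringr)  # str_split()

my_df <- data.frame("name" = c("John","mark","bell","elsa"),
                "what do you like to eat" = c("apple","fries apple","peach","bread"),
                stringsAsFactors = FALSE)
# Collect every distinct word that appears in the answers
my_var <- unique(sort(str_split(string = my_df$what.do.you.like.to.eat, pattern = " ", simplify = TRUE)))
# Drop the empty strings that simplify = TRUE pads shorter answers with
my_pos <- which(my_var == "")
if (length(my_pos)) {
  my_var <- my_var[-my_pos]
}
# Add one NA-filled column per word
my_col <- c(colnames(my_df), my_var)
my_miss <- setdiff(my_col, colnames(my_df))
my_df[my_miss] <- NA
# Set column x to 1 in the rows where the text in column y contains that column's name
my_f <- function(x, y) {
  my_var <- grep(pattern = colnames(my_df)[x], x = my_df[, y])
  if (length(my_var)) {
    my_df[my_var, x] <<- 1
  }
}
# invisible() suppresses the list of return values that lapply() would otherwise print
invisible(lapply(3:ncol(my_df), function(x) my_f(x, 2)))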

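For free-text answers, you can change the first part to this, keeping only the words that appear in a known list of answers: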

library(stringr)  # str_split()

my_df <- data.frame("name" = c("John","mark","bell","elsa"),
                "what do you like to eat" = c("i like apple","i love fries apple","i'm kind of peach","bread all the way"),
                stringsAsFactors = FALSE)
my_var <- unique(sort(str_split(string = my_df$what.do.you.like.to.eat, pattern = " ", simplify = TRUE)))
# Keep only the words that are known answers
my_food <- c("apple", "fries", "bread", "peach")
my_var <- my_var[which(my_var %in% my_food)]
my_pos <- which(my_var == "")
if (length(my_pos)) {
  my_var <- my_var[-my_pos]
}
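
The keyword-list idea can also be written without the word-splitting step. The following is only a minimal base R sketch, not the answerer's code: it assumes the list of answers (`my_food`) is supplied by hand, uses made-up free-text answers modelled on the ones in this thread, and matches whole words case-insensitively, so "Bread" counts but "apples" would not.

```r
# Hypothetical free-text answers, modelled on the examples in this thread
my_df <- data.frame(
  name = c("John", "mark", "bell", "elsa"),
  what.do.you.like.to.eat = c("I like apple", "i love fries apple",
                              "i'm kind of peach", "Bread all the way"),
  stringsAsFactors = FALSE
)

# Assumed list of known answers; it has to be maintained by hand
my_food <- c("apple", "fries", "peach", "bread")

# One 0/1 column per keyword; \\b keeps "apple" from matching inside "pineapple"
for (food in my_food) {
  pattern <- paste0("\\b", food, "\\b")
  my_df[[food]] <- as.integer(grepl(pattern, my_df$what.do.you.like.to.eat,
                                    ignore.case = TRUE, perl = TRUE))
}
```

If plural or inflected forms should count as well, the pattern would need to be loosened, for example by dropping the trailing `\\b`.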

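【Comments】:

  • This example works well, thank you. The problem is that in the original df the answers are mostly varied strings such as "I like eating apples", "I really like bread", etc. Is there a way to define my_var so that it can detect such strings, or do they have to be added to this variable manually?
【Solution 2】:

You can use a purrr map function (map_dfc below) to iterate over your vector of answers and check, for each answer, whether it is present in the string.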

library(tidyverse)

df <- data.frame(
  name = c("John", "mark", "bell", "elsa"),
  "what do you like to eat" = c("apple", "fries apple", "peach", "bread")
)

ans <- c("apple", "fries", "peach", "bread")

map_dfc(ans,~ transmute(df, !!sym(.x) := str_detect(what.do.you.like.to.eat, .x))) %>%
  bind_cols(df, .)
#>   name what.do.you.like.to.eat apple fries peach bread
#> 1 John                   apple  TRUE FALSE FALSE FALSE
#> 2 mark             fries apple  TRUE  TRUE FALSE FALSE
#> 3 bell                   peach FALSE FALSE  TRUE FALSE
#> 4 elsa                   bread FALSE FALSE FALSE  TRUE
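
The question asked for "1" markers rather than TRUE/FALSE. A minimal sketch of one way to get integer 0/1 columns from the same `str_detect()` check follows; it assigns the new columns with base R instead of `map_dfc()`, and the name `out` is just an illustrative choice.

```r
library(stringr)

df <- data.frame(
  name = c("John", "mark", "bell", "elsa"),
  "what do you like to eat" = c("apple", "fries apple", "peach", "bread")
)

ans <- c("apple", "fries", "peach", "bread")

# One integer column per answer: 1 if the answer occurs in the text, 0 otherwise
out <- df
out[ans] <- lapply(ans, function(a) {
  as.integer(str_detect(df$what.do.you.like.to.eat, a))
})
```

If blanks are preferred over zeros, the integers could be turned into strings afterwards, for example with `ifelse(x == 1, "1", "")` on each new column.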

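【Comments】:

  • Thanks, this works well and also works on strings. Is it possible to bind the resulting columns at a specific position in the data frame with the bind_cols function? (Something like: dplyr::relocate(ans, .after = name))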
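
As far as I know, `bind_cols()` itself only appends columns, so one option is to bind first and move the columns afterwards. A minimal sketch of the `relocate()` idea from the comment, assuming dplyr 1.0.0 or later; `res` is a hand-written stand-in for the result of Solution 2, and `all_of()` is used because `ans` is a character vector of column names rather than a column.

```r
library(dplyr)

ans <- c("apple", "fries", "peach", "bread")

# Stand-in for the data frame produced by Solution 2
res <- data.frame(
  name = c("John", "mark", "bell", "elsa"),
  what.do.you.like.to.eat = c("apple", "fries apple", "peach", "bread"),
  apple = c(TRUE, TRUE, FALSE, FALSE),
  fries = c(FALSE, TRUE, FALSE, FALSE),
  peach = c(FALSE, FALSE, TRUE, FALSE),
  bread = c(FALSE, FALSE, FALSE, TRUE)
)

# Move the indicator columns so they sit directly after `name`
moved <- relocate(res, all_of(ans), .after = name)
```

The same call can be appended to the end of the pipeline in Solution 2 with `%>%`.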