【Question title】: Expand Rows and Add Columns in Data Frame Based On Another Data Frame
【Posted】: 2018-08-04 03:50:40
【Question description】:

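Overview

Each row in team.df contains one NBA team. Each data frame in list.of.all.stars contains multiple rows, one per all-star player associated with that NBA team.

Using the apply() family of functions, how can I expand the rows in team.df to match the number of all-star players on each team and merge the columns from list.of.all.stars into the final output?

I am also completely open to non-apply() methods; I only mention apply() as an example because I would like to avoid writing a for loop.

Here is my desired output: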

#   Team_Name Team_Location         Player Captain
# 1 Cavaliers Cleveland, OH   LeBron James    TRUE
# 2 Cavaliers Cleveland, OH     Kevin Love   FALSE
# 3  Warriors   Oakland, CA  Stephen Curry    TRUE
# 4  Warriors   Oakland, CA   Kevin Durant   FALSE
# 5  Warriors   Oakland, CA  Klay Thompson   FALSE
# 6  Warriors   Oakland, CA Draymond Green   FALSE

Reproducible example

# create data frame 
# about team information
team.df <-
  data.frame(
    Team_Name       = c( "Cavaliers", "Warriors" )
    , Team_Location = c( "Cleveland, OH", "Oakland, CA")
    , stringsAsFactors = FALSE
  )

# create list about
# all stars on each team
list.of.all.stars <-
  list( 
    data.frame(
      Player = c( "LeBron James", "Kevin Love" )
      , Captain = c( TRUE, FALSE )
      , stringsAsFactors = FALSE
    )
    , data.frame( 
      Player = c( "Stephen Curry", "Kevin Durant"
                  , "Klay Thompson", "Draymond Green"
      )
      , Captain = c( TRUE, FALSE, FALSE, FALSE )
      , stringsAsFactors = FALSE
    )
  )

Non-apply() approach

# cbind each data frame within the list.of.all.stars
# to its corresponding row in team.df
team.and.all.stars.list.of.df <-
  list(
    cbind(
      team.df[ 1, ]
      , list.of.all.stars[[1]]
    )
    ,   cbind(
      team.df[ 2, ]
      , list.of.all.stars[[2]]
    )
  )
# Warning messages:
#   1: In data.frame(..., check.names = FALSE) :
#   row names were found from a short variable and have been discarded
# 2: In data.frame(..., check.names = FALSE) :
#   row names were found from a short variable and have been discarded

# collapse each list
# into data frame
final.df <-
  data.frame(
    do.call(
      what = "rbind"
      , args = team.and.all.stars.list.of.df
    )
    , stringsAsFactors = FALSE
  )
# view final output
final.df
# Team_Name Team_Location         Player Captain
# 1 Cavaliers Cleveland, OH   LeBron James    TRUE
# 2 Cavaliers Cleveland, OH     Kevin Love   FALSE
# 3  Warriors   Oakland, CA  Stephen Curry    TRUE
# 4  Warriors   Oakland, CA   Kevin Durant   FALSE
# 5  Warriors   Oakland, CA  Klay Thompson   FALSE
# 6  Warriors   Oakland, CA Draymond Green   FALSE

# end of script #

Failed mapply() attempt

# Hoping to Apply A Function
# using a data frame and
# a list of data frames
mapply.method <-
  mapply(
    FUN = function( x, y )
      cbind.data.frame(
        x
        , y
        , stringsAsFactors = FALSE
      )
    , team.df
    , list.of.all.stars
  )

# view results
mapply.method
#         Team_Name   Team_Location
# x       Character,2 Character,4  
# Player  Character,2 Character,4  
# Captain Logical,2   Logical,4 

# end of script #

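【Comments】:

  • Do you need to use an apply function? Do you have control over the structure of the list of data.frames?
  • @SymbolixAU I do, but I'm open to learning new approaches! The order of the objects in list.of.all.stars corresponds to the order of the rows in team.df. Does that answer your question?

Tags: r list dataframe apply mapply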


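【Solution 1】:

Regarding the OP's approach of passing 'team.df' to Map/mapply: 'team.df' is a data.frame, which is a list of columns, so the basic unit of iteration is a column (a vector). Map/mapply therefore loops over the columns rather than over the whole dataset or its rows (which is what the desired output needs). To prevent that, wrap it in list() so that it is a single unit, which is then recycled against each list element of 'list.of.all.stars'. This binds the whole of 'team.df' to every element, so it only shows how the looping works and does not give the desired output: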

do.call(rbind, Map(cbind, list(team.df), list.of.all.stars))
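
To see why Map/mapply walk over columns here, the following minimal sketch inspects 'team.df' as a list. It only uses base R and the question's data; the helper name `seen` is introduced for illustration.

```r
# Same team data as in the question
team.df <- data.frame(
  Team_Name     = c("Cavaliers", "Warriors"),
  Team_Location = c("Cleveland, OH", "Oakland, CA"),
  stringsAsFactors = FALSE
)

# A data frame is a list whose elements are its columns
is.list(team.df)
length(team.df)   # counts columns, not rows

# Map() therefore visits one column per iteration
seen <- Map(function(x) class(x), team.df)
names(seen)

# Wrapped in list(), the whole data frame becomes a single element
length(list(team.df))
```

Because the iteration unit is the column, pairing rows with list elements needs either a split by row or an index over the rows.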

Based on the expected output, each row of 'team.df' should be paired with the corresponding list element of 'list.of.all.stars'. In that case, split 'team.df' by row and do the cbind

res <- do.call(rbind, Map(cbind,  split(team.df, seq_len(nrow(team.df))), list.of.all.stars))
row.names(res) <- NULL
res
#   Team_Name Team_Location         Player Captain
#1 Cavaliers Cleveland, OH   LeBron James    TRUE
#2 Cavaliers Cleveland, OH     Kevin Love   FALSE
#3  Warriors   Oakland, CA  Stephen Curry    TRUE
#4  Warriors   Oakland, CA   Kevin Durant   FALSE
#5  Warriors   Oakland, CA  Klay Thompson   FALSE
#6  Warriors   Oakland, CA Draymond Green   FALSE
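
The same row-to-element pairing can also be sketched with mapply() directly, which is what the question attempted. This is an illustrative variant rather than the answer's own code: it iterates over row indices instead of over 'team.df' itself, and it repeats each team row once per player so that cbind() has nothing to recycle. The names `pieces` and `res` are chosen for this example.

```r
team.df <- data.frame(
  Team_Name     = c("Cavaliers", "Warriors"),
  Team_Location = c("Cleveland, OH", "Oakland, CA"),
  stringsAsFactors = FALSE
)
list.of.all.stars <- list(
  data.frame(Player = c("LeBron James", "Kevin Love"),
             Captain = c(TRUE, FALSE),
             stringsAsFactors = FALSE),
  data.frame(Player = c("Stephen Curry", "Kevin Durant",
                        "Klay Thompson", "Draymond Green"),
             Captain = c(TRUE, FALSE, FALSE, FALSE),
             stringsAsFactors = FALSE)
)

pieces <- mapply(
  FUN = function(i, stars) {
    # repeat team row i once per player, then bind the player columns
    cbind(team.df[rep(i, nrow(stars)), ], stars)
  },
  seq_len(nrow(team.df)),   # one index per team row
  list.of.all.stars,        # matching list element
  SIMPLIFY = FALSE          # keep a list of data frames, not a matrix
)

res <- do.call(rbind, pieces)
row.names(res) <- NULL
res
```

SIMPLIFY = FALSE matters here: with the default, mapply() tries to simplify the results into a matrix, which is how the question's attempt ended up with the "Character,2" summary.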

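We can also do this in the tidyverse. After grouping by all the columns in 'team.df', nest creates a list column 'data' (of length 2); in mutate, assign 'list.of.all.stars' to 'data', and then unnest the list column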

library(tidyverse)
team.df %>% 
      group_by_all() %>%
      nest %>% 
      mutate(data = list.of.all.stars) %>% 
      unnest
# A tibble: 6 x 4
#  Team_Name Team_Location Player         Captain
#  <chr>     <chr>         <chr>          <lgl>  
# 1 Cavaliers Cleveland, OH LeBron James   T      
# 2 Cavaliers Cleveland, OH Kevin Love     F      
# 3 Warriors  Oakland, CA   Stephen Curry  T      
# 4 Warriors  Oakland, CA   Kevin Durant   F      
# 5 Warriors  Oakland, CA   Klay Thompson  F      
# 6 Warriors  Oakland, CA   Draymond Green F      

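【Comments】:

  • I think your first approach gives an incorrect result. For example, Stephen Curry is in the Cleveland row (given the edit to the question and the desired output)
  • @SymbolixAU Yes, I updated the description. It is only there to show how the looping works.
  • Yes, I see that now.
  • @akrun Thank you for the explanation and the helpful examples in both base R and the tidyverse. Thanks for helping me better understand mapply.
【Solution 2】:

Given the edit to the question and the desired output, I would do this purely in data.table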

library(data.table)

## combine the list of all stars into one data.table
## creating an 'id' column 
dt_players <- rbindlist(list.of.all.stars, idcol = T)

## we can keep/use the row names as the order of the data 
## is consistent with the list elements 
dt_teams <- as.data.table(team.df, keep.rownames = T)
dt_teams[, rn := as.integer(rn)]

## use a join to combine the data to get the desired result. 
dt_teams[
  dt_players
  , on = c(rn = ".id")
]

#    rn Team_Name Team_Location         Player Captain
# 1:  1 Cavaliers Cleveland, OH   LeBron James    TRUE
# 2:  1 Cavaliers Cleveland, OH     Kevin Love   FALSE
# 3:  2  Warriors   Oakland, CA  Stephen Curry    TRUE
# 4:  2  Warriors   Oakland, CA   Kevin Durant   FALSE
# 5:  2  Warriors   Oakland, CA  Klay Thompson   FALSE
# 6:  2  Warriors   Oakland, CA Draymond Green   FALSE
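
For readers without data.table, the same id-then-join idea can be sketched in base R with merge(). This is an illustrative equivalent, not part of the original answer; the column names rn and .id simply mirror the ones used above, and the object names `players`, `teams` and `merged` are chosen for this example.

```r
team.df <- data.frame(
  Team_Name     = c("Cavaliers", "Warriors"),
  Team_Location = c("Cleveland, OH", "Oakland, CA"),
  stringsAsFactors = FALSE
)
list.of.all.stars <- list(
  data.frame(Player = c("LeBron James", "Kevin Love"),
             Captain = c(TRUE, FALSE),
             stringsAsFactors = FALSE),
  data.frame(Player = c("Stephen Curry", "Kevin Durant",
                        "Klay Thompson", "Draymond Green"),
             Captain = c(TRUE, FALSE, FALSE, FALSE),
             stringsAsFactors = FALSE)
)

# Stack the players, tagging each row with the position of its list element
players <- do.call(
  rbind,
  Map(function(id, stars) cbind(.id = id, stars),
      seq_along(list.of.all.stars), list.of.all.stars)
)

# Give each team its row position as a key
teams <- cbind(rn = seq_len(nrow(team.df)), team.df)

# Join on the shared position
merged <- merge(teams, players, by.x = "rn", by.y = ".id")
merged
```

As with the data.table version, this relies on the list order matching the row order of team.df.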

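Old answer

This approach uses data.table to do the actual work, but I am giving you an sapply method to get the number of rows by which to expand the team.df data frame.

It also assumes that the order of the teams in team.df matches the order of the players in list.of.all.stars (i.e. the rows of the data.frame correspond to the list elements)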

library(data.table)

## grab the rows of each data.frame
reps <- sapply(list.of.all.stars, nrow)

## replicate the rows of the data.frame
setDT(team.df)[rep(1:.N, reps), ]

#    Team_Name Team_Location
# 1: Cavaliers Cleveland, OH
# 2: Cavaliers Cleveland, OH
# 3:  Warriors   Oakland, CA
# 4:  Warriors   Oakland, CA
# 5:  Warriors   Oakland, CA
# 6:  Warriors   Oakland, CA

If you would rather not use data.table, the same approach can be applied to a data.frame

team.df[rep(row.names(team.df), reps), ]
#     Team_Name Team_Location
# 1   Cavaliers Cleveland, OH
# 1.1 Cavaliers Cleveland, OH
# 2    Warriors   Oakland, CA
# 2.1  Warriors   Oakland, CA
# 2.2  Warriors   Oakland, CA
# 2.3  Warriors   Oakland, CA
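
The expansion above only repeats the team rows. A minimal base R sketch of how it could be completed is to bind the stacked player data frames alongside the repeated rows. This assumes, as stated earlier, that the list order matches the row order of team.df; the names `expanded` and `result` are illustrative.

```r
team.df <- data.frame(
  Team_Name     = c("Cavaliers", "Warriors"),
  Team_Location = c("Cleveland, OH", "Oakland, CA"),
  stringsAsFactors = FALSE
)
list.of.all.stars <- list(
  data.frame(Player = c("LeBron James", "Kevin Love"),
             Captain = c(TRUE, FALSE),
             stringsAsFactors = FALSE),
  data.frame(Player = c("Stephen Curry", "Kevin Durant",
                        "Klay Thompson", "Draymond Green"),
             Captain = c(TRUE, FALSE, FALSE, FALSE),
             stringsAsFactors = FALSE)
)

# number of players per team, in list order
reps <- vapply(list.of.all.stars, nrow, integer(1))

# repeat each team row once per player
expanded <- team.df[rep(seq_len(nrow(team.df)), reps), ]

# stack the player data frames and bind them column-wise
result <- cbind(expanded, do.call(rbind, list.of.all.stars))
rownames(result) <- NULL
result
```

vapply() is used instead of sapply() only so that the result is guaranteed to be an integer vector.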

Or use a similar concept, but all within lapply

lst <- lapply(seq_along(list.of.all.stars), function(x) {
  df <- team.df[x, ]
  df[rep(row.names(df), nrow(list.of.all.stars[[x]])), ]
})

do.call(rbind, lst)
#     Team_Name Team_Location
# 1   Cavaliers Cleveland, OH
# 1.1 Cavaliers Cleveland, OH
# 2    Warriors   Oakland, CA
# 2.1  Warriors   Oakland, CA
# 2.2  Warriors   Oakland, CA
# 2.3  Warriors   Oakland, CA

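【Comments】:

  • Thanks for your answer, @SymbolixAU! I have to start learning data.table. You are the second person in less than a week to condense 10+ lines of code into a single line. Thanks for your help, and thanks for adding the example of how to expand a data.frame based on the earlier version of the question.
  • @aspiringurbandatascientist It is a very useful package; I use it every day at work. On smaller data sets you won't see much difference between it and the tidyverse, so it comes down to personal preference for syntax. On 10+ million rows, however, it starts to pay off.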