【Question title】: Merging the column names based on the values to create another column
【Posted】: 2021-05-30 06:40:54
【Question description】:

I have a movie dataset that lists various genres and, for each movie, whether it belongs to each genre. For example

Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0

I would like to get a new column containing the names of the genres each movie belongs to, separated by a space or a comma, for example

Index  New column
0    Comedy Drama Family
1    Comedy Family
2    Drama
3    Comedy
4    Comedy Drama
5    Crime Drama

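Please share code in R or Python. Thanks for your help.

【Question comments】:

If you are satisfied with an answer, please don't forget to accept it: click the check mark (✓) next to the answer to toggle it from greyed out to filled in.
Maybe they just like the question

Tags: python r pandas dataframe data-preprocessing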

【Solution 1】:

In Python, using matrix multiplication:

df.dot(df.columns + " ")

which gives

Index
0    Comedy Drama Family
1          Comedy Family
2                  Drama
3                 Comedy
4           Comedy Drama
5            Crime Drama
6                 Comedy

To make it more general:

sep = ", "
df.dot(df.columns + sep).str.rstrip(sep)

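That is, append the separator to each column name, perform the matrix-vector multiplication, and then strip the separator from the right end of each result.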
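Why a dot product produces strings may not be obvious: multiplying a string by 0 gives an empty string, multiplying it by 1 gives the string itself, and summing the products concatenates them. The pure-Python sketch below imitates that mechanism for a single row without pandas. The names `dot_labels` and `strip_trailing` are hypothetical helpers introduced here, not part of the answer. It also illustrates a caveat: `rstrip(sep)` removes any trailing characters contained in `sep` rather than the exact suffix, which is harmless for `", "` but can eat into a genre name with a separator such as `" and "`.

```python
# Illustrative data: the seven genre columns and one row of 0/1 flags.
columns = ["Biography", "Comedy", "Crime", "Documentary", "Drama", "Family", "Fantasy"]
row = [0, 1, 0, 0, 1, 1, 0]


def dot_labels(flags, names, sep):
    """Imitate one row of df.dot(df.columns + sep): concatenate flag * (name + sep)."""
    return "".join(flag * (name + sep) for flag, name in zip(flags, names))


def strip_trailing(text, sep):
    """Remove exactly one trailing separator if present (hypothetical helper)."""
    return text[: -len(sep)] if sep and text.endswith(sep) else text


joined = dot_labels(row, columns, ", ")
print(repr(joined))                    # 'Comedy, Drama, Family, '
print(strip_trailing(joined, ", "))    # Comedy, Drama, Family

# rstrip treats its argument as a set of characters, not as a suffix.
risky = dot_labels([0, 1, 0, 0, 1, 0, 0], columns, " and ")
print(risky.rstrip(" and "))           # Comedy and Dram
print(strip_trailing(risky, " and "))  # Comedy and Drama
```

With separators made only of punctuation and spaces, as in the answer, `rstrip` and the suffix-based helper give the same result.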

【Comments】:

This is brilliant!

【Solution 2】:

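library(magrittr)  # provides %>%

# Assumes df contains only the 0/1 genre columns (no Index column).
# Note: apply() returns a list here only because the rows differ in their number of 1s.
df %>%
  apply(1, function(x){which(x == 1)}) %>% 
  lapply(function(x){
    paste(names(x), collapse = " ")
    }) %>%
  unlist() -> df$your_new_column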


【Solution 3】:

my.movies <- read.table(text = 'Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0', header = TRUE)
library(tidyverse)
my.movies %>%
  pivot_longer(!Index, names_to = 'genre') %>%
  filter(value !=0) %>%
  group_by(Index) %>%
  summarise(genre = toString(genre))
#> # A tibble: 7 x 2
#>   Index genre                
#>   <int> <chr>                
#> 1     0 Comedy, Drama, Family
#> 2     1 Comedy, Family       
#> 3     2 Drama                
#> 4     3 Comedy               
#> 5     4 Comedy, Drama        
#> 6     5 Crime, Drama         
#> 7     6 Comedy

Created on 2021-05-30 by the reprex package (v2.0.0)


【Solution 4】:

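Base R:

# Drop the Index column first; otherwise the row with Index == 1 would pick up "Index" as a genre.
df$new_col <- apply(df[-1], 1, function(x) paste0(names(x)[x == 1], collapse = ' '))

dplyr:

library(dplyr)

# cur_data() is soft-deprecated since dplyr 1.1.0 in favour of pick().
df %>%
  group_by(Index) %>%
  summarise(new_col = paste0(names(.[-1])[cur_data() == 1], collapse = ' '))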

#  Index new_col            
#  <int> <chr>              
#1     0 Comedy Drama Family
#2     1 Comedy Family      
#3     2 Drama              
#4     3 Comedy             
#5     4 Comedy Drama       
#6     5 Crime Drama        
#7     6 Comedy

Data

df <- structure(list(Index = 0:6, Biography = c(0L, 0L, 0L, 0L, 0L, 
0L, 0L), Comedy = c(1L, 1L, 0L, 1L, 1L, 0L, 1L), Crime = c(0L, 
0L, 0L, 0L, 0L, 1L, 0L), Documentary = c(0L, 0L, 0L, 0L, 0L, 
0L, 0L), Drama = c(1L, 0L, 1L, 0L, 1L, 1L, 0L), Family = c(1L, 
1L, 0L, 0L, 0L, 0L, 0L), Fantasy = c(0L, 0L, 0L, 0L, 0L, 0L, 
0L)), class = "data.frame", row.names = c(NA, -7L))


【Solution 5】:

Basic Python code:

import pandas as pd
df = pd.read_csv('test.csv')

def check_genre(row):
    # Column names are case-sensitive and must match the headers in the data.
    s = ""
    if row['Biography'] == 1:
        s = s + ' Biography'
    if row['Comedy'] == 1:
        s = s + ' Comedy'
    if row['Crime'] == 1:
        s = s + ' Crime'
    if row['Documentary'] == 1:
        s = s + ' Documentary'
    if row['Drama'] == 1:
        s = s + ' Drama'
    if row['Family'] == 1:
        s = s + ' Family'
    if row['Fantasy'] == 1:
        s = s + ' Fantasy'

    # Drop the leading space left by the first appended genre.
    return s.strip()

df['genre'] = df.apply(check_genre, axis=1)

print(df)


【Solution 6】:

In pandas, you can take, for each row, the index labels of the values equal to 1 and join them into a string:

df.apply(lambda row: " ".join(row[row == 1].index), axis=1)

# Index
# 0    Comedy Drama Family
# 1          Comedy Family
# 2                  Drama
# 3                 Comedy
# 4           Comedy Drama
# 5            Crime Drama
# 6                 Comedy
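The output above has `Index` as the index name, which suggests the frame was indexed by `Index` before the call. The sketch below, using a small made-up three-row frame rather than the full dataset, shows what can happen when `Index` is instead left as an ordinary column: the row whose `Index` value is 1 matches the `row == 1` test and picks up the label "Index". Calling `set_index` first is one assumed way to avoid this.

```python
import pandas as pd

# Made-up three-row excerpt; Index is deliberately kept as a regular column.
df = pd.DataFrame({
    "Index": [0, 1, 2],
    "Comedy": [1, 1, 0],
    "Drama": [1, 0, 1],
    "Family": [1, 1, 0],
})

# Naive call: the Index column takes part in the row == 1 comparison.
naive = df.apply(lambda row: " ".join(row[row == 1].index), axis=1)

# Moving Index into the index leaves only the genre columns in each row.
fixed = df.set_index("Index").apply(
    lambda row: " ".join(row[row == 1].index), axis=1
)

print(naive)
print(fixed)
```

Selecting the genre columns explicitly before calling `apply` would serve the same purpose.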


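【Solution 7】:

Posting a response in R/dplyr

Assume `main_df` is the data frame shown in the first table of the question. Pivot the data frame longer so that all genre columns end up in tidy format, group by index (each index is one movie), and collapse the genre column with paste.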

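library(dplyr)
library(tidyr)

# The column name `index` follows this answer; the question's table spells it `Index`.
main_df %>%
  pivot_longer(cols = -index) %>%
  filter(value > 0) %>% # keep rows where the movie belongs to the genre, i.e. value is 1
  group_by(index) %>%
  mutate(new_genre = paste(name, collapse = ",")) %>%
  ungroup() %>%
  distinct(index, new_genre) -> main_df2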

# if you want to merge back to the original data frame use left_join

left_join(main_df, main_df2,by="index")
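Since the question also asks for Python, the same reshape, filter, group and join steps can be sketched in pandas. This is an assumed analogue and not part of the answer: `melt` stands in for `pivot_longer`, `merge(..., how="left")` stands in for `left_join`, and the frame is rebuilt inline from the question's table with the column spelled `Index`.

```python
import pandas as pd

# Data copied from the question; Index is kept as an ordinary column here.
df = pd.DataFrame({
    "Index": range(7),
    "Biography": [0, 0, 0, 0, 0, 0, 0],
    "Comedy": [1, 1, 0, 1, 1, 0, 1],
    "Crime": [0, 0, 0, 0, 0, 1, 0],
    "Documentary": [0, 0, 0, 0, 0, 0, 0],
    "Drama": [1, 0, 1, 0, 1, 1, 0],
    "Family": [1, 1, 0, 0, 0, 0, 0],
    "Fantasy": [0, 0, 0, 0, 0, 0, 0],
})

# One row per (movie, genre) pair, with the 0/1 flag in the "value" column.
long_df = df.melt(id_vars="Index", var_name="genre")

genres = (
    long_df[long_df["value"] > 0]   # keep only genres the movie belongs to
    .groupby("Index")["genre"]
    .agg(",".join)                  # collapse genre names per movie
    .rename("new_genre")
    .reset_index()
)

# Attach the collapsed genres back onto the original frame.
result = df.merge(genres, on="Index", how="left")
print(result[["Index", "new_genre"]])
```

A movie with no genre at all would drop out of `genres` and come back from the left merge with a missing value in `new_genre`.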


【Solution 8】:

Reduced to a one-liner:

unstack
filter
aggregate

import io

import pandas as pd

df = pd.read_csv(io.StringIO("""Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0"""), sep=r"\s+").set_index("Index")

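(
    df.unstack()                   # long Series indexed by (genre, Index)
    .to_frame()
    .loc[lambda d: d[0].eq(1)]     # keep only the entries equal to 1
    .reset_index()
    .groupby("Index")
    .agg({"level_0": " ".join})    # join the genre names per movie
)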

Index	level_0
0	Comedy Drama Family
1	Comedy Family
2	Drama
3	Comedy
4	Comedy Drama
5	Crime Drama
6	Comedy
