【Question title】: Merging the column names based on the values to create another column
【Posted】: 2021-05-30 06:40:54
【Question】:

I have a movie dataset with one column per genre, each holding 1 if the movie belongs to that genre and 0 otherwise. For example:

Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
0   0   1   0   0   1   1   0
1   0   1   0   0   0   1   0
2   0   0   0   0   1   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   1   0   0
5   0   0   1   0   1   0   0
6   0   1   0   0   0   0   0

I want a new column that lists the names of the genres each movie belongs to, separated by spaces or commas, for example:

Index  New column
0    Comedy Drama Family
1    Comedy Family
2    Drama
3    Comedy
4    Comedy Drama
5    Crime Drama

Please share code in R or Python. Thanks for your help.

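【Question discussion】:

  • If you are satisfied with an answer, please don't forget to accept it: click the check mark (✓) next to the answer to toggle it from greyed out to filled in.
  • Maybe they just liked the question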

Tags: python r pandas dataframe data-preprocessing


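【Solution 1】:

In Python, using matrix multiplication (with Index set as the DataFrame's index, so that only the 0/1 genre columns take part):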

df.dot(df.columns + " ")

which gives

Index
0    Comedy Drama Family
1          Comedy Family
2                  Drama
3                 Comedy
4           Comedy Drama
5            Crime Drama
6                 Comedy

To make it more general:
sep = ", "
df.dot(df.columns + sep).str.rstrip(sep)

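That is, append the separator to each column name, perform the matrix-vector multiplication, and then strip the trailing separator from the right end of each result.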
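A minimal, self-contained sketch of this trick on the question's data may help show why it works. It assumes `Index` is the DataFrame index and that every genre column holds only integer 0/1 values; the variable name `genres` is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question; "Index" is the row index so that
# only the 0/1 genre columns take part in the product.
df = pd.DataFrame(
    [
        [0, 1, 0, 0, 1, 1, 0],
        [0, 1, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 0, 0],
    ],
    columns=["Biography", "Comedy", "Crime", "Documentary",
             "Drama", "Family", "Fantasy"],
    index=pd.Index(range(7), name="Index"),
)

sep = ", "
# 1 * "Comedy, " is "Comedy, " and 0 * "Comedy, " is "", so each row's
# dot product concatenates the names of the columns that hold a 1.
genres = df.dot(df.columns + sep).str.rstrip(sep)
print(genres)
```

Note that `str.rstrip(sep)` removes any trailing characters that appear in `sep` rather than the exact suffix, which is harmless here but would also trim a genre name that itself ended in a comma or space.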

【Discussion】:

  • This is brilliant!
【Solution 2】:
df %>%
  apply(1, function(x){which(x == 1)}) %>% 
  lapply(function(x){
    paste(names(x), collapse = " ")
    }) %>%
  unlist() -> df$your_new_column


    【Solution 3】:
    my.movies <- read.table(text = 'Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
    0   0   1   0   0   1   1   0
    1   0   1   0   0   0   1   0
    2   0   0   0   0   1   0   0
    3   0   1   0   0   0   0   0
    4   0   1   0   0   1   0   0
    5   0   0   1   0   1   0   0
    6   0   1   0   0   0   0   0', header = T)
    library(tidyverse)
    my.movies %>%
      pivot_longer(!Index, names_to = 'genre') %>%
      filter(value !=0) %>%
      group_by(Index) %>%
      summarise(genre = toString(genre))
    #> # A tibble: 7 x 2
    #>   Index genre                
    #>   <int> <chr>                
    #> 1     0 Comedy, Drama, Family
    #> 2     1 Comedy, Family       
    #> 3     2 Drama                
    #> 4     3 Comedy               
    #> 5     4 Comedy, Drama        
    #> 6     5 Crime, Drama         
    #> 7     6 Comedy
    

    Created on 2021-05-30 by the reprex package (v2.0.0)


      【Solution 4】:

      Base R -

      df$new_col <- apply(df[-1], 1, function(x) paste0(names(x)[x == 1], collapse = ' '))
      

      dplyr-

      library(dplyr)
      
      df %>%
        group_by(Index) %>%
        summarise(new_col = paste0(names(.[-1])[cur_data() == 1], collapse = ' '))
      
      #  Index new_col            
      #  <int> <chr>              
      #1     0 Comedy Drama Family
      #2     1 Comedy Family      
      #3     2 Drama              
      #4     3 Comedy             
      #5     4 Comedy Drama       
      #6     5 Crime Drama        
      #7     6 Comedy             
      

      Data

      df <- structure(list(Index = 0:6, Biography = c(0L, 0L, 0L, 0L, 0L, 
      0L, 0L), Comedy = c(1L, 1L, 0L, 1L, 1L, 0L, 1L), Crime = c(0L, 
      0L, 0L, 0L, 0L, 1L, 0L), Documentary = c(0L, 0L, 0L, 0L, 0L, 
      0L, 0L), Drama = c(1L, 0L, 1L, 0L, 1L, 1L, 0L), Family = c(1L, 
      1L, 0L, 0L, 0L, 0L, 0L), Fantasy = c(0L, 0L, 0L, 0L, 0L, 0L, 
      0L)), class = "data.frame", row.names = c(NA, -7L))
      


        【Solution 5】:

        Basic Python code:

        import pandas as pd
        df = pd.read_csv('test.csv')
        
        def check_genre(row):
            # Column names must match the CSV header exactly,
            # i.e. capitalised as in the question's data.
            s = ""
            if row['Biography'] == 1:
                s = s + ' Biography'
            if row['Comedy'] == 1:
                s = s + ' Comedy'
            if row['Crime'] == 1:
                s = s + ' Crime'
            if row['Documentary'] == 1:
                s = s + ' Documentary'
            if row['Drama'] == 1:
                s = s + ' Drama'
            if row['Family'] == 1:
                s = s + ' Family'
            if row['Fantasy'] == 1:
                s = s + ' Fantasy'
        
            # Drop the leading space left by the first match.
            return s.strip()
        
        df['genre'] = df.apply(check_genre, axis=1)
        
        print(df)
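The hard-coded `if` chain above can also be written as a loop over the genre columns, so the code does not need to change when genres are added. The following is only a sketch under assumptions: the data is built inline instead of being read from `test.csv`, `Index` is an ordinary column, and the name `genre_cols` is made up for illustration.

```python
import pandas as pd

# Inline copy of the question's data (stands in for test.csv).
df = pd.DataFrame(
    [
        [0, 0, 1, 0, 0, 1, 1, 0],
        [1, 0, 1, 0, 0, 0, 1, 0],
        [2, 0, 0, 0, 0, 1, 0, 0],
        [3, 0, 1, 0, 0, 0, 0, 0],
        [4, 0, 1, 0, 0, 1, 0, 0],
        [5, 0, 0, 1, 0, 1, 0, 0],
        [6, 0, 1, 0, 0, 0, 0, 0],
    ],
    columns=["Index", "Biography", "Comedy", "Crime", "Documentary",
             "Drama", "Family", "Fantasy"],
)

# Every column except the identifier is treated as a 0/1 genre flag.
genre_cols = [c for c in df.columns if c != "Index"]

# For each row, keep the names of the columns whose flag equals 1.
df["genre"] = [
    " ".join(name for name, flag in zip(genre_cols, row) if flag == 1)
    for row in df[genre_cols].itertuples(index=False, name=None)
]

print(df[["Index", "genre"]])
```

Excluding `Index` explicitly matters here: otherwise the movie whose `Index` value is 1 would be reported as belonging to a genre called "Index".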
        


          【Solution 6】:

          In pandas, you can take the index labels of the values in each row that equal 1 and join them into a string:

          df.apply(lambda row: " ".join(row[row == 1].index), axis=1)
          
          # Index
          # 0    Comedy Drama Family
          # 1          Comedy Family
          # 2                  Drama
          # 3                 Comedy
          # 4           Comedy Drama
          # 5            Crime Drama
          # 6                 Comedy
          


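            【Solution 7】:

            Posting a response in R/dplyr

            Assuming `main_df` is the data frame shown in the first image. Pivot the data frame longer so that all the genre columns are in tidy format, then group_by the index, since each index value is one movie, and use paste to collapse the genre column.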

            main_df%>%
              pivot_longer(cols=-index)%>%
              filter(value>0)%>% # filter where movie is part of the genre i.e 1
              group_by(index)%>%
              mutate(new_genre = paste(name,collapse = ","))%>%
              ungroup()%>%
              distinct(index,new_genre)-> main_df2
            
            # if you want to merge back to the original data frame use left_join
            
            left_join(main_df, main_df2,by="index")
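For readers working in Python, roughly the same pipeline can be sketched in pandas, with `melt` standing in for `pivot_longer` and a left `merge` for `left_join`. This is an analogue written for illustration, not part of the answer above; the lowercase `index` column name follows the R code, and the data is the question's table.

```python
import pandas as pd

# The question's data, with the identifier column named "index" as in the R code.
main_df = pd.DataFrame(
    [
        [0, 0, 1, 0, 0, 1, 1, 0],
        [1, 0, 1, 0, 0, 0, 1, 0],
        [2, 0, 0, 0, 0, 1, 0, 0],
        [3, 0, 1, 0, 0, 0, 0, 0],
        [4, 0, 1, 0, 0, 1, 0, 0],
        [5, 0, 0, 1, 0, 1, 0, 0],
        [6, 0, 1, 0, 0, 0, 0, 0],
    ],
    columns=["index", "Biography", "Comedy", "Crime", "Documentary",
             "Drama", "Family", "Fantasy"],
)

# Long format: one row per (movie, genre) pair.
long_df = main_df.melt(id_vars="index", var_name="name", value_name="value")

# Keep the pairs where the movie belongs to the genre, then collapse per movie.
main_df2 = (
    long_df[long_df["value"] > 0]
    .groupby("index")["name"]
    .agg(",".join)
    .rename("new_genre")
    .reset_index()
)

# Merge back onto the original frame; a movie with no genre would get NaN.
merged = main_df.merge(main_df2, on="index", how="left")
print(merged[["index", "new_genre"]])
```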
            


              【Solution 8】:

              Reduced to a single expression:

              • unstack
              • filter
              • aggregate
              import io
              import pandas as pd
              
              df = pd.read_csv(io.StringIO("""Index Biography Comedy  Crime   Documentary Drama   Family  Fantasy
              0   0   1   0   0   1   1   0
              1   0   1   0   0   0   1   0
              2   0   0   0   0   1   0   0
              3   0   1   0   0   0   0   0
              4   0   1   0   0   1   0   0
              5   0   0   1   0   1   0   0
              6   0   1   0   0   0   0   0"""), sep=r"\s+").set_index("Index")
              
              df.unstack().to_frame().loc[lambda d: d[0].eq(1)].reset_index().groupby("Index").agg({"level_0":" ".join})
              
              Index level_0
              0 Comedy Drama Family
              1 Comedy Family
              2 Drama
              3 Comedy
              4 Comedy Drama
              5 Crime Drama
              6 Comedy
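The same three steps can be split into named intermediate results, which may be easier to follow than the chained expression. This is a sketch under the same assumptions as above (`Index` is the DataFrame index, values are 0/1); the names `long_df`, `genre` and `flag` are made up for illustration and replace the auto-generated `level_0` and `0` labels.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [0, 1, 0, 0, 1, 1, 0],
        [0, 1, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 0, 0],
    ],
    columns=["Biography", "Comedy", "Crime", "Documentary",
             "Drama", "Family", "Fantasy"],
    index=pd.Index(range(7), name="Index"),
)

# 1. Unstack: a Series indexed by (genre, Index), turned into a long table.
long_df = df.unstack().rename_axis(["genre", "Index"]).reset_index(name="flag")

# 2. Filter: keep only the (genre, movie) pairs flagged with 1.
present = long_df[long_df["flag"].eq(1)]

# 3. Aggregate: join the genre names of each movie with a space.
genres = present.groupby("Index")["genre"].agg(" ".join)
print(genres)
```

Because `unstack` walks the table column by column, the genre names within each movie come out in the original column order.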
