【Question title】: GROUP BY CSV columns in bash
【Posted】: 2020-06-07 22:19:22
【Question】:

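I am working with a .csv file in bash, and I need to sum the last value of each row according to the preceding fields. In other words, I need to group by the first three columns in Bash.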

Example input file:

Barcelona, Female, suspect, 2
Barcelona, Female, positive, 3
Barcelona, Female, positive, 2
Barcelona, Male, positive, 1
Barcelona, Female, suspect, 5
Madrid, Male, positive, 3
Madrid, Male, positive, 1
Barcelona, Male, positive, 4
Madrid, Female, suspect, 2

Example output file:

Barcelona, Female, suspect, 7
Barcelona, Female, positive, 5
Barcelona, Male, positive, 5
Barcelona, Female, suspect, 5
Madrid, Male, positive, 4
Madrid, Female, suspect, 2


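【Comments】:

  • Barcelona, Female, suspect, 7 and Barcelona, Female, suspect, 5 are duplicated in the output file - I suppose the row with 5 should not be there. And what have you tried?
  • Have you tried any of the solutions? How did it go? On Stack Exchange, after you ask a question and get answers, you should always come back with feedback.

Tags: bash csv sum aggregate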


【Answer 1】:

GNU datamash is designed for exactly this kind of task:

datamash -t, -sg1,2,3 sum 4 < input.csv

Or awk:

awk -F, '{ groups[$1 "," $2 "," $3] += $4}
         END { PROCINFO["sorted_in"] = "@ind_str_asc" # Sort output in GNU awk
               for (g in groups) print g "," groups[g] }' input.csv
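
The `PROCINFO["sorted_in"]` setting above only has an effect in GNU awk; in other awks the `for (g in groups)` loop visits groups in an unspecified order. As a hedged sketch, the variant below keeps groups in the order they first appear and also strips the space after each comma so the output matches the input layout. The function name `group_sum` and the temporary file are illustrative assumptions, not part of the answer.

```shell
#!/bin/sh

# group_sum FILE (hypothetical helper name): sum column 4 grouped by columns 1-3.
group_sum() {
    # FS is a regex: a comma followed by any number of spaces.
    awk -F', *' '
        {
            key = $1 ", " $2 ", " $3
            if (!(key in total)) order[++n] = key   # remember first appearance
            total[key] += $4
        }
        END {
            # Walk the keys in first-seen order instead of "for (k in total)".
            for (i = 1; i <= n; i++) print order[i] ", " total[order[i]]
        }
    ' "$1"
}

# Sample data copied from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Barcelona, Female, suspect, 2
Barcelona, Female, positive, 3
Barcelona, Female, positive, 2
Barcelona, Male, positive, 1
Barcelona, Female, suspect, 5
Madrid, Male, positive, 3
Madrid, Male, positive, 1
Barcelona, Male, positive, 4
Madrid, Female, suspect, 2
EOF

group_sum "$sample"
rm -f "$sample"
```

If sorted output is preferred, piping the result through `sort` gives the same effect as the GNU awk setting without depending on it.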

【Comments】:

    【Answer 2】:

    Using Miller (https://github.com/johnkerl/miller), run

    mlr --csv -N stats1 -a sum -f 4 -g 1,2,3 input.csv
    

    and you get

    Barcelona, Female, suspect,7
    Barcelona, Female, positive,5
    Barcelona, Male, positive,5
    Madrid, Male, positive,4
    Madrid, Female, suspect,2
    

    【Comments】:

      【Answer 3】:

      I wrote this script:

      #!/bin/bash
      
      # Barcelona, Female, Suspect
      bfs() {
              BFS_FILTER=$(egrep -i "[Bb]arcelona, [Ff]emale, [Ss]uspect" data.csv | awk '{ print $4 }')
              BFS_CASES=$(for ITEM in ${BFS_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
              BFS_SUM=$(if [ $(echo $BFS_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $BFS_CASES |wc -w) -eq 2 ]; then echo $BFS_CASES |awk '{ print $1 }'; else expr ${BFS_CASES}; fi; fi )
      
              echo "Barcelona, Female, suspect, $BFS_SUM"
      }
      
      # Barcelona, Male, Suspect
      bms() {
              BMS_FILTER=$(egrep -i "[Bb]arcelona, [Mm]ale, [Ss]uspect" data.csv | awk '{ print $4 }')
              BMS_CASES=$(for ITEM in ${BMS_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
              BMS_SUM=$(if [ $(echo $BMS_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $BMS_CASES |wc -w) -eq 2 ]; then echo $BMS_CASES |awk '{ print $1 }'; else expr ${BMS_CASES}; fi; fi )
      
              echo "Barcelona, Male, suspect, $BMS_SUM"
      }
      
      # Barcelona, Female, Positive
      bfp() {
              BFP_FILTER=$(egrep -i "[Bb]arcelona, [Ff]emale, [Pp]ositive" data.csv | awk '{ print $4 }')
              BFP_CASES=$(for ITEM in ${BFP_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
              BFP_SUM=$(if [ $(echo $BFP_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $BFP_CASES |wc -w) -eq 2 ]; then echo $BFP_CASES |awk '{ print $1 }'; else expr ${BFP_CASES}; fi; fi )
      
              echo "Barcelona, Female, positive, $BFP_SUM"
      }
      
      # Barcelona, Male, Positive
      bmp() {
              BMP_FILTER=$(grep -i "[Bb]arcelona, [Mm]ale, [Pp]ositive" data.csv | awk '{ print $4 }')
              BMP_CASES=$(for ITEM in ${BMP_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
              BMP_SUM=$(if [ $(echo $BMP_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $BMP_CASES |wc -w) -eq 2 ]; then echo $BMP_CASES |awk '{ print $1 }'; else expr ${BMP_CASES}; fi; fi )
      
              echo "Barcelona, Male, positive, $BMP_SUM"
      }
      
      # Madrid, Female, Suspect
      mfs() {
              MFS_FILTER=$(egrep -i "[Mm]adrid, [Ff]emale, [Ss]uspect" data.csv | awk '{ print $4 }')
              MFS_CASES=$(for ITEM in ${MFS_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
              MFS_SUM=$(if [ $(echo $MFS_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $MFS_CASES |wc -w) -eq 2 ]; then echo $MFS_CASES |awk '{ print $1 }'; else expr ${MFS_CASES}; fi; fi )
      
              echo "Madrid, Female, suspect, $MFS_SUM"
      }
      
      # Madrid, Male, Suspect
      mms() {
              MMS_FILTER=$(egrep -i "[Mm]adrid, [Mm]ale, [Ss]uspect" data.csv | awk '{ print $4 }')
              MMS_CASES=$(for ITEM in ${MMS_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
              MMS_SUM=$(if [ $(echo $MMS_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $MMS_CASES |wc -w) -eq 2 ]; then echo $MMS_CASES |awk '{ print $1 }'; else expr ${MMS_CASES}; fi; fi )
      
              echo "Madrid, Male, suspect, $MMS_SUM"
      }
      
      # Madrid, Female, Positive
      mfp() {
              MFP_FILTER=$(egrep -i "[Mm]adrid, [Ff]emale, [Pp]ositive" data.csv | awk '{ print $4 }')
              MFP_CASES=$(for ITEM in ${MFP_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
              MFP_SUM=$(if [ $(echo $MFP_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $MFP_CASES |wc -w) -eq 2 ]; then echo $MFP_CASES |awk '{ print $1 }'; else expr ${MFP_CASES}; fi; fi )
      
              echo "Madrid, Female, positive, $MFP_SUM"
      }
      
      # Madrid, Male, Positive
      mmp() {
              MMP_FILTER=$(grep -i "[Mm]adrid, [Mm]ale, [Pp]ositive" data.csv | awk '{ print $4 }')
              MMP_CASES=$(for ITEM in ${MMP_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g")
              MMP_SUM=$(if [ $(echo $MMP_CASES |egrep -v ^$ |wc -l) -eq 0 ]; then echo 0; else if [ $(echo $MMP_CASES |wc -w) -eq 2 ]; then echo $MMP_CASES |awk '{ print $1 }'; else expr ${MMP_CASES}; fi; fi )
      
              echo "Madrid, Male, positive, $MMP_SUM"
      }
      
      bfs
      bms
      bfp
      bmp
      mfs
      mms
      mfp
      mmp
      

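      It reads its data from a file I called data.csv, so to make it work you must replace every occurrence of data.csv in it with your own file name. For example, if you save it as script.sh and your input file is named yourfile.csv, you can try: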

      sed -i "s|data.csv|yourfile.csv|g" script.sh
      

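      The reason for splitting each output into its own function is that, if you ever want to filter some of them out, you can do so effortlessly by commenting out the function's name in the list of calls at the end. You can also add extra cases by copy-pasting the structure and renaming the variables accordingly.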
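
As a hedged alternative to copy-pasting one function per combination, the eight functions can be collapsed into a single parameterized one. This is only a sketch: the names `sum_cases` and `report` are made up for illustration, the file is passed as an argument instead of being hard-coded, and it compares whole fields in awk rather than grepping the line as the script does.

```shell
#!/usr/bin/env bash

# sum_cases FILE CITY GENDER STATUS (hypothetical helper, not from the answer).
# Prints one summary row; the total is 0 when no row matches.
sum_cases() {
    awk -F', *' -v c="$2" -v g="$3" -v s="$4" '
        # Compare whole fields, ignoring case.
        tolower($1) == tolower(c) && tolower($2) == tolower(g) && tolower($3) == tolower(s) { sum += $4 }
        END { print c ", " g ", " s ", " (sum + 0) }
    ' "$1"
}

# report FILE: one row per combination, in the same order as the calls
# at the end of the script (bfs, bms, bfp, bmp, mfs, mms, mfp, mmp).
report() {
    local file="$1" city status gender
    for city in Barcelona Madrid; do
        for status in suspect positive; do
            for gender in Female Male; do
                sum_cases "$file" "$city" "$gender" "$status"
            done
        done
    done
}

# Demo on the sample data from the question.
data=$(mktemp)
cat > "$data" <<'EOF'
Barcelona, Female, suspect, 2
Barcelona, Female, positive, 3
Barcelona, Female, positive, 2
Barcelona, Male, positive, 1
Barcelona, Female, suspect, 5
Madrid, Male, positive, 3
Madrid, Male, positive, 1
Barcelona, Male, positive, 4
Madrid, Female, suspect, 2
EOF
report "$data"
rm -f "$data"
```

Dropping a combination then means removing a value from one of the `for` lists or calling `sum_cases` directly, and a new city only needs one extra word in the outer loop.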

      Now for the explanation:

      The case counts are extracted from the matching rows with:

      egrep -i "$CITY, $GENRE, $STATUS" data.csv | awk '{ print $4 }'

      Then the sum expression over every row found is built like this:

      for ITEM in ${SOME_FILTER}; do echo $ITEM; done |tr \\n " " |sed "s| | + |g" |sed "s/\+ $//g"

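      When there are two or more rows, this yields the operands for expr $NUM + $NUM ... . When there are not (I found that some combinations appear on only one row), I handle the cases like this:

      • If only one row has cases, just print its value
      • If there are several rows, sum them
      • If there are no cases but the function is called, print 0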
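
The three cases listed above do not strictly need separate branches: if the running total starts at 0, a plain loop covers zero, one, or many rows alike. A minimal sketch, assuming the values are plain decimal integers without leading zeros (the function name `sum_items` is hypothetical and not part of the script):

```shell
#!/usr/bin/env bash

# sum_items N...: add up all arguments with shell arithmetic.
sum_items() {
    local total=0 item
    for item in "$@"; do
        total=$(( total + item ))
    done
    echo "$total"
}

sum_items        # no rows       -> 0
sum_items 3      # one row       -> 3
sum_items 2 5    # several rows  -> 7
```

In the script this would be called with the unquoted filter variable, for example `sum_items $BFS_FILTER`, so that each number becomes its own argument.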

      Finally, when no cases are detected, it still prints a row, simply through echo 0. Example output:

      Barcelona, Female, suspect, 7
      Barcelona, Male, suspect, 0
      Barcelona, Female, positive, 5
      Barcelona, Male, positive, 5
      Madrid, Female, suspect, 2
      Madrid, Male, suspect, 0
      Madrid, Female, positive, 0
      Madrid, Male, positive, 4
      

      Hope you find it useful.

      【Comments】:
