【Question title】: Counting number occurrences of certain words in entire CSV file as well as per row in Python
【Posted】: 2021-01-11 19:27:41
【Question description】:

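I am processing data from multiple servers and generating one CSV file per server. I have managed to compile the data from all servers into a single file. The data in the merged file looks like this: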

Description,dc1pp1sellv01,dc1pp2sellv01,dc2pp1sellv01
1.1 Database Placement,PASSED,PASSED,PASSED
1.2 Use dedicated least privilaged account,PASSED,PASSED,PASSED
1.3 Diable MySQL history,PASSED,PASSED,FAILED
2.1 Ensure old passwords is set to 1,PASSED,DEPRICATED,NA

Each server column in the file above can hold any one of the following result values:

["PASSED","FAILED","EXCEPTION","NA","DEPRECATED"]

From the CSV file above, I want to count the results and create a dataset like the one below:

Description,dc1pp1sellv01,dc1pp2sellv01,dc2pp1sellv01,PASSED,FAILED,EXCEPTION,NA,DEPRECATED
1.1 Database Placement,PASSED,PASSED,PASSED,3,0,0,0,0
1.2 Use dedicated least privilaged account,PASSED,PASSED,PASSED,3,0,0,0,0
1.3 Diable MySQL history,PASSED,PASSED,FAILED,2,1,0,0,0
2.1 Ensure old passwords is set to 1,PASSED,DEPRICATED,NA,1,0,0,1,1

【Question comments】:

    Tags: python dataframe csv dataset counter


    【Solution 1】:

    Here is a suggestion (rather verbose, to highlight what is happening):

    import csv
    
    events = ["PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED"]
    
    # Open files
    # newline='' lets the csv module handle line endings itself
    with open('data.csv', 'r', newline='') as csv_in, \
            open('data_out.csv', 'w', newline='') as csv_out:
    
        # Initialize csv-reader and -writer
        csv_reader, csv_writer = csv.reader(csv_in), csv.writer(csv_out)
    
        # Process header
        line_in = next(csv_reader)
        line_out = line_in + events
        csv_writer.writerow(line_out)
    
        # Process data
        for line_in in csv_reader:
            # Count over the server columns only (skip the description)
            results = line_in[1:]
            line_out = line_in + [results.count(event) for event in events]
            csv_writer.writerow(line_out)
    
    

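    I am assuming your data is in a file named data.csv; adjust the name as needed. I hope it works...

    PS: There is a typo in your sample data: DEPRICATED should be DEPRECATED. Because the code matches strings exactly, the misspelled value is not counted as DEPRECATED, which leads to unexpected output.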
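
If the source files cannot be corrected, one option is to normalize each cell before counting. This is only a minimal sketch: the `ALIASES` table and the helpers `normalize` and `count_events` are hypothetical names introduced for illustration and are not part of the answer above.

```python
# Hypothetical alias table: maps known misspellings to the canonical value.
ALIASES = {"DEPRICATED": "DEPRECATED"}

events = ["PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED"]


def normalize(entry):
    """Strip whitespace, upper-case, and replace known misspellings."""
    cleaned = entry.strip().upper()
    return ALIASES.get(cleaned, cleaned)


def count_events(row):
    """Return one count per event for the server columns of a row.

    row[0] is assumed to be the description column and is skipped.
    """
    values = [normalize(entry) for entry in row[1:]]
    return [values.count(event) for event in events]


row = ["2.1 Ensure old passwords is set to 1", "PASSED", "DEPRICATED", "NA"]
print(count_events(row))  # [1, 0, 0, 1, 1]
```

With the alias applied, the last sample row yields the counts shown in the question's expected output.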

    A more compact version without the unnecessary helper variables looks like this:

    import csv
    
    events = ["PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED"]
    with open('data.csv', 'r', newline='') as fin, \
            open('data_out.csv', 'w', newline='') as fout:
        in_, out = csv.reader(fin), csv.writer(fout)
        out.writerow(next(in_) + events)
        out.writerows(line + [sum(1 if event == entry else 0 for entry in line[1:])
                              for event in events]
                      for line in in_)
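
The question title also asks for counts over the entire file, which the per-row code does not produce. A rough sketch of one way to get them, assuming the same column layout as the sample data, is shown below. The function name `total_counts` is made up for illustration, and the sample is held in an in-memory `io.StringIO` only so that the snippet runs on its own.

```python
import csv
import io
from collections import Counter

events = ["PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED"]


def total_counts(csv_file):
    """Count each event across all server columns of all data rows."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row (if any)
    totals = Counter()
    for row in reader:
        totals.update(row[1:])  # row[0] is the description column
    # A Counter returns 0 for missing keys, so every event gets a value.
    return {event: totals[event] for event in events}


# Sample data from the question, held in memory for illustration only.
sample = io.StringIO(
    "Description,dc1pp1sellv01,dc1pp2sellv01,dc2pp1sellv01\n"
    "1.1 Database Placement,PASSED,PASSED,PASSED\n"
    "1.2 Use dedicated least privilaged account,PASSED,PASSED,PASSED\n"
    "1.3 Diable MySQL history,PASSED,PASSED,FAILED\n"
    "2.1 Ensure old passwords is set to 1,PASSED,DEPRICATED,NA\n"
)
print(total_counts(sample))
```

Note that the misspelled DEPRICATED in the sample is not counted as DEPRECATED here either, since the match is exact.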
    

    【Comments】:

      【Solution 2】:

      You can use Counter to count the occurrences of specific words. Assuming you have already read the .csv file into a string named input, you can do this:

      from collections import Counter
      
      res_values = ("PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED")
      
      input = ("Description,dc1pp1sellv01,dc1pp2sellv01,dc2pp1sellv01\n"
               "1.1 Database Placement,PASSED,PASSED,PASSED\n"
               "1.2 Use dedicated least privilaged account,PASSED,PASSED,PASSED\n"
               "1.3 Diable MySQL history,PASSED,PASSED,FAILED\n"
               "2.1 Ensure old passwords is set to 1,PASSED,DEPRICATED,NA")
      
      print('\n'.join(
          [line + ',' + ','.join(
              [str(Counter(line.split(','))[res])
               if i != 0
               else res
               for res in res_values]
          )
           for i, line in enumerate(input.split('\n'))]))
      

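      I used list comprehensions to keep the processing efficient (since the file may be very large), but here is a clearer version that does the same thing: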

      split = input.split('\n')                      # Split the input line by line
      for i, line in enumerate(split):               # For each line of the input
          if i == 0:                                 # Write full result name (for the first line)
              split[i] += ',' + ','.join(res_values)
          else:                                      # Count and write result occurrences
              counts = Counter(line.split(','))
              for res in res_values:
                  split[i] += ',' + str(counts[res])
      print('\n'.join(split))                        # Join the full string
      

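      I have presented a ready-to-run solution, but for performance it is of course better to read the file line by line than to hold its entire contents in a string variable as done here.

      【Comments】: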
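
The line-by-line idea could be sketched as follows, combining `Counter` with the `csv` module so that only one row is in memory at a time. The names `append_counts` and `process` and the file paths passed to `process` are assumptions for illustration. Unlike the string-based code above, this sketch counts only the server columns, not the description.

```python
import csv
from collections import Counter

res_values = ("PASSED", "FAILED", "EXCEPTION", "NA", "DEPRECATED")


def append_counts(rows):
    """Yield each row extended with its counts; the first row is the header."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:  # empty input: nothing to yield
        return
    yield header + list(res_values)
    for row in rows:
        counts = Counter(row[1:])  # skip the description column
        yield row + [counts[res] for res in res_values]


def process(in_path, out_path):
    """Stream in_path to out_path, adding the count columns row by row."""
    with open(in_path, newline='') as fin, \
            open(out_path, 'w', newline='') as fout:
        csv.writer(fout).writerows(append_counts(csv.reader(fin)))
```

Calling `process('data.csv', 'data_out.csv')` would then write the extended file. Because `append_counts` is a generator, it also works on any iterable of rows, which makes it easy to try out without touching the disk.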
