【Question title】: Splitting/reading CSV file by distinct row
【Posted】: 2020-06-12 20:03:36
【Question】:

I have a CSV file with 3 columns.

Key,Branch,Account 
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905

I want to iterate over the rows, find the distinct values in the Key column (a, b and c), and split the rows into 3 separate PySpark DataFrames, one per key.

   a,213,234567
   a,454,457900
   a,562,340094
   a,200,456704


   b,400,850988
   b,590,344433


   c,565,678635
   c,300,453432
   c,555,563546
   c,001,660905

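【Comments】:

  • Is the output correct? a has 4 rows, but the output has 3.
  • If you are trying to save the different DataFrames as separate files on the file system, see my answer here: stackoverflow.com/questions/60048027/…. A python/pandas solution will not work for big data.
  • OK, so what is the question? Have you actually tried anything or done any research?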

Tags: python csv dataframe pyspark


【Solution 1】:

You can do this with the pandas library, which also lets you do much more with very little code. See the pandas documentation for details.

Here is code that produces the desired output. The rows are stored in a dictionary keyed by the Key column, so you can look up the rows for any key with rows_by_key[key], e.g. rows_by_key["a"].

import pandas

df = pandas.read_csv("data.csv", delimiter=",")

keys = df["Key"].unique()  # All unique keys, in order of first appearance

grouped = df.groupby("Key")  # Group (not sort) the rows by the value of column Key

rows_by_key = {}  # Maps each key to its rows as a list of lists
for key in keys:
    rows_by_key[key] = grouped.get_group(key).values.tolist()

for key in keys:
    print("{} : {}".format(key, rows_by_key[key]))

Output (pandas parses Branch as an integer, so 001 is printed as 1):

a : [['a', 213, 234567], ['a', 454, 457900], ['a', 562, 340094], ['a', 200, 456704]]

b : [['b', 400, 850988], ['b', 590, 344433]]

c : [['c', 565, 678635], ['c', 300, 453432], ['c', 555, 563546], ['c', 1, 660905]]
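
If each group is needed as a DataFrame rather than a list of lists, the same groupby can be iterated directly. The following is only a minimal sketch under a few assumptions: the sample data from the question is inlined through io.StringIO as a stand-in for data.csv, Branch is assumed to need its leading zeros (hence dtype=str), and the name frames_by_key is purely illustrative.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch runs without data.csv.
csv_text = """Key,Branch,Account
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905"""

# dtype=str keeps every column as text, so Branch "001" is not turned into 1.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs.
frames_by_key = {
    key: group.reset_index(drop=True) for key, group in df.groupby("Key")
}

for key, frame in frames_by_key.items():
    print(key, ":", len(frame), "rows")
    print(frame.to_string(index=False))
```

Each value in frames_by_key is an ordinary pandas DataFrame. As one of the comments on the question points out, this stays in memory on one machine, so it is a fit for small files rather than big data.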

【Comments】:

    【Solution 2】:

    Something like this?

    csv_string = """Key,Branch,Account
    a,213,234567
    a,454,457900
    a,562,340094
    a,200,456704
    b,400,850988
    b,590,344433
    c,565,678635
    c,300,453432
    c,555,563546
    c,001,660905"""
    
    import csv
    import io
    
    #
    # 1. Parse csv_string into a list of dicts (OrderedDicts before Python 3.8)
    #
    
    def parse_csv(string):
        # if you are reading from a file you don't need to do this
        # StringIO nonsense -- just pass the file to csv.DictReader()
        string_file = io.StringIO(string)
        reader = csv.DictReader(string_file)
        return list(reader)
    
    csv_table = parse_csv(csv_string)
    
    #
    # 2. Loop through each line of the table and get the key
    #  - If we have seen the key before, put the line in the list
    #    with other lines that had the same key
    #  - If not, start a new list for that key
    #
    
    result = {}
    
    for line in csv_table:
        key = line["Key"].strip()
        print(key, ":", line)
        if key in result:
            result[key].append(line)
        else:
            result[key] = [line]
    
    #
    # 3. Finally, print the result.
    # The lines will probably be easier to deal with if you keep them 
    # in their parsed form, but for readability we can join the values
    # of the line back into a string with commas
    #
    
    print(result)
    print("")
    
    for key_list in result.values():
        for line in key_list:
            print(",".join(line.values()))
        print("")
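
The grouping loop in step 2 can be written more compactly with collections.defaultdict, which removes the explicit "have we seen this key" branch. This is a sketch of the same idea and not a replacement for the answer's code; group_rows and its key_column parameter are hypothetical names introduced here for illustration.

```python
import csv
import io
from collections import defaultdict

csv_string = """Key,Branch,Account
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905"""


def group_rows(text, key_column="Key"):
    """Group parsed CSV rows by the value found in key_column."""
    groups = defaultdict(list)  # a missing key starts out as an empty list
    for row in csv.DictReader(io.StringIO(text)):
        groups[row[key_column].strip()].append(row)
    return dict(groups)


groups = group_rows(csv_string)

# Print each group as comma-joined lines, separated by a blank line.
for key, rows in groups.items():
    for row in rows:
        print(",".join(row.values()))
    print("")
```

Because the csv module leaves every field as a string, the Branch value 001 keeps its leading zeros here.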
    
    

    【Comments】:
