Splitting/reading a CSV file by distinct row: answers

【Question title】: Splitting/reading CSV file by distinct row
【Posted】: 2020-06-12 20:03:36
【Question description】:

I have a CSV file with 3 columns.

Key,Branch,Account 
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905

I want to iterate over the rows, find the distinct values in the Key column (a, b and c), and split the rows into 3 separate PySpark DataFrames, one per key.

   a,213,234567
   a,454,457900
   a,562,340094
   a,200,456704


   b,400,850988
   b,590,344433


   c,565,678635
   c,300,453432
   c,555,563546
   c,001,660905

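【Question comments】:

Is the output correct? a has 4 rows, but the output has 3.
If you are trying to save the different DataFrames as separate files in the file system, see my answer here: stackoverflow.com/questions/60048027/…. The python/pandas solutions will not work for big data.
OK, so what is the problem? Have you actually tried anything or done any research?

Tags: python csv dataframe pyspark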

【Solution 1】:

You can do this with the pandas library, which also lets you do more with very little code. See the pandas documentation for details.

Here is code that produces the desired output. The rows are stored in a dictionary, so you can fetch the rows for a key with rows_by_key[key], e.g. rows_by_key["a"].

import pandas

df = pandas.read_csv("data.csv", delimiter=",")

keys = df["Key"].unique()  # all unique values of the Key column

grouped = df.groupby("Key")  # group (not sort) the rows by the value of Key

rows_by_key = {}  # maps each key to its rows, as a list of lists
for key in keys:
    rows_by_key[key] = grouped.get_group(key).values.tolist()

for key in keys:
    print("{} : {}".format(key, rows_by_key[key]))

Output:

a : [['a', 213, 234567], ['a', 454, 457900], ['a', 562, 340094], ['a', 200, 456704]]

b : [['b', 400, 850988], ['b', 590, 344433]]

c : [['c', 565, 678635], ['c', 300, 453432], ['c', 555, 563546], ['c', 1, 660905]]
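Two details of the output above may matter in practice: the branch 001 comes back as 1 because pandas infers an integer column, and each group is a plain list rather than a DataFrame. The following is a minimal sketch of one way to address both, assuming pandas is installed. The names `split_frame_by_key` and `CSV_TEXT` are made up for illustration, and this is still pandas, not PySpark.

```python
import io

import pandas as pd

# Sample data from the question. The header's trailing space is kept on purpose.
CSV_TEXT = "\n".join([
    "Key,Branch,Account ",
    "a,213,234567",
    "a,454,457900",
    "a,562,340094",
    "a,200,456704",
    "b,400,850988",
    "b,590,344433",
    "c,565,678635",
    "c,300,453432",
    "c,555,563546",
    "c,001,660905",
])


def split_frame_by_key(csv_source, key_column="Key"):
    """Return a dict mapping each distinct key to its own DataFrame."""
    # dtype=str keeps every field as text, so "001" is not parsed as the integer 1.
    df = pd.read_csv(csv_source, dtype=str)
    # Remove stray whitespace around the column names.
    df.columns = df.columns.str.strip()
    # Iterating over a GroupBy yields (key, sub-DataFrame) pairs.
    return {
        key: group.reset_index(drop=True)
        for key, group in df.groupby(key_column)
    }


frames = split_frame_by_key(io.StringIO(CSV_TEXT))
for key, frame in frames.items():
    print(key)
    print(frame)
```

Passing a file path instead of the `io.StringIO` object should work the same way, since `read_csv` accepts both.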

【Comments】:

【Solution 2】:

Something like this?

csv_string = """Key,Branch,Account 
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905"""

import csv
import io

#
# 1. Parse csv_string into a list of dicts, one per row
#    (OrderedDicts on Python 3.6 and 3.7)
#

def parse_csv(string):
    # if you are reading from a file you don't need to do this
    # StringIO nonsense -- just pass the file to csv.DictReader()
    string_file = io.StringIO(string)
    reader = csv.DictReader(string_file)
    return list(reader)

csv_table = parse_csv(csv_string)

#
# 2. Loop through each line of the table and get the key
#  - If we have seen the key before, put the line in the list
#    with other lines that had the same key
#  - If not, start a new list for that key
#

result = {}

for line in csv_table:
    key = line["Key"].strip()
    print(key, ":", line)
    if key in result:
        result[key].append(line)
    else:
        result[key] = [line]

#
# 3. Finally, print the result.
# The lines will probably be easier to deal with if you keep them 
# in their parsed form, but for readability we can join the values
# of the line back into a string with commas
#

print(result)
print("")

for key_list in result.values():
    for line in key_list:
        print(",".join(line.values()))
    print("")
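The grouping in step 2 can also be written with `collections.defaultdict`, and each group can be turned back into CSV text with `csv.DictWriter`, which is useful if every key should end up in its own file. This is a sketch using only the standard library. The helper names `split_rows_by_key` and `rows_to_csv` are hypothetical, and the header names are stripped because the sample header ends in a space ("Account ").

```python
import csv
import io
from collections import defaultdict

# Sample data from the question. The header's trailing space is kept on purpose.
CSV_TEXT = "\n".join([
    "Key,Branch,Account ",
    "a,213,234567",
    "a,454,457900",
    "a,562,340094",
    "a,200,456704",
    "b,400,850988",
    "b,590,344433",
    "c,565,678635",
    "c,300,453432",
    "c,555,563546",
    "c,001,660905",
])


def split_rows_by_key(text, key_column="Key"):
    """Group parsed CSV rows by the value of key_column."""
    reader = csv.DictReader(io.StringIO(text))
    # Reading fieldnames consumes the header row; strip stray whitespace.
    reader.fieldnames = [name.strip() for name in reader.fieldnames]
    groups = defaultdict(list)
    for row in reader:
        groups[row[key_column].strip()].append(row)
    return reader.fieldnames, dict(groups)


def rows_to_csv(fieldnames, rows):
    """Serialize one group of rows back into CSV text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


fieldnames, groups = split_rows_by_key(CSV_TEXT)
for key, rows in groups.items():
    print(rows_to_csv(fieldnames, rows))
```

To write files instead of printing, the string returned by `rows_to_csv` could be saved under a name derived from the key. The values stay as strings throughout, so the leading zeros in 001 are preserved.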

【Comments】: