How to manipulate a CSV in Python with the same values in a column and create a new one with unique values: answers

【Question title】: How to manipulate a CSV in python with same values on a column and create a new one with unique values
【Posted】: 2019-04-04 08:11:07
【Question】:

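My problem is that I have a CSV file in which one column (here unique_code) holds the same value on several rows. I want to create a new CSV in which each value of that column appears only once, with the differing values from another column (here alternative_code) joined by spaces.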

Here is my CSV:

unique_code description alternative_code

33;product1;58

43;product2;95

33;product1;62

68;product3;11

43;product2;99

The CSV result I want:

33;product1;58 62

43;product2;95 99

68;product3;11

Any ideas on how to produce my new CSV?

【Comments】:

Have you looked at pandas? If not, what have you tried so far to parse the input file?
@cricket_007 No, I haven't checked it out yet, but I will.

Tags: python csv parsing

【Solution 1】:

You could try something like this:

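# Placeholder file names; replace them with your own paths.
input_filename = "input.csv"
output_filename = "output.csv"

vals = {}   # unique_code -> list of alternative codes
names = {}  # unique_code -> description
with open(input_filename, 'r') as file:
    for line in file:
        l = line.strip()
        if not l:
            continue  # skip blank lines
        l = l.split(";")
        if l[0] in vals:  # dict.has_key() no longer exists in Python 3
            vals[l[0]].append(l[2])
        else:
            vals[l[0]] = [l[2]]
            names[l[0]] = l[1]

with open(output_filename, 'w') as file:
    for key in vals:
        # unique_code;description;alternative codes separated by spaces
        file.write(key + ";" + names[key] + ";" + " ".join(vals[key]) + "\n")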

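【Discussion】:

This is a nice approach, but it only partly works for me. Sorry, I failed to mention that a unique_code may also show up as an alternative_code value. So I would also like a constraint: if the unique_code value is present as an alternative_code, it should not be added to the space-separated alternative codes. Wrong, e.g.: 33;product1;**33** 58 62 Correct, e.g.: 33;product1;58 62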
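
The constraint described in this comment can be sketched with the standard library alone. This is a minimal sketch, not the answerer's code: the function name `merge_alternative_codes` and the in-memory sample (the question's data plus one assumed row whose alternative code equals its unique code) are hypothetical, and it assumes three `;`-separated columns with no header row.

```python
import csv
import io

def merge_alternative_codes(lines):
    """Group rows by unique_code and collect their alternative codes.

    Hypothetical helper: skips an alternative code that equals its own
    unique_code and drops repeated codes, keeping first-seen order.
    """
    merged = {}  # unique_code -> (description, list of alternative codes)
    for row in csv.reader(lines, delimiter=";"):
        if not row:
            continue  # ignore blank lines
        unique_code, description, alt_code = row[0], row[1], row[2]
        _, codes = merged.setdefault(unique_code, (description, []))
        if alt_code != unique_code and alt_code not in codes:
            codes.append(alt_code)
    return [[code, desc, " ".join(alts)] for code, (desc, alts) in merged.items()]

# Sample data from the question, plus one assumed row (33;product1;33)
# where the unique code repeats as its own alternative code.
sample = (
    "33;product1;33\n33;product1;58\n43;product2;95\n"
    "33;product1;62\n68;product3;11\n43;product2;99\n"
)
rows = merge_alternative_codes(io.StringIO(sample))

out = io.StringIO()
csv.writer(out, delimiter=";", lineterminator="\n").writerows(rows)
print(out.getvalue(), end="")
```

For the sample above this prints `33;product1;58 62`, `43;product2;95 99` and `68;product3;11`, one per line. To work on real files, an open file object (opened with `newline=""`) can be passed in place of the `StringIO` objects.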

【Solution 2】:

import csv

with open("my_file.csv", 'r') as fd:
    # import the CSV as a list of lists and drop blank lines
    data = [i for i in csv.reader(fd, delimiter=';') if i]
    result = []
    for value in data:
        # check whether this product is already in result
        if value[1] not in [r[1] for r in result if r]:
            # add the new product to result with all values for the same product
            result.append([value[0],
                           value[1],
                           ' '.join([line[2] for line in data if line[1] == value[1]])
                         ])
    print(result)
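
Solution 2 only prints the grouped list. A minimal sketch of writing such a list back out as a `;`-delimited file with `csv.writer` might look like the following. The output file name `grouped.csv` is an assumption, and the hard-coded `result` list simply mirrors the question's expected output so that the snippet runs on its own.

```python
import csv

# Assumed to be the list built by the loop above; shown here for the
# question's sample data so the snippet is self-contained.
result = [
    ["33", "product1", "58 62"],
    ["43", "product2", "95 99"],
    ["68", "product3", "11"],
]

# newline="" lets the csv module control the line endings itself.
with open("grouped.csv", "w", newline="") as fd:
    writer = csv.writer(fd, delimiter=";")
    writer.writerows(result)
```

Using `csv.writer` instead of manual string concatenation also takes care of quoting if a field ever contains the delimiter.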

【Discussion】:

【Solution 3】:

In the end I arrived at this solution:

# -*- coding: utf-8 -*-
import csv

input_file_1 = "eidi.csv"
output_file = "output.csv"

parsed_dictionary = {}

def concatenate_alter_codes(alter_code_list):
    # Join the alternative codes with single spaces
    return " ".join(alter_code_list)

# Read the input CSV file and build a dictionary that maps each unique code
# to [list of alter codes, then the remaining columns of its first row].
# Note: this expects seven columns per row (row[0] .. row[6]).
with open(input_file_1, 'r') as f:
    # use ; as the delimiter
    input_csv = csv.reader(f, delimiter=';')
    for row in input_csv:
        # if the key already exists, only collect the extra alter code
        if row[0] in parsed_dictionary:
            parsed_dictionary[row[0]][0].append(row[2])
        else:
            parsed_dictionary[row[0]] = [[row[2]], row[1], row[3], row[4], row[5], row[6]]

# Create the new CSV file with concatenated alter codes
with open(output_file, 'w') as f:
    for key in parsed_dictionary:
        entry = parsed_dictionary[key]
        # column order: unique code; alter codes; remaining columns
        f.write(";".join([key, concatenate_alter_codes(entry[0])] + entry[1:]) + "\n")

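【Discussion】:

【Solution 4】:

littletable is a thin CSV wrapper that I wrote a number of years ago. Tables in littletable are lists of objects, with some helper methods for filtering, joining, and pivoting, plus easy import/export of CSV, JSON, and fixed-format data. Like pandas, it helps with data import/export, but it does not have all of pandas' other numerical analysis features. It also keeps all data in memory as a list of Python objects, so it will not handle millions of rows the way pandas does. But if your needs are modest, littletable may have a shorter learning curve.

To load your initial raw data into a littletable Table, start with: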

import littletable as lt
data = open('raw_data.csv')
tt = lt.Table().csv_import(data, fieldnames="id name altid".split(), delimiter=';')

(If your input file has a header row, csv_import will use it and you won't need to specify fieldnames.)

Printing the rows looks just like iterating over a list:

for row in tt:
    print(row)

Prints:

{'name': 'product1', 'altid': '58', 'id': '33'}
{'name': 'product2', 'altid': '95', 'id': '43'}
{'name': 'product1', 'altid': '62', 'id': '33'}
{'name': 'product3', 'altid': '11', 'id': '68'}
{'name': 'product2', 'altid': '99', 'id': '43'}

Since we will be grouping and joining on the id attribute, we add an index:

tt.create_index("id")

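(Unique indexes can also be created, but in this case your raw input contains duplicate values with the same id.)

Tables can be grouped by one or more attributes, and each group of records can then be passed to a function that returns an aggregated value for that group. In your case, you want to collect all the altids for each product id.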

def aggregate_altids(rows):
    return ' '.join(set(row.altid for row in rows if row.altid != row.id))
grouped_altids = tt.groupby("id", altids=aggregate_altids)

for row in grouped_altids:
    print(row)

Gives:

{'altids': '62 58', 'id': '33'}
{'altids': '99 95', 'id': '43'}
{'altids': '11', 'id': '68'}
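
Because `set` is unordered, the codes within each group can come out in any order (for example `62 58` above). If the original row order matters, one possible variant of the aggregation function, sketched here under the hypothetical name `aggregate_altids_ordered`, uses `dict.fromkeys` to drop duplicates while keeping first-seen order. The `SimpleNamespace` rows merely stand in for littletable records so that the snippet runs without the library installed.

```python
from types import SimpleNamespace

def aggregate_altids_ordered(rows):
    # dict.fromkeys keeps the first occurrence of each key in insertion
    # order, so duplicates are removed without shuffling the codes.
    codes = dict.fromkeys(row.altid for row in rows if row.altid != row.id)
    return " ".join(codes)

# Stand-in records with the same attributes as the table rows above
# (the duplicated 58 and the self-referencing 33 are assumed test data).
group = [
    SimpleNamespace(id="33", name="product1", altid="58"),
    SimpleNamespace(id="33", name="product1", altid="33"),
    SimpleNamespace(id="33", name="product1", altid="62"),
    SimpleNamespace(id="33", name="product1", altid="58"),
]
print(aggregate_altids_ordered(group))  # 58 62
```

The function takes the same argument as `aggregate_altids`, so it is meant as a drop-in replacement in the `groupby` call.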

Now we join this table with the original tt table on id, and collapse the duplicates:

tt2 = (grouped_altids.join_on('id') + tt)().unique("id")

And print out the results:

for row in tt2:
    print("{id};{name};{altids}".format_map(vars(row)))

Gives:

33;product1;58 62
43;product2;95 99
68;product3;11

The complete code, without the debugging steps, is:

# import
import littletable as lt
with open('raw_data.csv') as data:
    tt = lt.Table().csv_import(data, fieldnames="id name altid".split(), delimiter=';')
tt.create_index("id")

# group
def aggregate_altids(rows):
    return ' '.join(set(row.altid for row in rows if row.altid != row.id))
grouped_altids = tt.groupby("id", alt_ids=aggregate_altids)

# join, dedupe, and sort
tt2 = (grouped_altids.join_on('id') + tt)().unique("id").sort("id")

# output
for row in tt2:
    print("{id};{name};{alt_ids}".format_map(vars(row)))

【Discussion】: