Print out values in a dictionary to a new csv file

【Question title】: Print out values in a dictionary to a new csv file
【Posted】: 2020-02-26 02:34:42
【Question description】:

I have a csv file that looks like this:

year,gender,age,country
2002,F,9-10,CO
2002,F,9-10,CO
2002,M,9-10,CO
2002,F,9-10,BR
2002,M,11-15,BR
2002,F,11-15,CO
2003,F,9-10,CO
2003,M,9-10,CO
2003,F,9-10,BR
2003,M,9-10,CO
2004,F,11-15,BR
2004,F,11-15,CO
2004,F,9-10,BR
2004,F,9-10,CO

I would like to get an output file like this:

year,gender,age,country,population
2002,F,9-10,CO,2
2002,M,9-10,CO,1
2002,F,9-10,BR,1
2002,M,9-10,BR,0
2002,F,11-15,CO,1
2002,M,11-15,CO,0
2002,F,11-15,BR,0
2002,M,11-15,BR,1
2003,F,9-10,CO,1
2003,M,9-10,CO,1
2003,F,9-10,BR,1
2003,M,9-10,BR,0
2003,F,11-15,CO,0
2003,M,11-15,CO,0
2004,F,9-10,CO,1
2004,M,9-10,CO,0
2004,F,9-10,BR,1
2004,M,9-10,BR,0
2004,F,11-15,CO,1
2004,M,11-15,CO,0
2004,F,11-15,BR,1
2004,M,11-15,BR,0

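Basically, I want to print the number of people of each gender for each year, age range and country, so year, gender, age and country will be the dictionary key. Some years have no data for a particular country, or no data for a particular age range in a particular country. For example, in 2003 country CO has no data for females aged 11-15. In that case the population should be 0. Some years also have no data at all for one gender. For example, 2004 has no male data for any age or country, but I still want those rows written to the output file with a population of 0.

Below is some Python code I wrote, but it does not work, and I don't know how to handle the missing data and print it as 0 in the population field.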

import csv
import os
import sys
from operator import itemgetter, attrgetter
import math
from collections import Counter

# Create dictionary to hold the data
valDic = {}

# Read data into dictionary
with open(sys.argv[1], "r",) as inputfile:
    readcsv = csv.reader(inputfile, delimiter = ',')    
    next(readcsv)
    for line in readcsv:
        key = line[0] + line[1] + line[2] + line[3]
        year = line[0]
        gender = line[1]
        age = line[2]
        country = line[3]
        if key in valDic:
            key = key + 1
        else:
            valDic[key] = [year, gender, age, country, 0] # 0s are placeholder for running sum and itemCount
    inputfile.close()  

newcsvfile = []

for key in valDic:
    newcsvfile.append([valDic[key][0], valDic[key][1], valDic[key][2], valDic[key][3], len(valDic[key])])

newcsvfile = sorted(newcsvfile)
newcsvfile = [["year", "gender", "age", "country", "population"]] 

with open(sys.argv[2], "w") as outputfile:
    writer = csv.writer(outputfile)
    writer.writerows(newcsvfile)

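【Comments】:

Using df.groupby(['year', 'gender', 'age', 'country']) you can count the rows.
You can start by creating a dictionary that holds every key with a value of 0. If some key does not appear in the csv, it simply stays 0 in your dictionary.
@furas Can you be more specific? The real file has more than 2 countries and age ranges, and I can't list them all by hand. I'm new to coding, so I don't know how to create a dict with all the keys the way you describe.
To get 0 for the missing values, you first have to build lists of all countries and all age ranges. With those lists you can check which entries are missing after filling the dictionary. Alternatively, you can create the dictionary with every item set to zero at the start and then add the values from the csv. Either way, you have to read the csv once to collect all countries and age ranges, and then read it again to build the dictionary.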
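
The two-pass idea from the last comment could be sketched roughly as follows. This is only an illustration, not code from the thread: the function name `count_with_zeros` and the inline `SAMPLE` data are assumptions, and the input is read from a string rather than from `sys.argv[1]` so it can be run as-is.

```python
import csv
import io
import itertools

# Hypothetical sample data, shortened from the question's file.
SAMPLE = """year,gender,age,country
2002,F,9-10,CO
2002,F,9-10,CO
2002,M,11-15,BR
"""

def count_with_zeros(text):
    """Count rows per (year, gender, age, country), including combinations that never occur."""
    rows = list(csv.reader(io.StringIO(text)))[1:]  # skip the header row

    # Pass 1: collect the distinct values seen in each of the four columns.
    uniques = [sorted({row[i] for row in rows}) for i in range(4)]

    # Start every possible combination at 0.
    counts = {key: 0 for key in itertools.product(*uniques)}

    # Pass 2: add 1 for every row that is actually present.
    for row in rows:
        counts[tuple(row)] += 1
    return counts

counts = count_with_zeros(SAMPLE)
```

Here the rows are kept in memory and looped over twice, which has the same effect as reading the file twice. Nothing has to be listed by hand, because the keys come from the data itself.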

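Tags: python pandas numpy data-cleaning

【Solution 1】:

We can store each combination of year, gender, age and country as a tuple and use it as the dictionary key. We also maintain a set of the unique values seen for each of these fields. We then iterate over every possible combination of those values, and if a combination has no data (for example, 2004 has only females and no males), we add it with a count of 0.

Demo: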

import csv
import sys

# Create dictionary to hold the data
valDic = {}

years, genders, age, country = set(), set(), set(), set()

# Read data into dictionary
with open(sys.argv[1], 'r',) as inputfile:

    reader = csv.reader(inputfile, delimiter = ',')
    next(reader)

    for row in reader:

        key = (row[0], row[1], row[2], row[3])

        years.add(key[0])
        genders.add(key[1])
        age.add(key[2])
        country.add(key[3])

        if key not in valDic:
            valDic[key]=0

        valDic[key]+=1


#Add missing combinations
for y in years:
    for g in genders:
        for a in age:
            for c in country:
                key = (y, g, a, c)
                if key not in valDic:
                    valDic[key]=0

#Prepare new CSV
newcsvfile = [["year", "gender", "age", "country", "population"]] 

for key, val in sorted(valDic.items()):
    newcsvfile.append([key[0], key[1], key[2], key[3], valDic[key]])

with open(sys.argv[2], "w", newline='') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerows(newcsvfile)

Output:

year,gender,age,country,population
2002,F,11-15,BR,0
2002,F,11-15,CO,1
2002,F,9-10,BR,1
2002,F,9-10,CO,2
2002,M,11-15,BR,1
2002,M,11-15,CO,0
2002,M,9-10,BR,0
2002,M,9-10,CO,1
2003,F,11-15,BR,0
2003,F,11-15,CO,0
2003,F,9-10,BR,1
2003,F,9-10,CO,1
2003,M,11-15,BR,0
2003,M,11-15,CO,0
2003,M,9-10,BR,0
2003,M,9-10,CO,2
2004,F,11-15,BR,1
2004,F,11-15,CO,1
2004,F,9-10,BR,1
2004,F,9-10,CO,1
2004,M,11-15,BR,0
2004,M,11-15,CO,0
2004,M,9-10,BR,0
2004,M,9-10,CO,0
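
Note that the rows above are ordered as strings, which is why 11-15 comes before 9-10. If a numeric ordering of the age ranges is preferred, one option is a custom sort key. A minimal sketch, assuming every age value has the form `low-high`; the helper name `age_sort_key` is made up for illustration and is not part of the answer:

```python
def age_sort_key(key):
    """Sort key for (year, gender, age, country) tuples that orders age ranges numerically."""
    year, gender, age, country = key
    low = int(age.split("-")[0])  # "9-10" -> 9, "11-15" -> 11
    return (year, gender, low, country)

keys = [
    ("2002", "F", "11-15", "CO"),
    ("2002", "F", "9-10", "CO"),
    ("2002", "F", "9-10", "BR"),
]
ordered = sorted(keys, key=age_sort_key)
```

In the demo above, this would presumably be used as `sorted(valDic.items(), key=lambda item: age_sort_key(item[0]))`.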

【Discussion】:

【Solution 2】:

I would use pandas for this.

I can read everything and create a DataFrame:

import sys
import pandas as pd

df = pd.read_csv(sys.argv[1])

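Using groupby, I can group the rows and count them to get the population for the combinations that exist in the data. This produces a list of lists with the columns in a different order, but later I convert it to a new DataFrame to change the column order and sort the rows.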

groups = df.groupby(['year', 'age', 'country', 'gender'])

data = []

for index, group in groups:
    data.append([*index, len(group)]) # create row with population

Using .unique() I can get all the unique values in each column.

unique_years     = df['year'].unique()
unique_genders   = df['gender'].unique()
unique_age       = df['age'].unique()
unique_countries = df['country'].unique()

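I use them with itertools.product to create every possible combination of year, age, country and gender (the same column order as in the groupby), check which combinations are missing from the data, and add those with a population of 0.

The combinations that already exist can be found in groups.indices.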

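import itertools

all_indices = groups.indices  # dict keyed by the existing (year, age, country, gender) tuples

# the argument order must match the groupby column order
for index in itertools.product(unique_years, unique_age, unique_countries, unique_genders):
    if index not in all_indices:
        data.append([*index, 0]) # add missing row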
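
To see what this step does in isolation, here is a small standalone sketch with only two columns. The values in `years`, `genders` and `existing` are made up for illustration:

```python
import itertools

years = [2002, 2003]
genders = ["F", "M"]

# Counts for the combinations that are present in the (hypothetical) data.
existing = {(2002, "F"): 2, (2002, "M"): 1, (2003, "F"): 1}

# product() yields every pairing, with the last argument varying fastest.
combos = list(itertools.product(years, genders))

# Anything that is not a key of `existing` is a missing row.
missing = [combo for combo in combos if combo not in existing]
```

The tuples produced by `itertools.product` have their elements in the same order as its arguments, which is why that order has to match the key order of the dictionary being checked.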

Now that I have all the data, I can convert it to a DataFrame to change the column order and sort the rows.

# create DataFrame with new values
final_df = pd.DataFrame(data, columns=['year', 'age', 'country', 'gender',  'population'])

# change columns order
final_df = final_df[['year', 'gender', 'age', 'country', 'population']]

# sort by 
final_df = final_df.sort_values(['year', 'age', 'country', 'gender'], ascending=[True, False, False, True])

Finally, I can save it to a new csv:

final_df.to_csv(sys.argv[2], index=False)

Full working example. I use io.StringIO to simulate a file in memory instead of reading from disk, so that anyone can copy and test it without the full data.

text = '''year,gender,age,country
2002,F,9-10,CO
2002,F,9-10,CO
2002,M,9-10,CO
2002,F,9-10,BR
2002,M,11-15,BR
2002,F,11-15,CO
2003,F,9-10,CO
2003,M,9-10,CO
2003,F,9-10,BR
2003,M,9-10,CO
2004,F,11-15,BR
2004,F,11-15,CO
2004,F,9-10,BR
2004,F,9-10,CO'''

#---------------------------------------

import pandas as pd

#df = pd.read_csv(sys.argv[1])

import io
df = pd.read_csv(io.StringIO(text))

print(df)

#---------------------------------------

groups = df.groupby(['year', 'age', 'country', 'gender'])

data = []

for index, group in groups:
    data.append([*index, len(group)])

#---------------------------------------

unique_years     = df['year'].unique()
unique_genders   = df['gender'].unique()
unique_age       = df['age'].unique()
unique_countries = df['country'].unique()

#print('years    :', unique_years)
#print('genders  :', unique_genders)
#print('age      :', unique_age)
#print('countries:', unique_countries)

import itertools

all_indices = groups.indices

# the argument order must match the groupby column order
for index in itertools.product(unique_years, unique_age, unique_countries, unique_genders):
    if index not in all_indices:
        data.append([*index, 0])

#---------------------------------------

# create DataFrame with new values
final_df = pd.DataFrame(data, columns=['year', 'age', 'country', 'gender',  'population'])

# change columns order
final_df = final_df[['year', 'gender', 'age', 'country', 'population']]

# sort by 
final_df = final_df.sort_values(['year', 'age', 'country', 'gender'], ascending=[True, False, False, True])

# reset index
final_df = final_df.reset_index(drop=True)
print(final_df)

# save in file
#final_df.to_csv(sys.argv[2], index=False)
final_df.to_csv('output.csv', index=False)
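
As a possible alternative to the explicit loop, the zero rows can also be produced by reindexing the group sizes against the full product of unique values. This is a sketch of a different technique than the one used in the answer. The shortened sample data and the variable names `cols`, `full_index` and `result` are assumptions for illustration:

```python
import io
import pandas as pd

# Hypothetical sample, shortened from the question's data.
text = """year,gender,age,country
2002,F,9-10,CO
2002,F,9-10,CO
2002,M,9-10,CO
2004,F,11-15,BR
"""

df = pd.read_csv(io.StringIO(text))
cols = ["year", "gender", "age", "country"]

# Number of rows for each combination that actually occurs.
counts = df.groupby(cols).size()

# Every possible combination of the unique values in each column.
full_index = pd.MultiIndex.from_product(
    [df[c].unique() for c in cols], names=cols
)

# Combinations absent from `counts` are filled with 0.
result = counts.reindex(full_index, fill_value=0).reset_index(name="population")
```

`result` can then be sorted and written out with `to_csv(..., index=False)` in the same way as `final_df` above.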

【Discussion】: