Python CSV - Check if index is equal on different rows

【Question title】: Python CSV - Check if index is equal on different rows
【Posted】: 2014-03-02 02:44:26
【Question】:

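I am trying to write code that checks whether the value in the index column of a CSV is the same on different rows and, if so, finds the most frequently occurring value in each of the other columns and uses it as the final data. That is not a great explanation, so basically I want to take this data.csv: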

customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1

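and create a new answer.csv that recognizes multiple rows for the same customer, finds the most frequently occurring value in each column, and outputs those values on a single row: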

customer_ID,month,ABC
1003,Jan,114
1004,Feb,251

I would also like to know: when values occur the same number of times (month and B for customer 1004), how do I choose which value to output?
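For the tie question, one possible rule is "the value that appears first in the file wins". The sketch below illustrates that rule with `collections.Counter`. The helper name `most_common_value` and its `prefer` parameter are made up for this illustration, and the last-seen variant is included only to show that the rule is a choice rather than something the data dictates.

```python
from collections import Counter


def most_common_value(values, prefer="first"):
    """Return the most frequent value in `values`.

    Ties are resolved by position: prefer="first" keeps the tied value
    seen earliest, prefer="last" keeps the one seen latest.
    (Hypothetical helper, not part of pandas or the csv module.)
    """
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    best = max(counts.values())
    # Values that share the highest count, in order of first appearance
    tied = [v for v in dict.fromkeys(values) if counts[v] == best]
    if prefer == "first":
        return tied[0]
    # Among the tied values, pick the one whose last occurrence is latest
    last_index = {v: i for i, v in enumerate(values)}
    return max(tied, key=last_index.get)


print(most_common_value(["Jan", "Jul", "Jan"]))      # Jan (no tie)
print(most_common_value(["Feb", "Jul"]))             # Feb (tie, first seen)
print(most_common_value(["5", "4"], prefer="last"))  # 4 (tie, last seen)
```

With the first-seen rule, customer 1004 resolves to Feb for month and 5 for B, which matches the desired answer.csv shown above.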

This is what I have written so far (thanks to Andy Hayden, who answered the previous question I just asked):

import pandas as pd
df = pd.read_csv('data.csv', index_col='customer_ID')
res = df[list('ABC')].astype(str).sum(1)
print(df)
res.to_frame(name='answer').to_csv('answer.csv')

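However, all this does is create the following (I previously ignored month, but now I want to include it so that I learn how to find not only the mode of a column of numbers but also the most frequently occurring string):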

customer_ID,ABC
1003,114.0
1003,113.0
1003,114.0
1004,251.0
1004,241.0

Note: I don't know why it outputs .0 at the end of ABC; it seems to be the wrong variable format. I want each value to be output as just the 3 digits.

Edit: I have another problem: if the value in column A is 0, the output becomes 2 digits and does not include the leading 0.
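A possible way around both symptoms (the trailing `.0` and the lost leading zero) is to read A, B and C as text rather than as numbers, so that concatenation works on the original characters. This is a sketch under that assumption, not a diagnosis of why the code above produced floats. The inline sample, including the row with 0 in column A, is invented for the illustration.

```python
import io

import pandas as pd

# Invented sample: the second row has 0 in column A
sample = io.StringIO(
    "customer_ID,A,B,C\n"
    "1003,1,1,4\n"
    "1005,0,2,7\n"
)

# dtype=str keeps every field exactly as it is written in the file
df = pd.read_csv(sample, dtype=str)

# String concatenation, column by column, gives one 3-character code per row
df["ABC"] = df["A"] + df["B"] + df["C"]

print(df[["customer_ID", "ABC"]].to_csv(index=False))
```

Because the values never become numbers, the row with A = 0 keeps its leading zero and comes out as 027.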

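【Comments】:

Why do you need this specific format? What is the end goal?
The end goal is a submission file for a competition. The format they want is customer_ID,ABC. They only want one row per customer_ID, so I was wondering whether there is a way to combine multiple rows that share a customer_ID and use the most frequently occurring data in those rows as the final single-row output for that customer.
What do you mean by "most occurring data"?
The final result I want is the second code block in my question. By most occurring data, I mean I want it to recognize that "customer_ID 1003 is on 3 rows. For month, the data are Jan, Jul, Jan." It recognizes that Jan occurs twice and Jul once, so it outputs 1003,Jan.
I think the csv bit here is noise; you should really try asking what you want to do with a pandas DataFrame!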

Tags: python csv pandas multiple-columns

【Answer 1】:

How about something like this? It doesn't use Pandas, though; I'm not a Pandas expert.

from collections import Counter

dataDict = {}

# Read the csv file, line by line
with open('data.csv', 'r') as dataFile:
    for line in dataFile:

        # strip the trailing newline, then split on ',' since it is a csv file...
        entry = line.strip().split(',')

        # Check to make sure that there is data in the line
        if entry and len(entry[0])>0:

            # if the customer_id is not in dataDict, add it
            if entry[0] not in dataDict:
                dataDict[entry[0]] = {'month':[entry[1]],
                                   'time':[entry[2]],
                                   'ABC':[''.join(entry[3:])],
                                   }
            # customer_id is already in dataDict, add values
            else:
                dataDict[entry[0]]['month'].append(entry[1])
                dataDict[entry[0]]['time'].append(entry[2])
                dataDict[entry[0]]['ABC'].append(''.join(entry[3:]))


# Now write the output file
with open('out.csv','w') as f:

    # Loop through sorted customers
    for customer in sorted(dataDict.keys()):

        # use Counter to find the most common entries
        commonMonth = Counter(dataDict[customer]['month']).most_common()[0][0]
        commonTime = Counter(dataDict[customer]['time']).most_common()[0][0]
        commonABC = Counter(dataDict[customer]['ABC']).most_common()[0][0]

        # Write the line to the csv file
        f.write(','.join([customer, commonMonth, commonTime, commonABC, '\n']))

It generates a file named out.csv that looks like this:

1003,Jan,2:00,114,
1004,Feb,8:00,251,
customer_ID,month,time,ABC,
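In the output above, the header ends up as the last row and every line carries a trailing comma, because the header is counted as just another customer and '\n' is joined in as an extra field. A variant that handles the header separately might look like the sketch below. It uses the standard `csv` module, takes the most common value per column (as the question asks) rather than per combined ABC string, and relies on `Counter.most_common` returning tied values in first-seen order (Python 3.7+). The function name `summarize_customers` and the in-memory sample are assumptions made for this example.

```python
import csv
import io
from collections import Counter, defaultdict

SAMPLE = """customer_ID,month,time,A,B,C
1003,Jan,2:00,1,1,4
1003,Jul,2:00,1,1,3
1003,Jan,2:00,1,1,4
1004,Feb,8:00,2,5,1
1004,Jul,8:00,2,4,1
"""


def summarize_customers(in_file, out_file):
    """Write one row per customer with the most common month and A, B, C."""
    reader = csv.DictReader(in_file)  # consumes the header row
    columns = ["month", "A", "B", "C"]

    # customer_ID -> column name -> values seen, in file order
    seen = defaultdict(lambda: defaultdict(list))
    for row in reader:
        for col in columns:
            seen[row["customer_ID"]][col].append(row[col])

    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["customer_ID", "month", "ABC"])
    for customer in sorted(seen):
        # Most frequent value per column; ties go to the value seen first
        mode = {
            col: Counter(values).most_common(1)[0][0]
            for col, values in seen[customer].items()
        }
        writer.writerow(
            [customer, mode["month"], mode["A"] + mode["B"] + mode["C"]]
        )


out = io.StringIO()
summarize_customers(io.StringIO(SAMPLE), out)
print(out.getvalue())
```

To work with real files instead of the in-memory sample, the two `io.StringIO` objects can be replaced with `open('data.csv', newline='')` and `open('answer.csv', 'w', newline='')`.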

【Comments】: