【问题标题】:Counting data within ranges in csv计算csv范围内的数据
【发布时间】:2013-10-14 07:32:26
【问题描述】:

我有一些数据需要分解成可管理的块。使用以下数据,我需要计算 x 在第 11 列中出现的次数,第 7 列是 1,以及数字 x 在第 11 列中出现的次数。我需要将它们放入 csv 的第一行。之后我需要计算同样的东西,但第 11 列是以下括号:

0

">0 但

">0.05 但

">0.1 但

理想情况下,所有这些都将附加到同一个 new.csv 中,即不是主要数据 csv

一些符合上述描述的示例原始数据(请注意,很多括号将不包含数据。在这种情况下,它们需要返回 0,0:

01/01/2002,Data,class1,4,11yo+,4,1,George Smith,0,0,x
01/01/2002,Data,class1,4,11yo+,4,2,Ted James,0,0,x
01/01/2002,Data,class1,4,11yo+,4,3,Emma Lilly,0,0,x
01/01/2002,Data,class1,4,11yo+,4,5,George Smith,0,0,x
02/01/2002,Data,class2,4,10yo+,6,4,Tom Phillips,0,0,x
02/01/2002,Data,class2,4,10yo+,6,2,Tom Phillips,0,0,x
02/01/2002,Data,class2,4,10yo+,6,5,George Smith,1,2,0.5
02/01/2002,Data,class2,4,10yo+,6,3,Tom Phillips,0,0,x
02/01/2002,Data,class2,4,10yo+,6,1,Emma Lilly,0,1,0
02/01/2002,Data,class2,4,10yo+,6,6,George Smith,1,2,0.5
03/01/2002,Data,class3,4,10yo+,6,6,Ted James,0,1,0
03/01/2002,Data,class3,4,10yo+,6,3,Tom Phillips,0,3,0
03/01/2002,Data,class3,4,10yo+,6,2,George Smith,1,4,0.25
03/01/2002,Data,class3,4,10yo+,6,4,George Smith,1,4,0.25
03/01/2002,Data,class3,4,10yo+,6,1,George Smith,1,4,0.25
03/01/2002,Data,class3,4,10yo+,6,5,Tom Phillips,0,3,0
04/01/2002,Data,class4,2,10yo+,5,3,Emma Lilly,1,2,0.5
04/01/2002,Data,class4,2,10yo+,5,1,Ted James,0,2,0
04/01/2002,Data,class4,2,10yo+,5,2,George Smith,2,7,0.285714286
04/01/2002,Data,class4,2,10yo+,5,4,Emma Lilly,1,2,0.5
04/01/2002,Data,class4,2,10yo+,5,5,Tom Phillips,0,5,0
05/01/2002,Data,class5,4,11yo+,4,1,George Smith,2,8,0.25
05/01/2002,Data,class5,4,11yo+,4,2,Ted James,1,3,0.333333333
05/01/2002,Data,class5,4,11yo+,4,3,Emma Lilly,1,4,0.25
05/01/2002,Data,class5,4,11yo+,4,5,George Smith,2,8,0.25
06/01/2002,Data,class6,4,10yo+,6,4,Tom Phillips,0,6,0
06/01/2002,Data,class6,4,10yo+,6,2,Tom Phillips,0,6,0
06/01/2002,Data,class6,4,10yo+,6,5,George Smith,3,10,0.3
06/01/2002,Data,class6,4,10yo+,6,3,Tom Phillips,0,6,0
06/01/2002,Data,class6,4,10yo+,6,1,Emma Lilly,1,5,0.2
06/01/2002,Data,class6,4,10yo+,6,6,George Smith,3,10,0.3
07/01/2002,Data,class7,4,10yo+,6,6,Ted James,1,4,0.25
07/01/2002,Data,class7,4,10yo+,6,3,Tom Phillips,0,9,0
07/01/2002,Data,class7,4,10yo+,6,2,George Smith,3,12,0.25
07/01/2002,Data,class7,4,10yo+,6,4,George Smith,3,12,0.25
07/01/2002,Data,class7,4,10yo+,6,1,George Smith,3,12,0.25
07/01/2002,Data,class7,4,10yo+,6,5,Tom Phillips,0,9,0
08/01/2002,Data,class8,2,10yo+,5,3,Emma Lilly,2,6,0.333333333
08/01/2002,Data,class8,2,10yo+,5,1,Ted James,1,5,0.2
08/01/2002,Data,class8,2,10yo+,5,2,George Smith,4,15,0.266666667
08/01/2002,Data,class8,2,10yo+,5,4,Emma Lilly,2,6,0.333333333
08/01/2002,Data,class8,2,10yo+,5,5,Tom Phillips,0,11,0
09/01/2002,Data,class9,4,11yo+,4,1,George Smith,4,16,0.25
09/01/2002,Data,class9,4,11yo+,4,2,Ted James,2,6,0.333333333
09/01/2002,Data,class9,4,11yo+,4,3,Emma Lilly,2,8,0.25
09/01/2002,Data,class9,4,11yo+,4,5,George Smith,4,16,0.25
10/01/2002,Data,class10,4,10yo+,6,4,Tom Phillips,0,12,0
10/01/2002,Data,class10,4,10yo+,6,2,Tom Phillips,0,12,0
10/01/2002,Data,class10,4,10yo+,6,5,George Smith,5,18,0.277777778
10/01/2002,Data,class10,4,10yo+,6,3,Tom Phillips,0,12,0
10/01/2002,Data,class10,4,10yo+,6,1,Emma Lilly,2,9,0.222222222
10/01/2002,Data,class10,4,10yo+,6,6,George Smith,5,18,0.277777778
11/01/2002,Data,class11,4,10yo+,6,6,Ted James,2,7,0.285714286
11/01/2002,Data,class11,4,10yo+,6,3,Tom Phillips,0,15,0
11/01/2002,Data,class11,4,10yo+,6,2,George Smith,5,20,0.25
11/01/2002,Data,class11,4,10yo+,6,4,George Smith,5,20,0.25
11/01/2002,Data,class11,4,10yo+,6,1,George Smith,5,20,0.25
11/01/2002,Data,class11,4,10yo+,6,5,Tom Phillips,0,15,0
12/01/2002,Data,class12,2,10yo+,5,3,Emma Lilly,3,10,0.3
12/01/2002,Data,class12,2,10yo+,5,1,Ted James,2,8,0.25
12/01/2002,Data,class12,2,10yo+,5,2,George Smith,6,23,0.260869565
12/01/2002,Data,class12,2,10yo+,5,4,Emma Lilly,3,10,0.3
12/01/2002,Data,class12,2,10yo+,5,5,Tom Phillips,0,17,0
13/01/2002,Data,class13,4,11yo+,4,1,George Smith,6,24,0.25
13/01/2002,Data,class13,4,11yo+,4,2,Ted James,3,9,0.333333333
13/01/2002,Data,class13,4,11yo+,4,3,Emma Lilly,3,12,0.25
13/01/2002,Data,class13,4,11yo+,4,5,George Smith,6,24,0.25
14/01/2002,Data,class14,4,10yo+,6,4,Tom Phillips,0,18,0
14/01/2002,Data,class14,4,10yo+,6,2,Tom Phillips,0,18,0
14/01/2002,Data,class14,4,10yo+,6,5,George Smith,7,26,0.269230769
14/01/2002,Data,class14,4,10yo+,6,3,Tom Phillips,0,18,0
14/01/2002,Data,class14,4,10yo+,6,1,Emma Lilly,3,13,0.230769231
14/01/2002,Data,class14,4,10yo+,6,6,George Smith,7,26,0.269230769
15/01/2002,Data,class15,4,10yo+,6,6,Ted James,3,10,0.3

如果有人可以帮助我实现这一目标,我将不胜感激。如果这需要更多详细信息,请询问。

最后一点,相关的 csv 有主要数据 相关的 csv 有 800k 行。

编辑

目前使用@user650654提供的代码输出文件如下所示:

data1,data2

如果可能的话,我希望对代码稍作改动,以多输出两件事。希望这些不会太难做。对输出文件的建议更改(逗号代表每个新行):

title row labeling the row (e.g. "x" or "0:0.05",Calculated avereage of values within each bracket e.g."0.02469",data1,data2

所以实际上它可能看起来像这样:

x,n/a,data1,data2
0:0.05,0.02469,data1,data2
0.05:0.1,0.5469,data1,data2
....
....

Column1 = 行标签(原始问题中计算的数据范围,即从 0 到 0.05 Column2 = 特定范围内的计算平均值。 IE。如果 请注意 data1 和 data2 是问题最初要求的两个值。 第 1 列

非常感谢 AEA

【问题讨论】:

  • 到目前为止你得到了什么?你知道如何读取文件吗?
  • “特定范围内的计算平均值”:是 data1 值的平均值还是 data2 值的平均值?即第 7 列等于 1 的值的平均值或该范围的 所有 值的平均值?
  • 啊,好点,该范围内所有值的平均值

标签: python csv python-2.7


【解决方案1】:

这是添加两个新字段的解决方案:

import csv
import numpy


def count(infile='data.csv', outfile='new.csv'):
    bins = numpy.arange(0, 1.05, 0.05)

    total_x = 0
    col7one_x = 0

    total_zeros = 0
    col7one_zeros = 0

    all_array = []
    col7one_array = []

    with open(infile, 'r') as fobj:
        reader = csv.reader(fobj)
        for line in reader:
            if line[10] == 'x':
                total_x += 1
                if line[6] == '1':
                    col7one_x += 1
            elif line[10] == '0':
                # assumes zero is represented as "0" and not as say, "0.0"
                total_zeros += 1
                if line[6] == '1':
                    col7one_zeros += 1
            else:
                val = float(line[10])
                all_array.append(val)
                if line[6] == '1':
                    col7one_array.append(val)

    all_array = numpy.array(all_array)
    hist_all, edges = numpy.histogram(all_array, bins=bins)
    hist_col7one, edges = numpy.histogram(col7one_array, bins=bins)
    bin_ranges = ['%s:%s' % (x, y) for x, y in zip(bins[:-1], bins[1:])]

    digitized = numpy.digitize(all_array, bins)
    bin_means = [all_array[digitized == i].mean() if hist_all[i - 1] else 'n/a' for i in range(1, len(bins))]


    with open(outfile, 'w') as fobj:
        writer = csv.writer(fobj)
        writer.writerow(['x', 'n/a', col7one_x, total_x])
        writer.writerow(['0', 0 if total_zeros else 'n/a', col7one_zeros, total_zeros])
        for row in zip(bin_ranges, bin_means, hist_col7one, hist_all):
            writer.writerow(row)


if __name__ == '__main__':
    count()

【讨论】:

  • 请有机会获得两个脚本的评论,非常感谢 AEA
  • 我意识到没有必要评论其他脚本,但您为此提供的任何 cmets 将不胜感激。
  • 嘿抱歉再次询问,这段代码运行良好,但我真的可以使用 cmets 帮助我编辑它。
  • 您好@user650654 再次感谢此代码,我已授予它赏金,但我真的非常感谢代码上的一些 cmets,使我能够更有效地对其进行逆向工程。非常感谢 AEA
【解决方案2】:

这可能有效:

import numpy as np
import pandas as pd


column_names = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 
              'col7', 'col8', 'col9', 'col10', 'col11'] #names to be used as column labels.  If no names are specified then columns can be refereed to by number eg. df[0], df[1] etc.

df = pd.read_csv('data.csv', header=None, names=column_names) #header= None means there are no column headings in the  csv file

df.ix[df.col11 == 'x', 'col11']=-0.08 #trick so that 'x' rows will be grouped into a category >-0.1 and <= -0.05.  This will allow all of col11 to be treated as a numbers

bins = np.arange(-0.1, 1.0, 0.05) #bins to put col11 values in.  >-0.1 and <=-0.05 will be our special 'x' rows, >-0.05 and <=0 will capture all the '0' values.
labels = np.array(['%s:%s' % (x, y) for x, y in zip(bins[:-1], bins[1:])]) #create labels for the bins
labels[0] = 'x' #change first bin label to 'x'
labels[1] = '0' #change second bin label to '0'

df['col11'] = df['col11'].astype(float) #convert col11 to numbers so we can do math on them


df['bin'] = pd.cut(df['col11'], bins=bins, labels=False) # make another column 'bins' and put in an integer representing what bin the number falls into.Later we'll map the integer to the bin label


df.set_index('bin', inplace=True, drop=False, append=False) #groupby is meant to run faster with an index

def count_ones(x):
    """aggregate function to count values that equal 1"""
    return np.sum(x==1)

dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]}) # groupby the bin number and apply aggregate functions to specified column.
dfg.index = labels[dfg.index]# apply labels to bin numbers

dfg.ix['x',('col11', 'mean')]='N/A' #mean of 'x' rows is meaningless
print(dfg)
dfg.to_csv('new.csv')

这给了我

                col7           col11
          count_ones  len       mean
x                  1    7        N/A
0                  2   21          0
0.15:0.2           2    2        0.2
0.2:0.25           9   22  0.2478632
0.25:0.3           0   13  0.2840755
0.3:0.35           0    5  0.3333333
0.45:0.5           0    4        0.5

【讨论】:

  • 嘿 rtrwalker 这段代码真的很棒。我想知道您是否可以进一步评论。到目前为止,我还没有使用过 pandas,并且一直在尽力操作并添加到此代码中。但无济于事。我一直在 youtube 上观看视频,试图掌握它。我认为是binszip 的使用真的让我很震惊。
  • 虽然我已经获得了赏金,但如果我能更好地理解它,我可能会将其标记为已接受的答案。我最挣扎的线路是dfg = df[['bin','col7','col11']].groupby('bin').agg({'col11': [np.mean], 'col7': [count_ones, len]})
  • 事实上,如果可以做到这一点,我很乐意为您分配相同的 50 分赏金!
  • 显然,一个问题的第二次赏金必须是两倍,所以它将是 100 分赏金
  • 我已经提出了一个完整的问题,奖励 100 分,致力于尝试理解您提供的熊猫代码。见这里:stackoverflow.com/questions/19748000/…
【解决方案3】:

此解决方案使用numpy.histogram。见下文。

import csv
import numpy


def count(infile='data.csv', outfile='new.csv'):
    total_x = 0
    col7one_x = 0
    total_zeros = 0
    col7one_zeros = 0
    all_array = []
    col7one_array = []
    with open(infile, 'r') as fobj:
        reader = csv.reader(fobj)
        for line in reader:
            if line[10] == 'x':
                total_x += 1
                if line[6] == '1':
                    col7one_x += 1
            elif line[10] == '0':
                # assumes zero is represented as "0" and not as say, "0.0"
                total_zeros += 1
                if line[6] == '1':
                    col7one_zeros += 1
            else:
                val = float(line[10])
                all_array.append(val)
                if line[6] == '1':
                    col7one_array.append(val)

    bins = numpy.arange(0, 1.05, 0.05)
    hist_all, edges = numpy.histogram(all_array, bins=bins)
    hist_col7one, edges = numpy.histogram(col7one_array, bins=bins)

    with open(outfile, 'w') as fobj:
        writer = csv.writer(fobj)
        writer.writerow([col7one_x, total_x])
        writer.writerow([col7one_zeros, total_zeros])
        for row in zip(hist_col7one, hist_all):
            writer.writerow(row)


if __name__ == '__main__':
    count()

【讨论】:

  • 嘿,不错的答案,有机会得到这个评论吗?非常感谢:)
  • 您好,答案当然有效,但是输出在新行之间有空格。
  • 为问题添加了更多细节和奖励。
  • 请有机会获得两个脚本的评论,非常感谢 AEA
猜你喜欢
  • 1970-01-01
  • 1970-01-01
  • 2021-10-27
  • 2011-02-05
  • 2013-05-29
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
相关资源
最近更新 更多