Calculating means from CSV with Python's numpy: answers

【Question title】: Calculating means from csv with python's numpy
【Posted】: 2014-11-04 11:39:39
【Question description】:

I have a 10GB file (it does not fit in RAM) in the following format:

Col1,Col2,Col3,Col4
1,2,3,4
34,256,348,
12,,3,4

So we have columns and missing values, and I want to calculate the means of columns 2 and 3. With plain python I would do something like this:

def means(rng):
    s, e = rng

    with open("data.csv") as fd:
        title = next(fd)
        titles = title.rstrip("\n").split(',')
        print("Means for", ",".join(titles[s:e]))

        ret = [0] * (e-s)
        for c, l in enumerate(fd):
            vals = l.split(",")[s:e]
            for i, v in enumerate(vals):
                try:
                    ret[i] += int(v)
                except ValueError:
                    # missing value: contributes nothing to the sum
                    pass

        # divide by the number of rows read (missing values count as 0)
        return [float(total) / (c + 1) for total in ret]

But I suspect there is a faster way to do this with numpy (I am still a novice at it).

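【Comments】:

Do you want the sum divided by the number of rows, or by the number of non-missing values (for each column)?
It doesn't really matter. Missing values should be no more than 1% of a column, and I am not interested in that much accuracy. Whichever is easier.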
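
To make the two denominators concrete, here is a minimal sketch using the three sample rows from the question. The helper names `mean_over_rows` and `mean_over_present` are made up for illustration, and `None` stands in for a missing field.

```python
# Values of Col2 and Col3 from the three sample rows; None marks a missing field.
col2 = [2, 256, None]
col3 = [3, 348, 3]

def mean_over_rows(values):
    """Treat missing entries as 0 and divide by the total number of rows."""
    return sum(v or 0 for v in values) / len(values)

def mean_over_present(values):
    """Skip missing entries and divide by the number of values present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

print(mean_over_rows(col2), mean_over_present(col2))  # 86.0 129.0
print(mean_over_rows(col3), mean_over_present(col3))  # 118.0 118.0
```

On three rows the gap is large, but with under 1% missing values the two results should be close, which is why either choice is acceptable here.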

Tags: python csv numpy mean

【Solution 1】:

Pandas is your best friend:

from pandas import read_csv

# Load 10000 rows at a time, you can play with this number to get better
# performance on your machine
my_data = read_csv("data.csv", chunksize=10000)

total = 0
count = 0

for chunk in my_data:
    # If you want to exclude NAs from the average, remove the next line
    chunk = chunk.fillna(0.0)

    total += chunk.sum(skipna=True)
    count += chunk.count()

avg = total / count

col1_avg = avg["Col1"]
# ... etc. ...
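
Since the question only needs columns 2 and 3, the same chunked approach can be restricted to those columns. The following is a rough sketch, not the answer author's code: `chunked_means` is a hypothetical helper, the temporary file only exists to make the example self-contained, and `chunksize=2` is chosen just to force more than one chunk on the sample data. It divides by the number of non-missing values.

```python
import os
import tempfile

import pandas as pd

def chunked_means(path, columns, chunksize=10000):
    """Mean of the given columns, reading the CSV chunk by chunk."""
    total = pd.Series(0.0, index=columns)
    count = pd.Series(0, index=columns)
    for chunk in pd.read_csv(path, usecols=columns, chunksize=chunksize):
        total = total + chunk[columns].sum()    # sum() skips NaN by default
        count = count + chunk[columns].count()  # count() counts non-NaN values
    return total / count

# Write the sample rows from the question to a temporary file.
sample = "Col1,Col2,Col3,Col4\n1,2,3,4\n34,256,348,\n12,,3,4\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
try:
    means = chunked_means(tmp.name, ["Col2", "Col3"], chunksize=2)
finally:
    os.remove(tmp.name)

print(means)
```

For the sample rows this should give 129.0 for Col2 and 118.0 for Col3. Passing `usecols` means the unused columns are not kept in each chunk, which reduces memory per chunk.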

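【Comments】:

Sorry, I forgot to mention that the file is 10G and my RAM is much smaller than that.
You will have to adjust your average calculation a bit, but the parsing can be chunked: pandas.pydata.org/pandas-docs/stable/io.html#io-chunking
You will need a further adjustment to handle missing values. (Assuming I understood the OP's expectations correctly.)
True, this does not handle missing values, but +1 for pointing out the chunking feature in pandas, which I did not know about.
OK, the version is now fixed to handle missing values.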

【Solution 2】:

Try:

import numpy
# read from csv into record array
dat = numpy.genfromtxt('test.csv',delimiter=',', usecols=(1,2), skip_header=1, usemask=True)
# calc means on columns
ans = numpy.mean(dat, axis=0)

ans.data will contain an array of the means of all the selected columns.
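
As a small illustration of how the mask affects the result, here is a sketch that runs the same call on the three sample rows from the question. It reads from an in-memory `io.StringIO` buffer instead of `test.csv`, which is an assumption made only to keep the example self-contained.

```python
import io

import numpy as np

# The sample rows from the question, held in memory instead of a file.
sample = io.StringIO("Col1,Col2,Col3,Col4\n1,2,3,4\n34,256,348,\n12,,3,4\n")

# usemask=True returns a masked array in which empty fields are masked out.
dat = np.genfromtxt(sample, delimiter=",", usecols=(1, 2),
                    skip_header=1, usemask=True)

# The mean of a masked array ignores masked entries, so each column is
# divided by its own number of non-missing values.
ans = np.mean(dat, axis=0)

print(dat.count(axis=0))  # non-missing values per column
print(ans.data)           # column means as a plain ndarray
```

Here the empty Col2 field in the last row is masked, so the Col2 mean is taken over two values rather than three.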

Edit for the updated question

If you have a 10G file, you can chunk it with numpy as well. See this answer.

Something like this:

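import itertools
import numpy

sums = numpy.zeros(2)
counts = numpy.zeros(2)
with open('test.csv') as fH:
    fH.readline() # skip header
    while True:
        # read up to 1000 lines; an empty list means end of file
        lines = list(itertools.islice(fH, 1000))
        if not lines:
            break
        df = numpy.genfromtxt(lines, delimiter=',', usecols=(1,2), usemask=True)
        df = numpy.ma.atleast_2d(df)  # a one-line chunk parses as 1-D
        sums = sums + df.sum(axis=0).filled(0)  # masked (missing) values are skipped
        counts = counts + df.count(axis=0)      # non-missing values per column
means = sums / counts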

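【Comments】:

:P Found that out the hard way. I thought genfromtxt meant "generator from text" and froze my laptop.
@fakedrake, haha, sorry. I was having so much fun playing with the options in genfromtxt that I missed your update...
I tried this: ` with open('test.csv') as f: dat = np.genfromtxt(iter(f), delimiter=',', skip_header=1, usecols=range(s,e), usemask=True) ret = np.mean(dat, axis=0).data` as your link suggests, and it keeps accumulating RAM (I killed it at about 1G). Any idea why? (Edit: sorry about the formatting)
@fakedrake, see the edit above, that should get you pretty close.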