【Posted】: 2015-07-26 13:31:13
【Question】:
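I am digesting several csv files (each holding one or more years of data) to classify medical treatments into broad categories, while keeping only a subset of the original information and aggregating to the monthly number of treatments (by AR = year, and month) per person (LopNr). Many treatments belong to several categories at once: multiple diagnosis codes are listed in the relevant csv column, so I split that field into a column of lists and classify each row by whether any of its diagnosis codes falls in the relevant ICD-9 range.
I am using IOPro to save memory, but I still run into a segfault (still investigating). Each text file is several GB, but the machine has 256 GB of RAM. Either one of the packages has a problem, or I need a more memory-efficient solution.
I am using pandas 0.16.2 np19py26_0, iopro 1.7.1 np19py27_p0 and python 2.7.10 0 under Linux.
So the raw data looks like this: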
LopNr AR INDATUMA DIAGNOS …
1 2007 20070812 C32 F17
1 2007 20070816 C36
and I would like to see an aggregate like this:
LopNr AR month tobacco …
1 2007 8 2
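As a worked check of the example above, here is a minimal pure-Python sketch that turns the two sample rows into the single aggregated row. The helper names `month_of` and `is_tobacco` are hypothetical, and the code ranges are copied from the `tobacco` lambda in the question's code; this only illustrates the intended arithmetic, it is not the pandas pipeline itself.

```python
from collections import Counter

def month_of(indatuma):
    """Extract the month from a YYYYMMDD integer such as 20070812."""
    return (indatuma // 100) % 100

def is_tobacco(codes):
    """True if any code lies in C30..C39 or starts with F17 (plain string comparison)."""
    return any('C30' <= c < 'C40' or 'F17' <= c < 'F18' for c in codes)

# The two sample rows from the question: (LopNr, AR, INDATUMA, DIAGNOS).
rows = [
    (1, 2007, 20070812, 'C32 F17'),
    (1, 2007, 20070816, 'C36'),
]

tobacco_counts = Counter()
for lopnr, ar, indatuma, diagnos in rows:
    # A treatment is counted once, even if several of its codes match.
    if is_tobacco(diagnos.split()):
        tobacco_counts[(lopnr, ar, month_of(indatuma))] += 1

print(tobacco_counts)
```

Both treatments fall in August 2007 and both carry a code in the tobacco ranges, which gives the count of 2 shown in the desired output.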
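By the way, I ultimately need Stata dta files, but I go through csv because pandas.DataFrame.to_stata has seemed unreliable in my experience, though maybe I am missing something there as well.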
# -*- coding: utf-8 -*-
import iopro
import numpy as np
from pandas import *
all_treatments = DataFrame()
filelist = ['oppenvard20012005','oppenvard20062010','oppenvard2011','oppenvard2012','slutenvard1997','slutenvard2011','slutenvard2012','slutenvard19982004','slutenvard20052010']
tobacco = lambda lst: any( (((x >= 'C30') and (x<'C40')) or ((x >= 'F17') and (x<'F18'))) for x in lst)
nutrition = lambda lst: any( (((x >= 'D50') and (x<'D54')) or ((x >= 'E10') and (x<'E15')) or ((x >= 'E40') and (x<'E47')) or ((x >= 'E50') and (x<'E69'))) for x in lst)
mental = lambda lst: any( (((x >= 'F') and (x<'G')) ) for x in lst)
alcohol = lambda lst: any( (((x >= 'F10') and (x<'F11')) or ((x >= 'K70') and (x<'K71'))) for x in lst)
circulatory = lambda lst: any( (((x >= 'I') and (x<'J')) ) for x in lst)
dental = lambda lst: any( (((x >= 'K02') and (x<'K04')) ) for x in lst)
accident = lambda lst: any( (((x >= 'V01') and (x<'X60')) ) for x in lst)
selfharm = lambda lst: any( (((x >= 'X60') and (x<'X85')) ) for x in lst)
cancer = lambda lst: any( (((x >= 'C') and (x<'D')) ) for x in lst)
endonutrimetab = lambda lst: any( (((x >= 'E') and (x<'F')) ) for x in lst)
pregnancy = lambda lst: any( (((x >= 'O') and (x<'P')) ) for x in lst)
other_stress = lambda lst: any( (((x >= 'J00') and (x<'J48')) or ((x >= 'L20') and (x<'L66')) or ((x >= 'K20') and (x<'K60')) or ((x >= 'R') and (x<'S')) or ((x >= 'X86') and (x<'Z77'))) for x in lst)
for file in filelist:
    filename = 'PATH' + file +'.txt'
    adapter = iopro.text_adapter(filename,parser='csv',field_names=True,output='dataframe',delimiter='\t')
    treatments = adapter[['LopNr','AR','DIAGNOS','INDATUMA']][:]
    # INDATUMA is YYYYMMDD: take MMDD, subtract DD, divide by 100 to get the month
    treatments['month'] = treatments['INDATUMA'] % 10000
    treatments['day'] = treatments['INDATUMA'] % 100
    treatments['month'] = (treatments['month']-treatments['day'])/100
    del treatments['day']
    # split the space-separated diagnosis codes into one list per row
    diagnoses = treatments['DIAGNOS'].str.split(' ')
    del treatments['DIAGNOS']
    # one boolean column per category
    treatments['tobacco'] = diagnoses.map(tobacco)
    treatments['nutrition'] = diagnoses.map(nutrition)
    treatments['mental'] = diagnoses.map(mental)
    treatments['alcohol'] = diagnoses.map(alcohol)
    treatments['circulatory'] = diagnoses.map(circulatory)
    treatments['dental'] = diagnoses.map(dental)
    treatments['accident'] = diagnoses.map(accident)
    treatments['selfharm'] = diagnoses.map(selfharm)
    treatments['cancer'] = diagnoses.map(cancer)
    treatments['endonutrimetab'] = diagnoses.map(endonutrimetab)
    treatments['pregnancy'] = diagnoses.map(pregnancy)
    treatments['other_stress'] = diagnoses.map(other_stress)
    all_treatments = all_treatments.append(treatments)
# count the treatments per person, year and month in each category
all_treatments = all_treatments.groupby(['LopNr','AR','month']).aggregate(np.count_nonzero) #.sum()
all_treatments = all_treatments.astype(int,copy=False,raise_on_error=False)
all_treatments.to_csv('PATH.csv')
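The loop above keeps every row of every file in `all_treatments` before grouping. One possible lower-memory alternative, sketched here with the standard library only (so it swaps out pandas and IOPro rather than fixing them), is to stream each file row by row and update running counts, so that memory grows with the number of distinct (LopNr, AR, month) groups rather than with the number of rows. The `aggregate` function, the `CATEGORIES` table and the in-memory sample files are hypothetical; only two of the question's categories are shown, and the column names are assumed to match the sample data.

```python
import csv
import io
from collections import defaultdict

# Only two of the question's categories are shown; the rest follow the same pattern.
CATEGORIES = {
    'tobacco': lambda c: 'C30' <= c < 'C40' or 'F17' <= c < 'F18',
    'mental': lambda c: c.startswith('F'),
}
NAMES = sorted(CATEGORIES)  # fixed column order for the counters

def aggregate(handles):
    """Stream tab-separated files and count treatments per (LopNr, AR, month)."""
    counts = defaultdict(lambda: [0] * len(NAMES))
    for handle in handles:
        for row in csv.DictReader(handle, delimiter='\t'):
            month = (int(row['INDATUMA']) // 100) % 100
            key = (int(row['LopNr']), int(row['AR']), month)
            codes = row['DIAGNOS'].split()
            for i, name in enumerate(NAMES):
                # A treatment counts once per category, however many codes match.
                if any(CATEGORIES[name](c) for c in codes):
                    counts[key][i] += 1
    return counts

# Hypothetical in-memory stand-ins for two of the real files.
file_a = io.StringIO("LopNr\tAR\tINDATUMA\tDIAGNOS\n1\t2007\t20070812\tC32 F17\n")
file_b = io.StringIO("LopNr\tAR\tINDATUMA\tDIAGNOS\n1\t2007\t20070816\tC36\n")

result = aggregate([file_a, file_b])
for key in sorted(result):
    print(key, dict(zip(NAMES, result[key])))
```

With real data the handles would come from `open(...)` on each file. The same idea should carry over to pandas by aggregating each file on its own and then summing the per-file results, instead of appending raw rows.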
【Comments】:
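- You are doing a lot of range checks in your functions. You can simplify things like (x >= 'C30') and (x < 'C40') to ('C30' <= x < 'C40').
- Also, things like ((x >= 'O') and (x < 'P')) can be simplified to x.startswith('O').
- Note that I was able to avoid the segfault by not using IOPro. Still, all the other improvements in the answer greatly improved the code.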
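The two simplifications suggested in the comments can be checked side by side. The sketch below compares the original form of two of the question's classifiers with the chained-comparison and `startswith` forms; the function names and the sample codes beyond those in the question are made up for illustration.

```python
def tobacco_original(codes):
    # Form used in the question: explicit pairs of comparisons.
    return any((((x >= 'C30') and (x < 'C40')) or ((x >= 'F17') and (x < 'F18'))) for x in codes)

def tobacco_chained(codes):
    # Same test written with Python's chained comparisons.
    return any('C30' <= x < 'C40' or 'F17' <= x < 'F18' for x in codes)

def pregnancy_original(codes):
    return any(((x >= 'O') and (x < 'P')) for x in codes)

def pregnancy_prefix(codes):
    # A single-letter range is the same as a prefix test.
    return any(x.startswith('O') for x in codes)

# Hypothetical sample inputs, including an empty code list.
samples = [['C32', 'F17'], ['C36'], ['O80'], ['P07'], ['F179'], []]
for codes in samples:
    assert tobacco_original(codes) == tobacco_chained(codes)
    assert pregnancy_original(codes) == pregnancy_prefix(codes)
print("both forms agree on all samples")
```

Neither rewrite changes behavior; they only make the range checks shorter and easier to read.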