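【Posted】: 2018-03-26 20:26:37
【Question】:
I have a data frame with 350 million rows and 3 columns.
Requirement:
I want to split the DESCRIPTION column into a list on the pipe character (|) while using as little memory as possible.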
input_df.head():
startTime DESCRIPTION Response_Time
1504212340 Business Transaction Performance|Business Transactions|Hexa|mBanking Confirmation.(Confirmation.aspx).no|Average Response Time (ms)_value 6
1504212340 Business Transaction Performance|Business Transactions|Hexa|mBanking Frontpage.ci|Average Response Time (ms)_value 4
1504202341 Business Transaction Performance|Business Transactions|Hexa|mBanking Fonto KTList GenericNS.(GenericNS).dk|Average Response Time (ms)_value 5
1504202341 Business Transaction Performance|Business Transactions|Hexa|mBanking Transaction Overview.co|Average Response Time (ms)_value 5
1504202342 Business Transaction Performance|Business Transactions|Hexa|mBanking Logon.(BidError.aspx).no|Average Response Time (ms)_value 8
desired_output:
startTime list_Description Response_Time
1504212340 ['Business Transaction Performance', 'Business Transactions', 'Hexa', 'mBanking Confirmation.(Confirmation.aspx).no', 'Average Response Time (ms)_value'] 6
1504212340 ['Business Transaction Performance', 'Business Transactions', 'Hexa', 'mBanking Frontpage.ci', 'Average Response Time (ms)_value'] 4
1504202341 ['Business Transaction Performance', 'Business Transactions', 'Hexa', 'mBanking Fonto KTList GenericNS.(GenericNS).dk', 'Average Response Time (ms)_value'] 5
1504202341 ['Business Transaction Performance', 'Business Transactions', 'Hexa', 'mBanking Transaction Overview.co', 'Average Response Time (ms)_value'] 5
1504202342 ['Business Transaction Performance', 'Business Transactions', 'Hexa', 'mBanking Logon.(BidError.aspx).no', 'Average Response Time (ms)_value'] 8
My code:
import glob
import pandas as pd
path = r'C:/Users/IBM_ADMIN/Desktop/Delete/Source/app_dynamics/*'  # 500 CSV files in this location
all_files = glob.glob(path)
# Read the input files and concatenate them.
# Parentheses create a generator instead of a list of data frames.
# dtype is given per numeric column: dtype=float for the whole file would fail on the text column DESCRIPTION.
generator = (pd.read_csv(f, delimiter='\t', dtype={'startTime': 'int64', 'Response_Time': 'float64'}) for f in all_files)
input_df = pd.concat(generator, ignore_index=True)  # results in 350 million rows, 3 columns
input_df['list_Description'] = input_df['DESCRIPTION'].str.split('|')  # split each string into a list
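Drawback of my code
The code above works fine when the data frame has fewer rows. When I apply it to 350 million rows, however, memory usage immediately reaches 98% and the system hangs.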
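To see where the memory goes, the footprint of a column can be measured with `Series.memory_usage(deep=True)`. The sketch below uses a small made-up frame (the name `sample` and the two description strings are assumptions for illustration, not the real data) and compares the column as plain strings, as a `category`, and after `str.split`. Note that for a column of lists the deep measurement counts the list objects but not the strings inside them, so the real cost of the split column is higher than reported.

```python
import pandas as pd

# Hypothetical miniature of input_df: two distinct strings repeated many times.
descriptions = [
    "Business Transaction Performance|Business Transactions|Hexa|"
    "mBanking Frontpage.ci|Average Response Time (ms)_value",
    "Business Transaction Performance|Business Transactions|Hexa|"
    "mBanking Transaction Overview.co|Average Response Time (ms)_value",
]
sample = pd.DataFrame({"DESCRIPTION": descriptions * 50_000})

# Bytes used by the column as ordinary strings.
as_strings = sample["DESCRIPTION"].memory_usage(deep=True)

# Bytes used when each distinct string is stored once (category dtype).
as_category = sample["DESCRIPTION"].astype("category").memory_usage(deep=True)

# Bytes used by the list objects created by str.split
# (the strings inside each list are not included in this figure).
as_lists = sample["DESCRIPTION"].str.split("|").memory_usage(deep=True)

print(f"strings:  {as_strings:>12,d} bytes")
print(f"category: {as_category:>12,d} bytes")
print(f"lists:    {as_lists:>12,d} bytes (lower bound)")
```

Scaling such per-row figures up to 350 million rows gives a rough idea of whether a given representation can fit in RAM at all.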
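A CSV might help... but
If I had 'input_df' in a CSV file, I could process it in chunks (by the way, I do not want to write 'input_df' to a CSV in this case :-)). Since 'input_df' above is a data frame, I do not know where to start. It would be nice if there were a way to use a chunk size directly on a data frame.
Can anyone suggest a better idea to avoid the memory problem?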
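One way to get chunk-like behaviour on an in-memory data frame is to slice it by row position. The following is a minimal sketch, not a built-in pandas feature: `iter_chunks` is a hypothetical helper, and `demo` is a made-up stand-in for `input_df`. Chunking only saves memory if each chunk's split result is reduced or written out before the next chunk is processed; collecting all the split chunks again would use as much memory as splitting everything at once.

```python
import pandas as pd

def iter_chunks(df, chunk_size):
    """Yield consecutive row slices of df, each at most chunk_size rows long."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

# Made-up stand-in for input_df.
demo = pd.DataFrame({
    "startTime": [1504212340, 1504212340, 1504202341],
    "DESCRIPTION": ["a|b|c", "a|b|d", "a|e|c"],
    "Response_Time": [6, 4, 5],
})

# Example reduction: count how often each final component occurs.
totals = {}
for chunk in iter_chunks(demo, chunk_size=2):
    # The split lists exist only for this chunk and can be freed afterwards.
    parts = chunk["DESCRIPTION"].str.split("|")
    for last in parts.str[-1]:
        totals[last] = totals.get(last, 0) + 1

print(totals)
```

The same loop body could equally be applied to each of the 500 files as it is read, which would avoid holding the concatenated frame in memory in the first place.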
【Comments】:
- This is a long shot, but how many unique values are there in your column? Can you post the output of: input_df.DESCRIPTION.nunique()
- input_df.DESCRIPTION.nunique() Out[43]: 3445
Tags: python pandas dataframe memory-management chunks
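With only 3445 distinct values among 350 million rows, one possible approach is to split each distinct string once and let every row refer to the shared result. The sketch below is an illustration under assumptions, not a tested solution for the full data set: `demo` is a made-up frame, the column is assumed to contain no missing values (`pd.factorize` marks those with code -1), and the lists are shared between rows, so modifying one list in place would change every row with the same description.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for input_df with repeated descriptions.
demo = pd.DataFrame({
    "DESCRIPTION": ["a|b|c", "a|b|d", "a|b|c", "a|b|c"],
    "Response_Time": [6, 4, 5, 8],
})

# codes[i] is the position of row i's string in uniques.
codes, uniques = pd.factorize(demo["DESCRIPTION"])

# Split each distinct string exactly once (3445 splits in the question, 2 here).
lookup = np.empty(len(uniques), dtype=object)
for position, value in enumerate(uniques):
    lookup[position] = value.split("|")

# Fancy indexing copies references, so equal rows share one list object.
demo["list_Description"] = lookup[codes]

print(demo)
```

Each row then costs one pointer for the list column instead of a separate list plus five separate strings.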