Need advice on speeding up Python code for data cleaning

【Question title】: Need advice on speeding up the python code on data cleaning
【Posted】: 2017-06-18 23:29:56
【Question】:

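I am running a side data analysis project in a Python notebook (Jupyter). The dataset has about 1.3 million rows, and the first thing I need to do is extract the day, month, and year from the "date" column. The code I wrote works fine, except that it takes a very long time. I estimate that the whole data processing step could take an hour and a half. Could anyone suggest changes to my code that would speed it up?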

import pandas as pd
from datetime import datetime

def date_split(calendar):
    i=0
    calendar_total=pd.DataFrame()
    num=calendar.shape[0]-1  # index of the last row
    while i<=10000:  # hard-coded limit; num is the last row index of the full dataset

        new_calendar={}  # start a fresh dict for every row
        tem=calendar.iloc[i,1]
        #extract year&month&day from day column
        listdate=datetime.strptime(tem,'%Y-%m-%d')
        new_calendar['Year']=listdate.year
        new_calendar['Month']=listdate.month
        new_calendar['Date']=listdate.day
        # add the other columns
        new_calendar['listId']=calendar.iloc[i,0]
        new_calendar['available']=calendar.iloc[i,2]
        new_calendar['price']=calendar.iloc[i,3]
        #change new_calendar data type from dict to a one-row pd dataframe
        new_row=pd.DataFrame(new_calendar,index=[i])
        # note: DataFrame.append was removed in pandas 2.0 (pd.concat replaces it)
        calendar_total=calendar_total.append(new_row)
        i=i+1

    return calendar_total

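Again, the goal is to extract the year/month/day from the "day" column and put them into a new data frame. Would running the code locally in Python also speed things up significantly?

Thanks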

【Comments】:

Have you identified any specific bottlenecks in your code, for example by profiling it?
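One way to look for such a bottleneck is to profile a scaled-down run. The sketch below is not from the thread: `slow_split` and the 500-row sample are hypothetical stand-ins for the asker's function and data, and `pd.concat` stands in for the per-row `append`. It uses the standard-library `cProfile` and `pstats` modules to list where the cumulative time goes.

```python
import cProfile
import io
import pstats

import pandas as pd


def slow_split(calendar):
    """Hypothetical row-by-row version, kept small so the profile finishes quickly."""
    total = pd.DataFrame()
    for i in range(len(calendar)):
        # Build a one-row frame and glue it onto everything collected so far.
        row = pd.DataFrame({"date": [calendar.iloc[i, 0]]}, index=[i])
        total = pd.concat([total, row])
    return total


# Hypothetical sample data: 500 identical date strings.
calendar = pd.DataFrame({"date": ["2007-10-09"] * 500})

profiler = cProfile.Profile()
profiler.enable()
result = slow_split(calendar)
profiler.disable()

# Collect the ten entries with the largest cumulative time into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Reading the report from the top down shows which calls inside the loop account for most of the run time, which is usually enough to decide what to replace first.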

Tags: python algorithm pandas data-analysis data-science

【Solution 1】:

This is how I would extract the year, month, and day from an existing data frame into a new one:

import pandas as pd

# example frame with a datetime column: 20 years of daily dates
df = pd.DataFrame({'date' : pd.date_range("19970202", periods=365*20)})

# the .dt accessor extracts each component for the whole column at once
df2 = pd.DataFrame({'year' : df['date'].dt.year, 'month' : df['date'].dt.month, 'day' : df['date'].dt.day})

print(df)
print(df2)

I have not tested this on a large dataset (1.3 million rows?), but it may well speed things up.
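A rough way to check this at the size mentioned in the question is to time the same extraction on synthetic data. This is only a sketch under assumptions: the 1.3 million timestamps are generated at one-minute steps (daily steps over that many rows would exceed the range pandas timestamps can represent), and the measured time depends on the machine.

```python
import time

import pandas as pd

n_rows = 1_300_000  # row count taken from the question

# Synthetic datetime column; one-minute steps keep the values in range.
df = pd.DataFrame({"date": pd.date_range("1997-02-02", periods=n_rows, freq="min")})

start = time.perf_counter()
df2 = pd.DataFrame({
    "year": df["date"].dt.year,
    "month": df["date"].dt.month,
    "day": df["date"].dt.day,
})
elapsed = time.perf_counter() - start

print(f"extracted year/month/day for {n_rows} rows in {elapsed:.3f} s")
```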

【Discussion】:

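Hi Johannesmik, thanks for your solution. However, in my case I first need to parse the raw data into datetime-like objects before I can use your method. The dates in my data frame look like "2007-10-09", and I am using the strptime function to parse them; Python takes a long time to get through 1.3 million rows. Anyway, feel free to share your thoughts. Thanks for your insight, it is really helpful :)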
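What takes so long in your code is the one million appends. You can also use pd.to_datetime to convert a column of strings (like "2007-10-09") into a datetime column. For example, you can create a temporary data frame that holds the datetime values like this: df2 = pd.DataFrame({'A' : df['A'], 'B' : pd.to_datetime(df['B']), 'c' : df['C']}) (where df['B'] holds the date-formatted strings). You can then use code similar to my answer to create a data frame that holds the year, month, day, etc.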
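To illustrate why the appends dominate: each one copies everything accumulated so far, so the total work grows roughly with the square of the row count. If a Python loop is unavoidable, a common alternative is to collect plain dicts in a list and build the data frame once at the end. The sketch below assumes a hypothetical three-row sample with the column order used in the question (listing id, date string, available, price); the column names in the sample are made up.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample in the question's column order.
calendar = pd.DataFrame({
    "listing_id": [101, 101, 202],
    "date": ["2007-10-09", "2007-10-10", "2008-01-31"],
    "available": ["t", "f", "t"],
    "price": ["$85.00", None, "$120.00"],
})

rows = []
# itertuples(index=False, name=None) yields one plain tuple per row.
for list_id, day, available, price in calendar.itertuples(index=False, name=None):
    parsed = datetime.strptime(day, "%Y-%m-%d")
    rows.append({
        "Year": parsed.year,
        "Month": parsed.month,
        "Date": parsed.day,
        "listId": list_id,
        "available": available,
        "price": price,
    })

# One DataFrame construction instead of one append per row.
calendar_total = pd.DataFrame(rows)
print(calendar_total)
```

This still parses each date in Python, so it is slower than a vectorized conversion, but it removes the repeated copying.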
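Indeed, I believe the appends are what take so long. I later found that the pandas built-in function to_datetime is really helpful. I used the following code to extract the year values: split['year']=pd.to_datetime(calendar_data['date']).dt.year. It gave me the result in less than 30 seconds. Anyway, thank you very much for pointing out the bottleneck in my code :)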
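Putting the pieces of this thread together, a vectorized version of `date_split` might look like the sketch below. It is an illustration rather than the asker's final code: the sample data and its column names (`listing_id`, `date`, `available`, `price`) are assumptions, and the explicit `format` argument simply states the layout of the date strings shown in the thread. The dates are parsed once and the result is reused for all three fields.

```python
import pandas as pd


def date_split(calendar):
    """Return a new frame with Year/Month/Date plus the other columns."""
    # Parse the whole string column in one call instead of row by row.
    parsed = pd.to_datetime(calendar["date"], format="%Y-%m-%d")
    return pd.DataFrame({
        "Year": parsed.dt.year,
        "Month": parsed.dt.month,
        "Date": parsed.dt.day,
        "listId": calendar["listing_id"],
        "available": calendar["available"],
        "price": calendar["price"],
    })


# Hypothetical sample; the column names are assumptions for illustration.
calendar_data = pd.DataFrame({
    "listing_id": [101, 101, 202],
    "date": ["2007-10-09", "2007-10-10", "2008-01-31"],
    "available": ["t", "f", "t"],
    "price": ["$85.00", None, "$120.00"],
})

split = date_split(calendar_data)
print(split)
```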