【问题标题】:Data wrangling into a time-series format数据整理成时间序列格式
【发布时间】:2021-04-13 12:57:03
【问题描述】:

在这里你会找到示例 csv 文件,下面是我使用的 python 代码。

公司、州、产品、年份、1 月、2 月、3 月、4 月、5 月、6 月、7 月、8 月、9 月、10 月、11 月、12 月 Fred,NJ,A,2017,111,7,69,152,192,218,246,98,329,59,191,43 Fred,NJ,A,2018,391,317,241,115,40,300,295,288,89,178,73,110 Fred,NJ,A,2019,345,30,271,78,10,45,185,349,237,70,197,389 George,AZ,B,2017,260,55,258,389,218,7,263,241,191,398,266,42 George,AZ,B,2018,260,10,256,399,216,356,164,10,256,399,231,287 George,AZ,B,2019,173,10,360,45,16,297,25,1,134,35,294,193 杰米,纽约,C,2017,360,18,127,96,175,63,40,194,118,332,128,26 杰米,纽约,C,2018,70,175,301,259,160,89,3,314,115,115,117,281 杰米,纽约,C,2019,259,11,22,122,191,7,123,378,14,327,95,325 芭芭拉,UT,D,2017,338,,278,176,80,338,268,52,240,383,217,225 芭芭拉,UT,D,2018,316,242,129,158,238,97,234,266,62,223,287,21 芭芭拉,UT,D,2019,378,130,106,70,92,77,389,189,140,​​126,193,188 GETTY,CA,E,2017,50,5,77,47,8,169,163,58,324,186,208,57 GETTY,CA,E,2018,53,92,177,111,11,238,96,85,129,396,75,84 GETTY,CA,E,2019,355,,254,303,195,353,141,168,97,30,25,310 芭芭拉,UT,F,2017,167,192,37,53,390,357,340,48,389,202,92,51 芭芭拉,UT,F,2018,177,265,359,350,318,359,359,93,393,251,51,255 芭芭拉,UT,F,2019,35,302,234,306,156,13,122,354,0,214,81,86 布鲁斯,OR,A,2017,146,231,246,246,150,82,302,249,225,213,394,140 布鲁斯,OR,A,2018,244,385,36,86,270,381,309,98,163,321,337,139 布鲁斯,OR,A,2019,183,331,337,188,235,308,10,130,143,31,193,4 GETTY,CA,B,2017,152,273,60,162,68,53,334,204,238,262,220,223 GETTY,CA,B,2018,161,221,286,80,182,261,387,208,165,303,220,101 GETTY,CA,B,2019,310,218,332,349,144,3,2,82,132,199,375,257

鉴于每个客户每年都被记录在单独的记录中,并且 1 月至 12 月。我首先创建了分散数据的代码。我遇到问题的部分是将新数据放入这样的时间序列布局中。

日期、单位、公司、州、产品

我正在考虑使用 pandas merk 或 stack

让我知道你的想法。

Python 代码

import pandas as pd

df = pd.read_csv ('Product.csv')
print (df)

df.loc[(df['Year']==2017), '01-01-2017'] =  (df['Jan'])
df.loc[(df['Year']==2017), '02-01-2017'] =  (df['Feb'])
df.loc[(df['Year']==2017), '03-01-2017'] =  (df['Mar'])
df.loc[(df['Year']==2017), '04-01-2017'] =  (df['Apr'])
df.loc[(df['Year']==2017), '05-01-2017'] =  (df['May'])
df.loc[(df['Year']==2017), '06-01-2017'] =  (df['Jun'])      
df.loc[(df['Year']==2017), '07-01-2017'] =  (df['Jul'])
df.loc[(df['Year']==2017), '08-01-2017'] =  (df['Aug'])
df.loc[(df['Year']==2017), '09-01-2017'] =  (df['Sep'])
df.loc[(df['Year']==2017), '10-01-2017'] =  (df['Oct'])
df.loc[(df['Year']==2017), '11-01-2017'] =  (df['Nov'])
df.loc[(df['Year']==2017), '12-01-2017'] =  (df['Dec'])

df.loc[(df['Year']==2018), '01-01-2018'] =  (df['Jan'])
df.loc[(df['Year']==2018), '02-01-2018'] =  (df['Feb'])
df.loc[(df['Year']==2018), '03-01-2018'] =  (df['Mar'])
df.loc[(df['Year']==2018), '04-01-2018'] =  (df['Apr'])
df.loc[(df['Year']==2018), '05-01-2018'] =  (df['May'])
df.loc[(df['Year']==2018), '06-01-2018'] =  (df['Jun'])      
df.loc[(df['Year']==2018), '07-01-2018'] =  (df['Jul'])
df.loc[(df['Year']==2018), '08-01-2018'] =  (df['Aug'])
df.loc[(df['Year']==2018), '09-01-2018'] =  (df['Sep'])
df.loc[(df['Year']==2018), '10-01-2018'] =  (df['Oct'])
df.loc[(df['Year']==2018), '11-01-2018'] =  (df['Nov'])
df.loc[(df['Year']==2018), '12-01-2018'] =  (df['Dec'])

df.loc[(df['Year']==2019), '01-01-2019'] =  (df['Jan'])
df.loc[(df['Year']==2019), '02-01-2019'] =  (df['Feb'])
df.loc[(df['Year']==2019), '03-01-2019'] =  (df['Mar'])
df.loc[(df['Year']==2019), '04-01-2019'] =  (df['Apr'])
df.loc[(df['Year']==2019), '05-01-2019'] =  (df['May'])
df.loc[(df['Year']==2019), '06-01-2019'] =  (df['Jun'])      
df.loc[(df['Year']==2019), '07-01-2019'] =  (df['Jul'])
df.loc[(df['Year']==2019), '08-01-2019'] =  (df['Aug'])
df.loc[(df['Year']==2019), '09-01-2019'] =  (df['Sep'])
df.loc[(df['Year']==2019), '10-01-2019'] =  (df['Oct'])
df.loc[(df['Year']==2019), '11-01-2019'] =  (df['Nov'])
df.loc[(df['Year']==2019), '12-01-2019'] =  (df['Dec']) 

print (df)  

【问题讨论】:

  • 欢迎来到 SO。请描述您正在尝试做什么,并提供示例预期输出。所以我们可以提供帮助

标签: python time-series data-wrangling


【解决方案1】:

下面是我的做法:

from io import StringIO

df_ts = pd.read_csv(StringIO("""Company,State,Product,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec 
Fred,NJ,A,2017,111,7,69,152,192,218,246,98,329,59,191,43 
Fred,NJ,A,2018,391,317,241,115,40,300,295,288,89,178,73,110 
Fred,NJ,A,2019,345,30,271,78,10,45,185,349,237,70,197,389 
George,AZ,B,2017,260,55,258,389,218,7,263,241,191,398,266,42 
George,AZ,B,2018,260,10,256,399,216,356,164,10,256,399,231,287 
George,AZ,B,2019,173,10,360,45,16,297,25,1,134,35,294,193 
Jamie,NY,C,2017,360,18,127,96,175,63,40,194,118,332,128,26 
Jamie,NY,C,2018,70,175,301,259,160,89,3,314,115,115,117,281 
Jamie,NY,C,2019,259,11,22,122,191,7,123,378,14,327,95,325 
Barbara,UT,D,2017,338,,278,176,80,338,268,52,240,383,217,225 
Barbara,UT,D,2018,316,242,129,158,238,97,234,266,62,223,287,21 
Barbara,UT,D,2019,378,130,106,70,92,77,389,189,140,126,193,188 
GETTY,CA,E,2017,50,5,77,47,8,169,163,58,324,186,208,57 
GETTY,CA,E,2018,53,92,177,111,11,238,96,85,129,396,75,84 
GETTY,CA,E,2019,355,,254,303,195,353,141,168,97,30,25,310 
Barbara,UT,F,2017,167,192,37,53,390,357,340,48,389,202,92,51 
Barbara,UT,F,2018,177,265,359,350,318,359,359,93,393,251,51,255 
Barbara,UT,F,2019,35,302,234,306,156,13,122,354,0,214,81,86 
Bruse,OR,A,2017,146,231,246,246,150,82,302,249,225,213,394,140 
Bruse,OR,A,2018,244,385,36,86,270,381,309,98,163,321,337,139 
Bruse,OR,A,2019,183,331,337,188,235,308,10,130,143,31,193,4 
GETTY,CA,B,2017,152,273,60,162,68,53,334,204,238,262,220,223 
GETTY,CA,B,2018,161,221,286,80,182,261,387,208,165,303,220,101 
GETTY,CA,B,2019,310,218,332,349,144,3,2,82,132,199,375,257"""))



df_ts.melt(
    id_vars=["Company", "State", "Product", "Year"],
    var_name="month",
    value_name="demand"
)

这会生成:

    Company State   Product Year    month   demand
0   Fred    NJ  A   2017    Jan 111.0
1   Fred    NJ  A   2018    Jan 391.0
2   Fred    NJ  A   2019    Jan 345.0
3   George  AZ  B   2017    Jan 260.0
4   George  AZ  B   2018    Jan 260.0
... ... ... ... ... ... ...
283 Bruse   OR  A   2018    Dec 139.0
284 Bruse   OR  A   2019    Dec 4.0
285 GETTY   CA  B   2017    Dec 223.0
286 GETTY   CA  B   2018    Dec 101.0
287 GETTY   CA  B   2019    Dec 257.0

之后你就可以进行时间操作了,像这样:

df_ts = df_ts.set_index(df_ts.apply(lambda x: pd.to_datetime(f"{x.Year}-{x.month}"), axis=1))
df_ts = df_ts.drop(["Year", "month"], axis=1)
df_ts

剩下的就是:

          Company   State   Product demand
2017-01-01  Fred    NJ  A   111.0
2018-01-01  Fred    NJ  A   391.0
2019-01-01  Fred    NJ  A   345.0
2017-01-01  George  AZ  B   260.0
2018-01-01  George  AZ  B   260.0
... ... ... ... ...
2018-12-01  Bruse   OR  A   139.0
2019-12-01  Bruse   OR  A   4.0
2017-12-01  GETTY   CA  B   223.0
2018-12-01  GETTY   CA  B   101.0
2019-12-01  GETTY   CA  B   257.0

之后就可以随意操作了。

【讨论】:

  • 很好,请接受它作为答案;)。谢谢
猜你喜欢
  • 2020-05-20
  • 2020-01-15
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2021-11-20
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
相关资源
最近更新 更多