【Question title】: How to add a column based on an original column with a string pattern?
【Posted】: 2021-04-29 09:55:47
【Question description】:

df

         f
0   l2y_q1_eps_gg
1   l2y_q2_eps_gg
2   l2y_q3_eps_gg
3   l2y_q4_eps_gg
4   l1y_q1_eps_gg

Goal

         f          fr_date
0   l2y_q1_eps_gg   20190331
1   l2y_q2_eps_gg   20190630
2   l2y_q3_eps_gg   20190930
3   l2y_q4_eps_gg   20191231
4   l1y_q1_eps_gg   20200331
5   cy_q1_eps_gg    20210331

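The fr_date column holds the last day of the given quarter of the given year, stored as an int, according to these rules:

  • l2y: 2019
  • l1y: 2020
  • cy: 2021
  • q1-q4: the last day of the corresponding quarter

Notes:

  • Values in column f start with l2y/l1y/cy followed by q1/q2/q3/q4.
  • The rules shift with the current year. For example, if the current year is 2022, then l2y→2020, l1y→2021, cy→2022.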

【Question comments】:

  • Please show your code and where it fails.

Tags: python pandas


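【Solution 1】:

You are asking for two things: a translation function, and how to apply that function to a column of a pandas DataFrame to get a new column.

The translation function

There are several ways to do it, but here is one: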

from datetime import datetime

# Last days of quarters are always the same
last_quarter_days = {"q1": "0331", "q2": "0630", "q3": "0930", "q4": "1231"}

def translate_date(string):
    # Extract the year code and the quarter code from the full string
    year_str, quarter_str, _, _ = string.split("_")
    # Compute year automatically
    current_year = datetime.today().year
    if year_str == "cy":
        year = current_year
    else:
        # This is a dumb extractor, you could do a pattern search
        # and raise an exception if the string is not correct
        sub = int(year_str[1])
        year = current_year - sub
    # Translate the quarter string thanks to the translation table
    day = last_quarter_days[quarter_str]
    # return the date as an integer (but maybe you want a string?)
    return int("{year}{day}".format(year=year, day=day))

This gives (when run in 2021):

>>> translate_date("cy_q1_eps_gg")                    
20210331

How to apply it to your DataFrame

Use the pandas map method:

df["fr_date"] = df["f"].map(translate_date)
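
The comment inside translate_date suggests doing a pattern search and raising an exception when the string is malformed. A minimal sketch of that idea follows. The name translate_date_strict, the regular expression, and the optional current_year argument are assumptions made for illustration and are not part of the original answer.

```python
import re
from datetime import datetime

# Month and day of the last day of each quarter
LAST_QUARTER_DAYS = {"q1": "0331", "q2": "0630", "q3": "0930", "q4": "1231"}

# "cy" or "l<N>y", then "_q1".."_q4", then either "_" or the end of the string
PATTERN = re.compile(r"^(?:cy|l(\d+)y)_(q[1-4])(?:_|$)")


def translate_date_strict(string, current_year=None):
    """Hypothetical stricter variant of translate_date (illustration only)."""
    match = PATTERN.match(string)
    if match is None:
        raise ValueError(f"unexpected format: {string!r}")
    if current_year is None:
        current_year = datetime.today().year
    # group(1) is None for "cy", which means zero years back
    years_back = int(match.group(1)) if match.group(1) else 0
    year = current_year - years_back
    return int(f"{year}{LAST_QUARTER_DAYS[match.group(2)]}")


print(translate_date_strict("l2y_q1_eps_gg", current_year=2021))  # 20190331
```

Passing current_year explicitly also makes it easy to check the year rollover described in the question without waiting for the calendar to change.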

【Comments】:

    【Solution 2】:

    You can use the QuarterEnd offset to compute the date of each quarter end:

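    import pandas as pd

    # pd.datetime is no longer available in recent pandas; use pd.Timestamp instead
    current_year = pd.Timestamp.now().year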
    
    mapping = {"l2y": current_year - 2, "l1y": current_year - 1, "cy": current_year}
    
    df["year"] = df.f.str.extract(r"([^_]+)")
    df["year"] = df["year"].map(mapping)
    df["quarter"] = df.f.str.extract(r"_q([\d])")
    
    df["fr_date"] = df.apply(
        lambda x: (
            pd.Timestamp(year=x["year"], month=int(x["quarter"]) * 3, day=1)
            + pd.tseries.offsets.QuarterEnd()
        ).strftime("%Y%m%d"),
        axis=1,
    )
    print(df[["f", "fr_date"]])
    

    Prints (when run in 2021):

                   f   fr_date
    0  l2y_q1_eps_gg  20190331
    1  l2y_q2_eps_gg  20190630
    2  l2y_q3_eps_gg  20190930
    3  l2y_q4_eps_gg  20191231
    4  l1y_q1_eps_gg  20200331
    5   cy_q1_eps_gg  20210331
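
The question asks for fr_date as an int, while the strftime call above produces strings. One possible way to get integers without a row-wise apply is plain arithmetic on the extracted year and quarter, sketched below. The helper name add_fr_date, the lookup table QUARTER_END_MMDD, and the regular expression are assumptions for illustration. Rows that do not match the expected pattern raise an error when the quarter is cast to int.

```python
import pandas as pd

# Month-and-day part (MMDD as an integer) of the last day of each quarter
QUARTER_END_MMDD = {1: 331, 2: 630, 3: 930, 4: 1231}


def add_fr_date(df, current_year):
    """Hypothetical helper: return a copy of df with an integer fr_date column."""
    # "back" is the number of years before current_year (missing for "cy")
    parts = df["f"].str.extract(r"^(?:cy|l(?P<back>\d+)y)_q(?P<quarter>[1-4])")
    years_back = parts["back"].fillna("0").astype(int)
    quarter = parts["quarter"].astype(int)
    year = current_year - years_back
    out = df.copy()
    # YYYY * 10000 + MMDD gives the YYYYMMDD integer directly
    out["fr_date"] = year * 10000 + quarter.map(QUARTER_END_MMDD)
    return out


df = pd.DataFrame(
    {
        "f": [
            "l2y_q1_eps_gg",
            "l2y_q2_eps_gg",
            "l2y_q3_eps_gg",
            "l2y_q4_eps_gg",
            "l1y_q1_eps_gg",
            "cy_q1_eps_gg",
        ]
    }
)
result = add_fr_date(df, current_year=2021)
print(result)
```

This relies on the fact, also used in Solution 1, that quarter ends always fall on the same month and day, so no date arithmetic is needed.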
    

    【Comments】:

      【Solution 3】:
      df = pd.concat([df, df['f'].str.split('_', expand=True)], axis=1)
      df
                     f    0   1    2   3
      0  l2y_q1_eps_gg  l2y  q1  eps  gg
      1  l2y_q2_eps_gg  l2y  q2  eps  gg
      2  l2y_q3_eps_gg  l2y  q3  eps  gg
      3  l2y_q4_eps_gg  l2y  q4  eps  gg
      4  l1y_q1_eps_gg  l1y  q1  eps  gg
      
      df['year']=df[0].map({'l2y':'2019','l1y':'2020','cy':'2021'})
      df['quarter']=df[1].str.upper()
      df['fr_date'] = df['year'] + '-' + df['quarter']
      df = df.drop([0,1,2,3], axis=1)
      print(df)
                     f  year quarter  fr_date
      0  l2y_q1_eps_gg  2019      Q1  2019-Q1
      1  l2y_q2_eps_gg  2019      Q2  2019-Q2
      2  l2y_q3_eps_gg  2019      Q3  2019-Q3
      3  l2y_q4_eps_gg  2019      Q4  2019-Q4
      4  l1y_q1_eps_gg  2020      Q1  2020-Q1
      
      df['fr_date'] = pd.to_datetime([f'{x[:4]}{x[-2:]}' for x in df['fr_date']])
      df
                     f  year quarter    fr_date
      0  l2y_q1_eps_gg  2019      Q1 2019-01-01
      1  l2y_q2_eps_gg  2019      Q2 2019-04-01
      2  l2y_q3_eps_gg  2019      Q3 2019-07-01
      3  l2y_q4_eps_gg  2019      Q4 2019-10-01
      4  l1y_q1_eps_gg  2020      Q1 2020-01-01
      
      
      df['fr_date'] = pd.to_datetime(df['fr_date']) +  pd.tseries.offsets.QuarterEnd()
      df['fr_date'] = df['fr_date'].dt.strftime('%Y%m%d')
      df = df.drop(['year', 'quarter'], axis=1)
      print(df)
                     f   fr_date
      0  l2y_q1_eps_gg  20190331
      1  l2y_q2_eps_gg  20190630
      2  l2y_q3_eps_gg  20190930
      3  l2y_q4_eps_gg  20191231
      4  l1y_q1_eps_gg  20200331
      

      【Comments】:

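        【Solution 4】:

        Create a function change_string and apply it to column f. The function does the following:

        • Builds a dictionary that maps each year code to a year
        • Uses a regex to extract the year code from the string, then looks up the year in the dictionary
        • Uses a regex to extract the quarter from the string
        • Uses pd.Timestamp with month = quarter * 3 and day = 1 to get the first day of the quarter's last month, then adds pd.tseries.offsets.QuarterEnd() to reach the quarter end
        • Finally uses strftime to return the datetime in the desired string format
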
        import re
        from datetime import date
        import pandas as pd

        def change_string(data):
            this_year = date.today().year
            changes = {"cy": this_year, "l1y": this_year - 1, "l2y": this_year - 2}
            # Also match "cy"; the pattern "^l\dy" alone would miss it
            year = changes[re.findall(r"^(?:cy|l\dy)", data)[0]]
            quarter = int(re.findall(r"_q(\d)", data)[0])
            data = (pd.Timestamp(year=year, month=quarter * 3, day=1) + pd.tseries.offsets.QuarterEnd()).strftime("%Y%m%d")
            return data
        
        
        df = pd.DataFrame({"f":["l2y_q1_eps_gg","l2y_q2_eps_gg","l2y_q3_eps_gg","l2y_q4_eps_gg","l1y_q1_eps_gg"]})
        df["fr_date"] = df.f.apply(change_string)
        print(df)
        
        
                       f         fr_date
            0   l2y_q1_eps_gg   20190331
            1   l2y_q2_eps_gg   20190630
            2   l2y_q3_eps_gg   20190930
            3   l2y_q4_eps_gg   20191231
            4   l1y_q1_eps_gg   20200331
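
To see why month = quarter * 3 with day = 1 works, the short sketch below walks through the four quarters of one year. It is only an illustration: the fixed year 2019 and the variable names are assumptions, and the final int() cast is added because the question asks for an integer column.

```python
import pandas as pd

results = []
for quarter in (1, 2, 3, 4):
    # First day of the quarter's last month: March, June, September, December
    first_of_last_month = pd.Timestamp(year=2019, month=quarter * 3, day=1)
    # QuarterEnd rolls a date that is not already a quarter end forward to the next one
    quarter_end = first_of_last_month + pd.tseries.offsets.QuarterEnd()
    results.append(int(quarter_end.strftime("%Y%m%d")))

print(results)  # [20190331, 20190630, 20190930, 20191231]
```

Because the starting date is always the first of the month, it is never itself a quarter end, so the offset cannot skip ahead to the following quarter.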
        
        
        

        【Comments】:
