【Question title】: Making Iterative Dates in a Pandas Dataframe
【Posted】: 2021-02-26 03:48:41
【Question description】:

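I have a problem statement as follows:

In each examination center, the exam is organized in two shifts, Batch I and Batch II (reporting times 9:00 AM and 2:00 PM). The exam can be conducted on any day between 1 and 30 December 2020 in a district, depending on the number of candidates in that district. Note that each district can have only one exam center, and at most 20 students can appear in one shift. Based on the above information, complete the allotment of the exam database:

  • Rollno: candidate roll numbers start from NL2000001 (e.g. NL2000001, NL2000002, NL2000003, ...)
  • cent_allot: allot the center by entering the exam city code
  • cent_add: use NL "District Name" as the center address of each location (e.g. if the district name is ADI, the center address is NL ADI)
  • examDate: allot any exam date between 1 December 2020 and 30 December 2020, keeping the number of exam days to a minimum without violating any of the conditions above
  • batch: allot batch I or II, respecting all the conditions above
  • rep_time: the reporting time is 9:00 AM for batch I and 2:00 PM for batch II.

Based on the description above, I need to build a table that satisfies these conditions. I have already made the Rollno, cent_allot and cent_add columns, but I am struggling with the examDate column, because within each district every block of 40 rows should share the same date.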

Below is the list of districts and their frequencies:

Dist    Count
WGL     299
MAHB    289
KUN     249
GUN     198
KARN    196
KRS     171
CTT     169
VIZ     150
PRA     145
NALG    130
MED     128
ADI     123
KPM     119
TRI     107
ANA     107
KHAM    85
NEL     85
VIZI    84
EGOD    84
SOA     84
SIR     80
NIZA    73
PUD     70
KRK     69
WGOD    56

Here are the first 25 rows of the data frame:

Rollno     cent_allot   cent_add    examDate    batch   rep_time
NL2000001   WGL          NL WGL       NaN        NaN    NaN
NL2000002   WGL          NL WGL       NaN        NaN    NaN
NL2000003   WGL          NL WGL       NaN        NaN    NaN
NL2000004   KUN          NL KUN       NaN        NaN    NaN
NL2000005   KUN          NL KUN       NaN        NaN    NaN
NL2000006   KUN          NL KUN       NaN        NaN    NaN
NL2000007   GUN          NL GUN       NaN        NaN    NaN
NL2000008   GUN          NL GUN       NaN        NaN    NaN
NL2000009   GUN          NL GUN       NaN        NaN    NaN
NL2000010   GUN          NL GUN       NaN        NaN    NaN
NL2000011   VIZ          NL VIZ       NaN        NaN    NaN
NL2000012   VIZ          NL VIZ       NaN        NaN    NaN
NL2000013   VIZ          NL VIZ       NaN        NaN    NaN
NL2000014   VIZ          NL VIZ       NaN        NaN    NaN
NL2000015   MAHB         NL MAHB      NaN        NaN    NaN
NL2000016   MAHB         NL MAHB      NaN        NaN    NaN
NL2000017   MAHB         NL MAHB      NaN        NaN    NaN
NL2000018   WGOD         NL WGOD      NaN        NaN    NaN
NL2000019   WGOD         NL WGOD      NaN        NaN    NaN
NL2000020   WGOD         NL WGOD      NaN        NaN    NaN
NL2000021   WGOD         NL WGOD      NaN        NaN    NaN
NL2000022   EGOD         NL EGOD      NaN        NaN    NaN
NL2000023   EGOD         NL EGOD      NaN        NaN    NaN
NL2000024   EGOD         NL EGOD      NaN        NaN    NaN
NL2000025   EGOD         NL EGOD      NaN        NaN    NaN

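The last 3 columns are all NaN because they have not been generated yet.

Take WGL as an example. According to the statement above, each district allows at most 20 candidates per shift, which means that within a district the same date is assigned to 40 rows at a time, and the same batch and reporting time are assigned to 20 rows at a time.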
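
To make the capacity arithmetic concrete, here is a small sketch of how many exam days a district needs at 40 candidates per day. The variable names are my own and only four of the counts from the list above are used; it is an illustration, not part of the assignment.

```python
import math

# Capacity figures taken from the problem statement.
STUDENTS_PER_SHIFT = 20
SHIFTS_PER_DAY = 2
per_day = STUDENTS_PER_SHIFT * SHIFTS_PER_DAY  # 40 candidates per district per day

# A few counts copied from the frequency list above.
counts = {"WGL": 299, "MAHB": 289, "KRK": 69, "WGOD": 56}

# Minimum number of exam days each district needs.
days_needed = {dist: math.ceil(n / per_day) for dist, n in counts.items()}
print(days_needed)
```

Even the largest district (WGL, 299 candidates) needs only 8 days, so the 30-day window is never the limiting factor here.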

Does anyone know how to do this?

【Comments】:

    Tags: python pandas dataframe datetime data-science


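    【Solution 1】:

    The key is to first get a running number per district with .groupby().cumcount(). examDate and batch can then be derived from that running number: integer division by 40 gives the day offset, and integer division by 20, taken modulo 2, gives the batch.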
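
As a minimal sketch of that idea, the snippet below shows what cumcount() returns on a tiny frame and how a few running numbers map to a day offset and a batch. The frame `demo` and its district order are made up for illustration.

```python
import pandas as pd

# Tiny illustrative frame; the district order is made up.
demo = pd.DataFrame({"cent_allot": ["WGL", "KUN", "WGL", "WGL", "KUN"]})

# cumcount() numbers the rows of each group 0, 1, 2, ... in order of appearance.
demo["gp_no"] = demo.groupby("cent_allot").cumcount()
print(demo["gp_no"].tolist())  # [0, 0, 1, 2, 1]

# Map a running number to (day offset, batch): 40 rows per day, 20 rows per batch.
mapping = {n: (n // 40, 1 + (n // 20) % 2) for n in (0, 19, 20, 39, 40)}
print(mapping)
```

Running numbers 0 to 19 land on day 0 in batch 1, 20 to 39 on day 0 in batch 2, and 40 starts day 1 in batch 1 again.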

    Data

    Generate random rows using the given total count for each Dist.

    import numpy as np
    import pandas as pd
    import io
    import datetime
    
    df_count = pd.read_csv(io.StringIO("""
    Dist    Count
    WGL     299
    MAHB    289
    KUN     249
    GUN     198
    KARN    196
    KRS     171
    CTT     169
    VIZ     150
    PRA     145
    NALG    130
    MED     128
    ADI     123
    KPM     119
    TRI     107
    ANA     107
    KHAM    85
    NEL     85
    VIZI    84
    EGOD    84
    SOA     84
    SIR     80
    NIZA    73
    PUD     70
    KRK     69
    WGOD    56
    """), sep=r"\s{2,}", engine="python")
    
    # generate random cent_allot
    df = df_count.loc[np.repeat(df_count.index.values, df_count["Count"]), "Dist"]\
        .sample(frac=1)\
        .reset_index(drop=True)\
        .to_frame()\
        .rename(columns={"Dist": "cent_allot"})
    
    df["Rollno"] = df.index.map(lambda s: f"NL2{s+1:06}")
    df["cent_add"] = df["cent_allot"].map(lambda s: f"NL {s}")
    

    At this point df should look similar to what you have.

    Code

    # Assign the first examDate
    first_day = datetime.date(2020, 12, 1)
    
    # running no. grouped by "cent_allot" (i.e. "Dist")
    df["gp_no"] = df.groupby("cent_allot").cumcount()
    
    # increase one day for every 40 records (floor division)
    df["examDate"] = df["gp_no"].apply(lambda x: first_day + datetime.timedelta(days=x // 40))
    
    # batch - determined by the parity of (running no. // 20)
    df["batch"] = df["gp_no"].apply(lambda x: 1 + (x // 20) % 2)
    
    # map batch to time (or "9 AM" / "2 PM" as you'd like)
    df["rep_time"] = df["batch"].apply(lambda x: datetime.time(9, 0) if x == 1 else datetime.time(14, 0))
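
# A vectorized variant of the same three columns, offered as a sketch rather
# than as part of the original answer. It runs on a small stand-in frame
# (df_demo is a made-up name) and stores rep_time as plain strings, which is
# an assumption about the desired output format.

```python
import numpy as np
import pandas as pd

# Stand-in for the real frame: 45 rows for one district, 3 for another.
df_demo = pd.DataFrame({"cent_allot": ["WGL"] * 45 + ["KUN"] * 3})
gp_no = df_demo.groupby("cent_allot").cumcount()

first_day = pd.Timestamp("2020-12-01")

# One extra day for every 40 rows within a district, without apply().
df_demo["examDate"] = first_day + pd.to_timedelta(gp_no // 40, unit="D")

# Batches alternate every 20 rows within a district.
df_demo["batch"] = 1 + (gp_no // 20) % 2

# Reporting time follows directly from the batch.
df_demo["rep_time"] = np.where(df_demo["batch"] == 1, "9:00 AM", "2:00 PM")
```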
    

    Output

    print(df[["Rollno", "cent_allot", "cent_add", "examDate", "batch", "rep_time"]])
    
             Rollno cent_allot cent_add    examDate  batch  rep_time
    0     NL2000001        CTT   NL CTT  2020-12-01      1  09:00:00
    1     NL2000002       MAHB  NL MAHB  2020-12-01      1  09:00:00
    2     NL2000003        CTT   NL CTT  2020-12-01      1  09:00:00
    3     NL2000004        SOA   NL SOA  2020-12-01      1  09:00:00
    4     NL2000005        PUD   NL PUD  2020-12-01      1  09:00:00
             ...        ...      ...         ...    ...       ...
    3345  NL2003346       KHAM  NL KHAM  2020-12-03      1  09:00:00
    3346  NL2003347        ADI   NL ADI  2020-12-04      1  09:00:00
    3347  NL2003348       KARN  NL KARN  2020-12-05      2  14:00:00
    3348  NL2003349        SIR   NL SIR  2020-12-02      2  14:00:00
    3349  NL2003350        ADI   NL ADI  2020-12-04      1  09:00:00
    
    [3350 rows x 6 columns]
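
One way to sanity-check an allotment like this is to count rows per (center, date, batch) and confirm that no shift exceeds 20. The helper name `check_allotment` and the two small frames below are hypothetical and only serve to demonstrate the check; on the real data you would pass `df` instead.

```python
import pandas as pd

def check_allotment(frame, max_per_shift=20):
    """Return True if no (center, date, batch) shift exceeds the cap."""
    sizes = frame.groupby(["cent_allot", "examDate", "batch"]).size()
    return bool((sizes <= max_per_shift).all())

# Made-up example: 25 candidates of one district on a single day.
ok = pd.DataFrame({
    "cent_allot": ["WGL"] * 25,
    "examDate": ["2020-12-01"] * 25,
    "batch": [1] * 20 + [2] * 5,
})
bad = ok.assign(batch=1)  # all 25 squeezed into batch 1

print(check_allotment(ok), check_allotment(bad))  # True False
```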
    

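    【Comments】:

    • Thanks for your code. It looks concise and appropriate. I did not run it, because I had already come up with a tedious solution of my own, but I still really appreciate your help.
    【Solution 2】:

    I struggled a lot to get a solution, but finally, at the end of the day on which I asked this question, I found one: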

    # examDate column
    
    from datetime import date, timedelta
    import pandas as pd
    
    n_stud = 20   # number of students per batch
    n_batch = 2   # number of batches per day
    
    # centers and their counts, sorted by center name
    # (index/values are used directly so this does not depend on the column
    #  names that reset_index() produces in a given pandas version)
    temp = data['TH_CENT_CH'].value_counts().sort_index()
    cent = temp.index.to_list()  # centers as a list
    cnt = temp.to_list()         # counts as a list
    cent1 = []
    cnt1 = []
    j = 0
    
    # for loops to repeat each center by count times
    for c in cent:
        for i in range(1, cnt[j] + 1):
            cent1.append(c)
            cnt1.append(i)
        j += 1
    
    df1 = pd.DataFrame(list(zip(cent1, cnt1)), columns = ['cent','cnt'])  # dataframe to store the centers and new count list
    
    counts = df1['cnt'].to_list() # storing the new counts in a list
    helper = {}  # helper dictionary
    max_no = max(cnt)
    
    # for-while loops to map helper number to each counts number
    for i in counts:
        j = 0
        while(j < (round(max_no / (n_stud * n_batch)) + 1)):
            # window j covers counts n_stud*n_batch*j + 1 .. n_stud*n_batch*(j + 1)
            if (i > n_stud * n_batch * j) and (i <= n_stud * n_batch * (j + 1)):
                helper[i] = j
            j += 1
    
    # mapping the helper with counts
    counts = pd.Series(counts)
    helper = pd.Series(helper)
    hel = counts.map(helper).to_list()
    df1['helper'] = hel
    
    examDate = {}  # dictionary to store exam dates
    
    # for loop to map dates to each helper number
    for i in hel:
        examDate[i] = pd.to_datetime(date(2020, 12, 1) + timedelta(days = (2 * i)))
    
    # mapping the dates with helpers
    hel = pd.Series(hel)
    examDate = pd.Series(examDate)
    exam = hel.map(examDate).to_list()
    df1['examDate'] = exam
            
    # adding the dates to the original dataframe
    # (positional assignment: assumes the rows of data are sorted by TH_CENT_CH, like df1)
    examDate = df1['examDate'].to_list()
    data['examDate'] = examDate
    data['examDate']
    

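    Here TH_CENT_CH refers to the district column in the original data frame. When I ran data.head(), I got the output I needed, i.e. the same date for each block of 40 students. I did something similar for the remaining two columns, where each block of 20 students needs the same batch. So the output I got is: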

            Rollno  cent_allot  cent_add  examDate   batch  rep_time
    0     NL2000001        ADI   NL ADI  2020-12-01      1  09:00:00
    1     NL2000002        ADI   NL ADI  2020-12-01      1  09:00:00
    2     NL2000003        ADI   NL ADI  2020-12-01      1  09:00:00
    3     NL2000004        ADI   NL ADI  2020-12-01      1  09:00:00
    4     NL2000005        ADI   NL ADI  2020-12-01      1  09:00:00
             ...        ...      ...         ...    ...       ...
    3345  NL2003346        WGOD  NL WGOD 2020-12-03      1  09:00:00
    3346  NL2003347        WGOD  NL WGOD 2020-12-04      1  09:00:00
    3347  NL2003348         KRS  NL KRS  2020-12-05      1  09:00:00
    3348  NL2003349        WGOD  NL WGOD 2020-12-02      1  09:00:00
    3349  NL2003350        WGOD  NL WGOD 2020-12-04      1  09:00:00
    

    Here is the code for the remaining two columns:

    # batch column
    
    counts = df1['cnt'].to_list()  # storing the new counts in a list
    helper2 = {}  # helper dictionary
    
    # for-while loops to map helper number to each counts number
    for i in counts:
        j = 0
        while(j < (round(max_no / (n_stud)) + 1)):
            # window j covers counts n_stud*j + 1 .. n_stud*(j + 1)
            if (i > n_stud * j) and (i <= n_stud * (j + 1)):
                helper2[i] = j
            j += 1
    
    # mapping the helper with counts
    counts = pd.Series(counts)
    helper2 = pd.Series(helper2)
    hel2 = counts.map(helper2).to_list()
    df1['helper2'] = hel2
    
    batch = {}   # dictionary to store batch numbers
    
    # for loop to map batch numbers to each helper number
    for i in hel2:
        if(i % 2 == 0):
            batch[i] = 1
        else:
            batch[i] = 2
            
    # mapping the batches with helpers
    hel2 = pd.Series(hel2)
    batch = pd.Series(batch)
    bat = hel2.map(batch).to_list()
    df1['batch'] = bat
    
    # adding the batches to the original dataframe
    batch = df1['batch'].to_list()
    data['batch'] = batch
    data['batch'].unique()
    
    # rep_time column
    data.loc[data['batch'] == 1, 'rep_time'] = '9:00 AM'
    data.loc[data['batch'] == 2, 'rep_time'] = '2:00 PM'
    data['rep_time'].unique()
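
# For comparison, the helper number that the loops end up with for a 1-based
# count i appears to be the closed form (i - 1) // 40 for dates, and
# (i - 1) // 20 for batches. The sketch below checks the date case against a
# loop-based lookup; helper_by_loop is a hypothetical name, and 299 is the
# largest district count from the question.

```python
n_stud, n_batch = 20, 2
per_day = n_stud * n_batch  # 40 candidates per district per day

def helper_by_loop(i, max_no):
    """Find the window j with per_day*j < i <= per_day*(j + 1) by scanning."""
    found = None
    j = 0
    while j < round(max_no / per_day) + 1:
        if per_day * j < i <= per_day * (j + 1):
            found = j
        j += 1
    return found

# The closed form gives the same helper number for every count 1..299.
same = all(helper_by_loop(i, 299) == (i - 1) // per_day for i in range(1, 300))
print(same)
```

# If the check holds, the helper dictionaries can be replaced by one
# floor-division expression per column.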
    
