【Question Title】: Fill in missing data from Queryset Django
【Posted】: 2020-05-12 02:35:00
【Question】:

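I've inherited an AngularJS / Django application that uses DjangoRestFramework and a Postgres DB, and it is being replatformed from AngularJS to React / Redux. One thing we're trying to do is render time series data with amCharts4. One problem we've run into (among many others) is rendering data over time ranges for which there may be no entries in the DB. For example, our results might look like: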

[
    {
        "date": "2020-01-16T00:00:00.000Z",
        "result": 3
    },
    {
        "date": "2020-01-18T00:00:00.000Z",
        "result": 2
    }
]

and we'd like them to look like:

[
    {
        "date": "2020-01-16T00:00:00.000Z",
        "result": 3
    },
    {
        "date": "2020-01-17T00:00:00.000Z",
        "result": 0
    },
    {
        "date": "2020-01-18T00:00:00.000Z",
        "result": 2
    }
]

In addition, we have data with multiple data points per time event:

[
    {
        "date": "2020-01-13T00:00:00Z",
        "result": 1,
        "name": "Yes"
    },
    {
        "date": "2020-01-14T00:00:00Z",
        "result": 1,
        "name": "No"
    },
    {
        "date": "2020-01-16T00:00:00Z",
        "result": 1,
        "name": "No"
    }
]

and we'd like to fill in a result of 0 for every name on every date that has no result:

[
    {
        "date": "2020-01-13T00:00:00Z",
        "result": 1,
        "name": "Yes"
    },
    {
        "date": "2020-01-13T00:00:00Z",
        "result": 0,
        "name": "No"
    },
    {
        "date": "2020-01-14T00:00:00Z",
        "result": 0,
        "name": "Yes"
    },
    {
        "date": "2020-01-14T00:00:00Z",
        "result": 1,
        "name": "No"
    },
    {
        "date": "2020-01-15T00:00:00Z",
        "result": 0,
        "name": "Yes"
    },
    {
        "date": "2020-01-15T00:00:00Z",
        "result": 0,
        "name": "No"
    },
    {
        "date": "2020-01-16T00:00:00Z",
        "result": 0,
        "name": "Yes"
    },
    {
        "date": "2020-01-16T00:00:00Z",
        "result": 1,
        "name": "No"
    }
]

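The range of these results is also not necessarily bounded by the earliest and latest dates in the data; it can be specified by the user. In that case, we need to fill in zero-value results for all options on all dates within the requested range.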

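I'm aware of the amCharts skipEmptyPeriods property (amCharts4 - skipEmptyPeriods), but my front-end engineer tells me it doesn't work when there are multiple trend lines (i.e. the second case, where each date has multiple options). Besides, this isn't really a front-end problem, and handling it there would cause performance issues.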

I also tried using Postgresql's generate_series function together with coalesce (Postgresql - generate_series), but couldn't get it to work for the second case.

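Currently I'm attempting this in Pandas (which I've never used before). I've solved the first problem of a single entry per date, but I'm again stuck on the second case of multiple entries per date: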

    from_date = request.query_params.get("from_date")
    to_date = request.query_params.get("to_date")

    # let's do some zero plotting
    filtered_queryset = list(filtered_queryset)
    if from_date:
        from_date = datetime.strptime(from_date, "%Y-%m-%d").astimezone(pytz.UTC)
    else:
        from_date = filtered_queryset[0]["date"]
    if to_date:
        to_date = datetime.strptime(to_date, "%Y-%m-%d").astimezone(pytz.UTC)
        _now = localtime(now()).astimezone(pytz.UTC)
        to_date = min(to_date, _now)
    else:
        to_date = localtime(now()).astimezone(pytz.UTC)

    pandas_freq_map = {"day": "D", "week": "W-MON", "month": "MS"}
    freq = pandas_freq_map.get(request.query_params.get("frequency"))

    idx = pd.date_range(from_date.date(), to_date.date(), freq=freq)
    df = pd.DataFrame(list(filtered_queryset))
    datetime_series = pd.to_datetime(df["date"])
    datetime_index = pd.DatetimeIndex(datetime_series.values)

    df = df.set_index(datetime_index)
    df.drop("date", axis=1, inplace=True)
    df = df.asfreq(freq)
    df = df.reindex(idx, fill_value=0)
    df_json = json.JSONDecoder().decode(df.to_json(date_format="iso"))

    # this (result or 0) tomfoolery is bc I don't understand why pandas sometimes reindexes with null as the fill_value
    prepared_response = [{"date": date, "result": (result or 0)} for date, result in df_json["result"].items()]
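A note on the nulls mentioned in the comment above: one likely cause is the `asfreq` call that runs before `reindex`. `asfreq` already inserts the missing periods with NaN, and `fill_value` in `reindex` only applies to labels that do not exist yet. The following is a minimal sketch with made-up dates taken from the first example, not the original view code:

```python
import pandas as pd

# Two observed days with a gap on 2020-01-17 (values from the first example).
df = pd.DataFrame(
    {"result": [3, 2]},
    index=pd.DatetimeIndex(["2020-01-16", "2020-01-18"]),
)
idx = pd.date_range("2020-01-15", "2020-01-19", freq="D")

# asfreq creates the 2020-01-17 row first and gives it NaN, so the later
# reindex sees that label as already present and leaves the NaN alone.
via_asfreq = df.asfreq("D").reindex(idx, fill_value=0)

# Reindexing directly lets fill_value cover every missing label.
direct = df.reindex(idx, fill_value=0)

print(via_asfreq)
print(direct)
```

Under that assumption, dropping the `asfreq` step (or passing `fill_value=0` to `asfreq` as well) should make the `(result or 0)` workaround unnecessary.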

【Comments on the question】:

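  • If you're working with time series data, are you using something like TimeScale? This seems to call for a specialized solution, and Pandas is probably the best ad hoc option.
  • @Jason Unfortunately I don't have the time or the product team's approval to make any system changes. Yet.
  • In my view, if you don't have time to implement the necessary specialized changes, you can treat this as an indicator of further problems. IMO, using pandas for this is a hack and shouldn't be considered a good solution.
  • Totally agree, and I'll definitely add this to my body of evidence that we need to make some changes. I did find a solution using Postgres, though.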

Tags: python django pandas postgresql django-rest-framework


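【Solution 1】:

Below is an attempt at a solution with pandas. Basically you can resample and then reindex with a date range, but this is a bit clumsy with a compound index.

Set up the data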

import pandas as pd
data = [    { "date": "2020-01-16T00:00:00.000Z", "result": 3 }, 
            { "date": "2020-01-18T00:00:00.000Z", "result": 2 }, 
            { "date": "2020-01-13T00:00:00Z", "result": 1, "name": "Yes" }, 
            { "date": "2020-01-14T00:00:00Z", "result": 1, "name": "No" }, 
            { "date": "2020-01-16T00:00:00Z", "result": 1, "name": "No" }]

# build dataframe
df = pd.DataFrame(data )
df.name = df.name.fillna("No")
df.date = pd.to_datetime( df.date)

Then process the data

# set up date range
idx = pd.date_range( df.date.min() , df.date.max() , freq="H")

# resample yes/no for name separately
df = df.set_index(["name", "date"]).sort_index()

no = df.loc["No"].resample( rule="60min").sum().reset_index()
no["Name"] = ["No"] * len(no)
no.set_index( ["Name", "date"], inplace=True)

yes = df.loc["Yes"].resample( rule="60min").sum().reset_index()
yes["Name"] = ["Yes"] * len(yes)
yes.set_index( ["Name", "date"], inplace=True)

# reindex with the full date range
yes = yes.reindex(pd.MultiIndex.from_arrays([["Yes"]*len(idx), idx], names=('Name', 'date')), fill_value=0)
no = no.reindex(pd.MultiIndex.from_arrays([["No"]*len(idx), idx], names=('Name', 'date')), fill_value=0)

# merge and create output (dateformat has to be adjusted)
df = pd.concat( [yes, no], axis=0)
df.reset_index().to_dict('records')

Result

[{'Name': 'Yes',
  'date': Timestamp('2020-01-13 00:00:00+0000', tz='UTC'),
  'result': 1},
 {'Name': 'Yes',
  'date': Timestamp('2020-01-13 01:00:00+0000', tz='UTC'),
  'result': 0}, ....
]
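The per-name resampling above can probably be generalized so that it does not hard-code "Yes" and "No". The sketch below is not part of the original answer: it builds the full (date, name) grid with `pd.MultiIndex.from_product` and reindexes once. The function name `zero_fill` and its `start`/`end` parameters (expected as "YYYY-MM-DD" strings) are hypothetical, and it assumes every record carries a `name`.

```python
import pandas as pd

def zero_fill(records, freq="D", start=None, end=None):
    """Return one record per (date, name) pair, with result=0 where no row exists."""
    df = pd.DataFrame(records)
    df["date"] = pd.to_datetime(df["date"], utc=True)
    # Fall back to the data's own bounds when no explicit range is given.
    start = df["date"].min() if start is None else pd.Timestamp(start, tz="UTC")
    end = df["date"].max() if end is None else pd.Timestamp(end, tz="UTC")
    dates = pd.date_range(start, end, freq=freq)
    names = list(df["name"].unique())  # order of first appearance
    # Every combination of date and name, date-major.
    full = pd.MultiIndex.from_product([dates, names], names=["date", "name"])
    out = (
        df.groupby(["date", "name"])["result"].sum()
        .reindex(full, fill_value=0)
        .reset_index()
    )
    out["date"] = out["date"].dt.strftime("%Y-%m-%dT%H:%M:%SZ")
    return out.to_dict("records")

data = [
    {"date": "2020-01-13T00:00:00Z", "result": 1, "name": "Yes"},
    {"date": "2020-01-14T00:00:00Z", "result": 1, "name": "No"},
    {"date": "2020-01-16T00:00:00Z", "result": 1, "name": "No"},
]
rows = zero_fill(data)
```

Passing `start` and `end` covers the case where the user specifies the range, for example `zero_fill(data, start="2020-01-12", end="2020-01-16")`.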

【Comments】:

    【Solution 2】:

    I kept working on a Postgres solution and did find a query that works:

    WITH
    unnested_select AS (
        SELECT unnest(forms_completedformfield.value_text_array) as unnested_array,
               date_trunc('day', created) as created
        FROM forms_completedformfield
        WHERE forms_completedformfield.completed_survey_id =
            ANY(
                ARRAY['815251ac-3891-4206-b876-d17898b74e66'::uuid, '74aea6f5-9860-4fe5-8820-68a279726c83'::uuid, '173ea91f-0dc8-4a6c-b330-7c3cee13e1b4'::uuid]
            )
        GROUP BY unnested_array,
                 created
    ),
    
    range_counts AS (
        SELECT date_trunc('day', unnested_select.created) as date,
               count(unnested_select.unnested_array) as ct,
               unnested_select.unnested_array as ar
        FROM unnested_select
        WHERE unnested_select.unnested_array =
            ANY(
                ARRAY['2b0076f1-7be5-4e52-9879-47e4eeafe175']
            ) 
        GROUP BY unnested_select.unnested_array,
                 unnested_select.created
    ),
    
    range_sums AS (
        SELECT date_trunc('day', unnested_select.created) as date,
               count(unnested_select.unnested_array) as ct
        FROM unnested_select
        GROUP BY unnested_select.created
    ),
    
    range_values AS ( 
        SELECT date_trunc('day', min(created)) as minval,
               date_trunc('day', max(created)) as maxval
        FROM unnested_select
    ),
    
    frequency_range AS (
        SELECT generate_series(minval, maxval, '1 day'::interval) as date
        FROM range_values
    ),
    
    field_options AS (
        SELECT
            DISTINCT unnested_select.unnested_array as ar,
            frequency_range.date
        FROM unnested_select
        CROSS JOIN frequency_range
    )
    
    SELECT  
            frequency_range.date as fd,
            field_options.ar as far,
            range_counts.ar as rar,
            range_counts.ct as ct
    FROM frequency_range
    LEFT OUTER JOIN field_options ON frequency_range.date = field_options.date
    LEFT OUTER JOIN range_counts ON frequency_range.date = range_counts.date and field_options.ar = range_counts.ar
    ORDER BY 
            frequency_range.date
    

    Obviously the hard-coded values in the ARRAYs would be replaced.
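For readers less familiar with SQL, the core idea of the query (CROSS JOIN every date with every option, then LEFT JOIN the actual counts) can be sketched in plain Python. This is only an illustration of the technique, not a replacement for the query; `fill_counts` and its arguments are hypothetical names. Note that with a LEFT OUTER JOIN, unmatched pairs come back with `ct` as NULL, so the zeros asked for in the question would still need something like `coalesce(range_counts.ct, 0)` or a default on the Python side, as below.

```python
from datetime import date, timedelta
from itertools import product

def fill_counts(counts, options, start, end):
    """counts maps (day, option) to a count; returns a row for every pair."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    # product() plays the role of the CROSS JOIN; dict.get(..., 0) plays the
    # role of the LEFT JOIN combined with COALESCE(ct, 0).
    return [
        {"date": d.isoformat(), "name": o, "result": counts.get((d, o), 0)}
        for d, o in product(days, options)
    ]

counts = {
    (date(2020, 1, 13), "Yes"): 1,
    (date(2020, 1, 14), "No"): 1,
    (date(2020, 1, 16), "No"): 1,
}
rows = fill_counts(counts, ["Yes", "No"], date(2020, 1, 13), date(2020, 1, 16))
```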

    【Comments】:
