【Question title】: postgresql, get list of items with nearest point where condition is true
【Posted】: 2016-10-16 18:46:03
【Question】:

EDIT: skip to the last edit for the current state.

Hello!

I have a table of weather stations:

stations:

id,
point, (geometry(Point,4326))
ctry (country code)

and a table with weather data:

noaa:

id                 | integer                     | not null default nextval('noaa_id_seq'::regclass)
usaf_wban          | text                        |
station_id         | integer                     |
usaf               | integer                     |
wban               | integer                     |
dt                 | timestamp without time zone | not null
point              | geometry(Point,4326)        |
air_temp           | double precision            |
dew_point          | double precision            |
relative_humidity  | double precision            |
sea_level_pressure | double precision            |
pressure           | double precision            |
wind               | double precision            |
cloudiness         | double precision            |
ghi                | double precision            |

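There is also a third table, locations_location, which I use to look up the reference point by postal_code and country_code.

I have experimented a lot with indexes; the current indexes on the noaa table are: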

Indexes:
"noaa_pkey" PRIMARY KEY, btree (id)
"noaa_dt_trunc" btree (date_trunc('hour'::text, dt))
"noaa_point" gist (point)
"noaa_station_ids" btree (station_id)

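Now, for each parameter (air_temp, wind, ...), I want to select the value from the nearest point where that parameter is not null and not 9999.

At the moment I use 5 separate queries that look like this: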

 with postal_station AS (
        SELECT id as station_id, s.point FROM stations s WHERE s.ctry = 'AU'
        ORDER BY s.point <-> (
            SELECT point FROM locations_location l
            WHERE l.postal_code = '9201' AND l.country_code = 'AT'
            LIMIT 1
        )
        LIMIT 5
    )
    SELECT
        DISTINCT ON (date_trunc('hour', dt))
        date_trunc('hour', dt) as dt,
        cloudiness
    FROM
        noaa n
    WHERE
        dt BETWEEN '2010-01-01'::timestamp AND '2015-01-01'::timestamp
        AND
        NOT cloudiness = 9999
        AND
        NOT cloudiness is null
        AND
        n.station_id IN (SELECT station_id FROM postal_station)
    ORDER BY dt, point <-> ( SELECT point FROM postal_station LIMIT 1 )

This is fairly fast (~150 ms), and the only index used is noaa_station_ids.

But as soon as I raise the limit on station_ids above 5 (here to 6):

with postal_station AS (
        SELECT id as station_id, s.point FROM stations s WHERE s.ctry = 'AU'
        ORDER BY s.point <-> (
            SELECT point FROM locations_location l
            WHERE l.postal_code = '9201' AND l.country_code = 'AT'
            LIMIT 1
        )
        LIMIT 6
    )
    SELECT
        DISTINCT ON (date_trunc('hour', dt))
        date_trunc('hour', dt) as dt,
        air_temp
    FROM
        noaa n
    WHERE
        dt BETWEEN '2010-01-01'::timestamp AND '2015-01-01'::timestamp
        AND
        NOT air_temp = 9999
        AND
        NOT air_temp is null
        AND
        n.station_id IN (SELECT station_id FROM postal_station)
    ORDER BY dt, point <-> ( SELECT point FROM postal_station LIMIT 1 )

https://explain.depesz.com/s/9n2M

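The index noaa_station_ids is no longer used, and the query takes about 2429 ms.

So here are my questions:

  • Why is the index noaa_station_ids not used when the "n.station_id IN" clause contains more than 5 values?

  • Is there a way to select all needed values in one query in a reasonable time?

Thanks for reading :)

PS: Postgres 9.5 with PostGIS enabled

EDIT: actually, the CTE should look like this to get the points in the correct order, but that is a side issue: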

with postal_point AS (
        SELECT point FROM locations_location l
        WHERE l.postal_code = '9201' AND l.country_code = 'AT'
        LIMIT 1
    ),
    postal_station AS (
        SELECT id as station_id, s.point FROM stations s WHERE s.ctry = 'AU'
        ORDER BY s.point <-> ( SELECT point FROM postal_point )
        LIMIT 5
    )

EDIT: after I joined #postgresql on freenode, RhodiumToad helped me build this query:

with postal_station AS (
        select
            s1.*
        from (
            select point from locations_location l where l.postal_code = '9201' AND l.country_code = 'AT' limit 1
        ) l0,
        lateral (
            select s.id, rank() over (order by s.point <-> l0.point)
            from
            stations s
            where
            s.ctry = 'AU'
        order by s.point <-> l0.point limit 20) s1
    )
    SELECT
        DISTINCT ON (date_trunc('hour', dt))
        date_trunc('hour', dt) as dt,
        air_temp
    FROM
        noaa n
    JOIN
        postal_station p
        ON
        p.id = n.station_id
    WHERE
        dt BETWEEN '2010-01-01'::timestamp AND '2015-01-01'::timestamp
        AND
        NOT air_temp = 9999
        AND
        NOT air_temp is null
    ORDER BY dt, p.rank

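This is fast (~200 ms) even with more stations => https://explain.depesz.com/s/kA8

I will mark this post as answered in a few days.

Optimizations are still welcome.

【Comments】:

  • Note: the definition of the noaa table does not contain the dt, station_id columns. Please add the real table definition to your question.

Tags: postgresql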


【Solution 1】:
1) Why is the index noaa_station_ids not used if the "n.station_id IN" clause contains more than 5 values?

2) Is there a way to select all needed values in one query in reasonable time?

1) After increasing cpu_tuple_cost to 0.1, the index is also used for more stations, but the query still gets slower as the number of stations grows.
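A minimal sketch of how the cost setting from point 1 could be tried out for a single session and the resulting plan compared. The value 0.1 is the one mentioned above (PostgreSQL's default is 0.01); the literal station ids are made-up placeholders standing in for the CTE, not values from the real data.

```sql
-- Show the current value (PostgreSQL's default is 0.01).
SHOW cpu_tuple_cost;

-- Raise it for this session only; other sessions are unaffected.
SET cpu_tuple_cost = 0.1;

-- Re-check the plan; the station ids below are hypothetical placeholders.
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('hour', dt) AS dt, air_temp
FROM noaa
WHERE station_id IN (1, 2, 3, 4, 5, 6)
  AND dt BETWEEN '2010-01-01'::timestamp AND '2015-01-01'::timestamp;

-- Return to the configured value afterwards.
RESET cpu_tuple_cost;
```

Comparing the EXPLAIN output before and after the SET shows whether the planner switches back to noaa_station_ids for the larger IN list.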

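2) At the moment I use 5 queries and send them all at once to get all the needed data. With the query from the last edit, the query times are fine.

About the query:

The key is to rank the stations in the CTE and then join against the CTE. Sorting is much faster this way.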
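For the second question, one possible way to fetch several parameters in a single pass is to keep the ranked CTE and replace DISTINCT ON with ordered aggregates. This is only a sketch, not the approach actually used here (which stays with 5 queries): it assumes the FILTER clause (available since 9.4), it has not been run against the real data, and whether it is fast enough would have to be measured. Unlike the DISTINCT ON version, an hour with no valid value for a parameter yields NULL in that column instead of being dropped.

```sql
WITH postal_station AS (
    SELECT s1.*
    FROM (
        SELECT point FROM locations_location l
        WHERE l.postal_code = '9201' AND l.country_code = 'AT'
        LIMIT 1
    ) l0,
    LATERAL (
        -- Rank stations by distance to the reference point.
        SELECT s.id, rank() OVER (ORDER BY s.point <-> l0.point) AS rank
        FROM stations s
        WHERE s.ctry = 'AU'
        ORDER BY s.point <-> l0.point
        LIMIT 20
    ) s1
)
SELECT
    date_trunc('hour', n.dt) AS dt,
    -- Per hour and parameter: drop invalid values, order the rest by
    -- station rank, and take the first (nearest valid) element.
    (array_agg(n.air_temp ORDER BY p.rank)
        FILTER (WHERE n.air_temp IS NOT NULL AND n.air_temp <> 9999))[1] AS air_temp,
    (array_agg(n.wind ORDER BY p.rank)
        FILTER (WHERE n.wind IS NOT NULL AND n.wind <> 9999))[1] AS wind,
    (array_agg(n.cloudiness ORDER BY p.rank)
        FILTER (WHERE n.cloudiness IS NOT NULL AND n.cloudiness <> 9999))[1] AS cloudiness
FROM noaa n
JOIN postal_station p ON p.id = n.station_id
WHERE n.dt BETWEEN '2010-01-01'::timestamp AND '2015-01-01'::timestamp
GROUP BY date_trunc('hour', n.dt)
ORDER BY 1;
```

The remaining parameters (dew_point, pressure, ghi, ...) would follow the same pattern, one aggregate per column.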
