[Posted]: 2016-10-16 18:46:03
[Question]:
Edit: skip to the last edit for the current state.
Hello!
I have a table with weather stations
stations:
id,
point, (geometry(Point,4326))
ctry (country code)
and a table with weather data:
noaa:
id | integer | not null default nextval('noaa_id_seq'::regclass)
usaf_wban | text |
station_id | integer |
usaf | integer |
wban | integer |
dt | timestamp without time zone | not null
point | geometry(Point,4326) |
air_temp | double precision |
dew_point | double precision |
relative_humidity | double precision |
sea_level_pressure | double precision |
pressure | double precision |
wind | double precision |
cloudiness | double precision |
ghi | double precision |
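There is also another table, locations_location, from which I get the point for a postal code.
I have experimented a lot with indexes; the current indexes on the noaa table are: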
Indexes:
"noaa_pkey" PRIMARY KEY, btree (id)
"noaa_dt_trunc" btree (date_trunc('hour'::text, dt))
"noaa_point" gist (point)
"noaa_station_ids" btree (station_id)
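Now I want to select, for each parameter (air_temp, wind, ...), the nearest point where that parameter is not null and not 9999.
At the moment I use 5 separate queries like the following: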
WITH postal_station AS (
    SELECT id AS station_id, s.point
    FROM stations s
    WHERE s.ctry = 'AU'
    ORDER BY s.point <-> (
        SELECT point FROM locations_location l
        WHERE l.postal_code = '9201' AND l.country_code = 'AT'
        LIMIT 1
    )
    LIMIT 5
)
SELECT DISTINCT ON (date_trunc('hour', dt))
    date_trunc('hour', dt) AS dt,
    cloudiness
FROM noaa n
WHERE dt BETWEEN '2010-01-01'::timestamp AND '2015-01-01'::timestamp
  AND NOT cloudiness = 9999
  AND NOT cloudiness IS NULL
  AND n.station_id IN (SELECT station_id FROM postal_station)
ORDER BY dt, point <-> (SELECT point FROM postal_station LIMIT 1)
This is fairly fast at ~150 ms, and the only index used is noaa_station_ids.
But as soon as I raise the limit on the number of stations above 5:
WITH postal_station AS (
    SELECT id AS station_id, s.point
    FROM stations s
    WHERE s.ctry = 'AU'
    ORDER BY s.point <-> (
        SELECT point FROM locations_location l
        WHERE l.postal_code = '9201' AND l.country_code = 'AT'
        LIMIT 1
    )
    LIMIT 6
)
SELECT DISTINCT ON (date_trunc('hour', dt))
    date_trunc('hour', dt) AS dt,
    air_temp
FROM noaa n
WHERE dt BETWEEN '2010-01-01'::timestamp AND '2015-01-01'::timestamp
  AND NOT air_temp = 9999
  AND NOT air_temp IS NULL
  AND n.station_id IN (SELECT station_id FROM postal_station)
ORDER BY dt, point <-> (SELECT point FROM postal_station LIMIT 1)
https://explain.depesz.com/s/9n2M
the index noaa_station_ids is no longer used and the query takes about 2429 ms.
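For anyone who wants to dig into this, a diagnostic along the following lines might help to compare the two plans. It is only a sketch: the station ids in the IN list are made-up placeholders standing in for the contents of postal_station, and nothing here has been measured on the real data. The PostgreSQL planner is cost-based, so looking at estimated versus actual row counts is usually the first step in finding out why it stops choosing an index.

```sql
-- Diagnostic sketch: show the chosen plan together with estimated and actual row counts.
-- The ids 1..6 are hypothetical placeholders, not real station ids.
EXPLAIN (ANALYZE, BUFFERS)
SELECT date_trunc('hour', n.dt) AS dt, n.air_temp
FROM noaa n
WHERE n.station_id IN (1, 2, 3, 4, 5, 6)
  AND n.dt BETWEEN '2010-01-01'::timestamp AND '2015-01-01'::timestamp;

-- Session-local experiment: discourage sequential scans, re-run the EXPLAIN above,
-- and compare the cost and run time of the index-based plan with the original plan.
SET enable_seqscan = off;

-- Restore the default afterwards so that other queries in the session are unaffected.
RESET enable_seqscan;
```

If the plan with enable_seqscan = off turns out faster, that would suggest the planner's cost estimate for the index scan is too pessimistic, rather than the index being unusable.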
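So here are my questions:
Why is the index noaa_station_ids not used when the "n.station_id IN" clause contains more than 5 values?
Is there a way to select all the needed values in a single query in a reasonable amount of time?
Thanks for reading :)
PS: Postgres 9.5 with PostGIS enabled
Edit: actually the CTE should look like this to get the points in the correct order, but that is a side issue: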
WITH postal_point AS (
    SELECT point FROM locations_location l
    WHERE l.postal_code = '9201' AND l.country_code = 'AT'
    LIMIT 1
),
postal_station AS (
    SELECT id AS station_id, s.point
    FROM stations s
    WHERE s.ctry = 'AU'
    ORDER BY s.point <-> (SELECT point FROM postal_point)
    LIMIT 5
)
Edit: after I joined #postgresql on freenode, RhodiumToad helped me build this query:
WITH postal_station AS (
    SELECT s1.*
    FROM (
        SELECT point FROM locations_location l
        WHERE l.postal_code = '9201' AND l.country_code = 'AT'
        LIMIT 1
    ) l0,
    LATERAL (
        SELECT s.id, rank() OVER (ORDER BY s.point <-> l0.point)
        FROM stations s
        WHERE s.ctry = 'AU'
        ORDER BY s.point <-> l0.point
        LIMIT 20
    ) s1
)
SELECT DISTINCT ON (date_trunc('hour', dt))
    date_trunc('hour', dt) AS dt,
    air_temp
FROM noaa n
JOIN postal_station p ON p.id = n.station_id
WHERE dt BETWEEN '2010-01-01'::timestamp AND '2015-01-01'::timestamp
  AND NOT air_temp = 9999
  AND NOT air_temp IS NULL
ORDER BY dt, p.rank
This is fast, about 200 ms, even with more stations => https://explain.depesz.com/s/kA8
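As for getting all parameters in one query, one possible direction is sketched below. It is untested against the real data and is not the query from the IRC session: it swaps DISTINCT ON for a GROUP BY per hour with ordered, filtered aggregates, so that each parameter independently takes the first valid reading from the nearest-ranked station. Only air_temp, wind and cloudiness are shown; any further columns would follow the same pattern.

```sql
-- Sketch only: one pass over noaa for several parameters at once.
WITH postal_station AS (
    SELECT s1.*
    FROM (
        SELECT point FROM locations_location l
        WHERE l.postal_code = '9201' AND l.country_code = 'AT'
        LIMIT 1
    ) l0,
    LATERAL (
        SELECT s.id, rank() OVER (ORDER BY s.point <-> l0.point)
        FROM stations s
        WHERE s.ctry = 'AU'
        ORDER BY s.point <-> l0.point
        LIMIT 20
    ) s1
)
SELECT
    date_trunc('hour', n.dt) AS dt,
    -- Per parameter: collect the hour's valid readings ordered by station rank
    -- (nearest first) and keep the first element of the resulting array.
    -- "x <> 9999" is not true for NULL either, so NULL readings are dropped as well.
    (array_agg(n.air_temp   ORDER BY p.rank) FILTER (WHERE n.air_temp   <> 9999))[1] AS air_temp,
    (array_agg(n.wind       ORDER BY p.rank) FILTER (WHERE n.wind       <> 9999))[1] AS wind,
    (array_agg(n.cloudiness ORDER BY p.rank) FILTER (WHERE n.cloudiness <> 9999))[1] AS cloudiness
FROM noaa n
JOIN postal_station p ON p.id = n.station_id
WHERE n.dt BETWEEN '2010-01-01'::timestamp AND '2015-01-01'::timestamp
GROUP BY date_trunc('hour', n.dt)
ORDER BY 1;
```

Unlike the single-parameter queries, an hour with no valid reading for a parameter still shows up here, with NULL in that column. As with DISTINCT ON, if the nearest station has several readings within the same hour, which of them is picked is arbitrary.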
I will mark this post as answered in a few days.
Optimizations are still welcome.
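One optimization that might be worth trying, purely as an assumption and not something measured here, is a composite index covering both the station filter and the time range, since every query above restricts noaa by station_id and by a dt interval. The index name below is made up.

```sql
-- Hypothetical index (name is an assumption); untested on the real table.
-- station_id first because it is matched by equality, dt second for the range condition.
CREATE INDEX noaa_station_id_dt ON noaa (station_id, dt);

-- Refresh planner statistics so the new index is costed with current row estimates.
ANALYZE noaa;
```

Whether the planner actually prefers this index over noaa_station_ids would still have to be checked with EXPLAIN ANALYZE on the real data.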
[Comments]:
- Note: the definition of the noaa table does not contain the dt and station_id columns. Please add the real table definition to your question.
Tags: postgresql