【Question Title】: How to enforce random selection of rows from each of the different countries/cities in PostgreSQL?
【Posted】: 2022-01-07 20:23:43
【Question Description】:

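I am working with PostgreSQL in DBeaver. The database has a column addr:country and a column addr:city. The data has about 500 million rows, so I have to take a random sample for testing. I planned to randomly select 1% of the data. However, the data itself may be heavily skewed (there are large and small countries, so large countries have more rows and small countries have fewer), and I am looking for a fair way to sample. So I want to randomly select one or two rows from each city in each country.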

The script I am using is adapted from someone else's query. My script is:

SELECT osm_id, way, tags, way_centroid, way_area, calc_way_area, area_diff, area_prct_diff, calc_perimeter, calc_count_vertices, building, "building:part", "type", amenity, landuse, tourism, office, leisure, man_made, "addr:flat", "addr:housename", "addr:housenumber", "addr:interpolation", "addr:street", "addr:city", "addr:postcode", "addr:country", length, width, height, osm_uid, osm_user, osm_version
    ROW_NUMBER() OVER ( PARTITION BY "addr:country", "addr:city" ) AS "cell_rn",
    COUNT(*)
    OVER ( PARTITION BY "addr:country", "addr:city") AS "cell_cnt"
FROM osm_qa.buildings
WHERE "addr:city" IS NOT NULL
AND "addr:country" IS NOT NULL

It returns the error message: SQL Error [42601]: ERROR: syntax error at or near "(" Position: 1683

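I am very new to SQL, so there may be many mistakes in the script. Is there a way to enforce random selection of one or two rows from each addr:city within each addr:country?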

【Question Discussion】:

  • You are missing a comma after osm_version.

Tags: sql postgresql postgis


【Solution 1】:

You can use the window function dense_rank() to number the records within each partition in random order:

with base_data as 
(
SELECT osm_id, way, tags, way_centroid, way_area, calc_way_area, area_diff, area_prct_diff, calc_perimeter, calc_count_vertices, building, "building:part", "type", amenity, landuse, tourism, office, leisure, man_made, "addr:flat", "addr:housename", "addr:housenumber", "addr:interpolation", "addr:street", "addr:city", "addr:postcode", "addr:country", length, width, height, osm_uid, osm_user, osm_version,
    ROW_NUMBER() OVER ( PARTITION BY "addr:country", "addr:city" ) AS "cell_rn",
    COUNT(*) OVER ( PARTITION BY "addr:country", "addr:city") AS "cell_cnt",
    dense_rank() over (partition by "addr:country", "addr:city" order by random()) as ranking
FROM osm_qa.buildings
WHERE "addr:city" IS NOT NULL
AND "addr:country" IS NOT null
)
select 
*
from base_data
where ranking between 1 and 2
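
The same idea can be tried on a small scale before running it against 500 million rows. The following is only a minimal sketch under stated assumptions: `buildings_demo` and its sample rows are hypothetical stand-ins for osm_qa.buildings, and `ROW_NUMBER()` is swapped in for `dense_rank()` because it assigns a distinct number to every row in a partition, so the filter can never return more than two rows per city even if two `random()` values happened to tie.

```sql
-- Hypothetical demo table; the real table in the question is osm_qa.buildings.
CREATE TEMP TABLE buildings_demo (
    osm_id         bigint,
    "addr:country" text,
    "addr:city"    text
);

-- Made-up sample rows, for illustration only.
INSERT INTO buildings_demo VALUES
    (1, 'DE', 'Berlin'),
    (2, 'DE', 'Berlin'),
    (3, 'DE', 'Berlin'),
    (4, 'DE', 'Munich'),
    (5, 'FR', 'Paris'),
    (6, 'FR', 'Paris'),
    (7, NULL, 'Nowhere');   -- excluded by the NOT NULL filter below

WITH ranked AS (
    SELECT b.*,
           -- Shuffle the rows inside each (country, city) cell and number them.
           ROW_NUMBER() OVER (
               PARTITION BY "addr:country", "addr:city"
               ORDER BY random()
           ) AS rn
    FROM buildings_demo AS b
    WHERE "addr:country" IS NOT NULL
      AND "addr:city" IS NOT NULL
)
SELECT osm_id, "addr:country", "addr:city"
FROM ranked
WHERE rn <= 2;   -- keep at most two random rows per city
```

With this sample data the query returns five rows: two of the three Berlin rows (which two changes from run to run), the single Munich row, and both Paris rows. Note that this approach samples a fixed number of rows per city rather than 1% of the table, so the size of the result depends on how many distinct country/city pairs exist.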

