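【Posted】: 2019-09-02 12:08:26
【Question】:
I am using AWS Athena and trying to merge all rows whose value in a specific column has a levenshtein_distance below 5, summing their normalized percentages.
The table structure is as follows: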
CREATE EXTERNAL TABLE `actions`(
  `id` string COMMENT 'from deserializer',
  `text` string COMMENT 'from deserializer',
  `normalizedpercentage` float COMMENT 'from deserializer',
  `timestamp` timestamp COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
  's3://xxxxxx/db/actions'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'transient_lastDdlTime'='1566991410')
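This is what I would like to do (EVERY_OTHER_TEXT_COLUMN is a placeholder for the text value of every other row):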
WITH t AS (
    SELECT id,
           text,
           normalizedPercentage
    FROM actions
    WHERE actions.timestamp
        BETWEEN timestamp '2019-08-01 00:00:01'
        AND timestamp '2019-08-31 23:59:59'
)
SELECT *,
       SUM(normalizedPercentage)
           OVER (PARTITION BY levenshtein_distance(text, EVERY_OTHER_TEXT_COLUMN) < 5) AS cumulative
FROM t
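Unfortunately, this does not work: as far as I can tell, the PARTITION BY clause only accepts column names.
I was considering defining a function and using it to iterate over all rows, but that does not seem to be possible in Presto.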
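For reference, the result I am after could perhaps be written as a self-join instead of a window function, since a partition expression is evaluated once per row and so cannot compare a row against the other rows. The following is only a minimal sketch, not a tested solution. It assumes that `id` is unique per row and that the filtered set is small enough for a pairwise comparison, because every row is compared against every other row. The aliases `a` and `b` are only illustrative.

```sql
-- Sketch: for each row, sum normalizedpercentage over all rows
-- whose text is within an edit distance of 5 (the row itself included,
-- since its distance to itself is 0).
WITH t AS (
    SELECT id,
           text,
           normalizedpercentage
    FROM actions
    WHERE actions.timestamp
        BETWEEN timestamp '2019-08-01 00:00:01'
        AND timestamp '2019-08-31 23:59:59'
)
SELECT a.id,
       a.text,
       a.normalizedpercentage,
       SUM(b.normalizedpercentage) AS cumulative
FROM t a
JOIN t b
    -- Pairwise comparison: cost grows quadratically with the row count.
    ON levenshtein_distance(a.text, b.text) < 5
-- Assumption: id is unique, so grouping on it keeps one output row per input row.
GROUP BY a.id, a.text, a.normalizedpercentage
```

Note that "distance below 5" is not transitive, so this yields a per-row neighbourhood sum and does not split the rows into disjoint groups.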
【Comments】:
Tags: sql presto amazon-athena